
Chapter 19
Machine Learning

This chapter presents machine learning PRs available in GATE. Currently, two PRs are available:

The Batch Learning PR (in the Learning plugin) is GATE’s most comprehensive and developed machine learning offering. It is specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. It integrates LibSVM for improved speed, along with the PAUM algorithm, offering very competitive performance and speed. It also offers a Weka interface. It is documented in Section 19.2.
The Machine Learning PR (in the Machine_Learning plugin) is GATE’s older machine learning oﬀering. It oﬀers wrappers for Maxent, Weka and SVM Light. It is documented in Section 19.3.

To use GATE in conjunction with machine learning technologies that are not supported by the two PRs described here, you would need to export your data from GATE to use with the ML technology outside of GATE. One possibility for doing that would be to use the Conﬁgurable Exporter PR described in Section 23.14. The Batch Learning PR also oﬀers data export functionality.

The rest of the chapter is organised as follows. Section 19.1 introduces machine learning in general, focusing on the terminology used and the meaning of the terms within GATE. We then move on to describe the two Machine Learning processing resources, beginning with the Batch Learning PR in Section 19.2. Section 19.2.1 describes all the conﬁguration settings of the Batch Learning PR one by one; i.e. all the elements in the conﬁguration ﬁle for setting the Batch Learning PR (the learning algorithm to be used and the options for learning) and deﬁning the NLP features for the problem. Section 19.2.2 presents three case studies with example conﬁguration ﬁles for the three types of NLP learning problems. Section 19.2.3 lists the steps involved in using the Batch Learning PR. Finally, Section 19.2.4 explains the outputs of the Batch Learning PR for the four usage modes; namely training, application, evaluation and producing feature ﬁles only, and in particular, the format of the feature ﬁles and label list ﬁle produced by the Batch Learning PR. Section 19.3 outlines the original Machine Learning PR in GATE.

19.1 ML Generalities

There are two main types of ML: supervised learning and unsupervised learning. Supervised learning is more effective and much more widely used in NLP. Classification is a particular example of supervised learning, in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute new examples into the existing classes. This is the type of ML that is used in GATE, and all further references to ML actually refer to classification.

An ML algorithm ‘learns’ about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (unseen) examples of the phenomenon.

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. a statistical model, a decision tree, a rule set, etc.) from a dataset of already classiﬁed instances. During application, the model built during training is used to classify new instances.

Machine Learning in NLP falls broadly into three categories of task type: text classification, chunk recognition, and relation extraction.

Text classiﬁcation classiﬁes text into pre-deﬁned categories. The process can be equally well applied at the document, sentence or token level. Typical examples of text classiﬁcation might be document classiﬁcation, opinionated sentence recognition, POS tagging of tokens and word sense disambiguation.
Chunk recognition often consists of two steps. First, it identiﬁes the chunks of interest in the text. It then assigns a label or labels to these chunks. However some problems comprise simply the ﬁrst step; identifying the relevant chunks. Examples of chunk recognition include named entity recognition (and more generally, information extraction), NP chunking and Chinese word segmentation.
Relation extraction determines whether or not a pair of terms in the text has some type(s) of pre-deﬁned relations. Two examples are named entity relation extraction and co-reference resolution.

Typically, the three types of NLP learning use different linguistic features and feature representations. For example, it has been recognised that for text classification the so-called tf-idf representation of n-grams is very effective (e.g. with SVM). For chunk recognition, identifying the start token and the end token of the chunk by using the linguistic features of the token itself and the surrounding tokens is effective and efficient. Relation extraction benefits from both the linguistic features from each of the two terms involved in the relation and the features of the two terms combined.
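As a rough illustration of the tf-idf weighting mentioned above, the following Python sketch computes one common variant (raw term frequency multiplied by the logarithm of the inverse document frequency) over tokenised documents. The function name `tf_idf` and the toy documents are assumptions made for this example; the weighting actually used by the Batch Learning PR may differ in detail.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Weight = raw term count * log(N / document frequency). This is only
    one of several common tf-idf variants, chosen for illustration.
    """
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


# Hypothetical toy corpus of two tokenised documents.
docs = [["gate", "learning", "gate"], ["learning", "svm"]]
vectors = tf_idf(docs)
```

A term that occurs in every document (here "learning") receives a weight of zero, which is why the representation emphasises terms that discriminate between documents.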

The rest of this section explains some basic deﬁnitions in ML and their speciﬁcation in the ML plugin.

19.1.1 Some Definitions

instance: an example of the studied phenomenon. An ML algorithm learns a model from a set of known instances, called a (training) dataset. It can then apply the learned model to another (application) dataset.
attribute: a characteristic of the instances. Each instance is deﬁned by the values of its attributes. The set of possible attributes is well deﬁned and is the same for all instances in the training and application datasets. ‘Feature’ is also often used. However, in this context, this can cause confusion with GATE annotation features.
class: an attribute for which the values are available in the training dataset for learning, but which are not present in the application dataset. ML is used to ﬁnd the value of this attribute in the application dataset.

19.1.2 GATE-Specific Interpretation of the Above Definitions

instance: an annotation. In order to use ML in GATE, users will need to choose the type of annotations used as instances. Token annotations are a good candidate for many NLP learning tasks such as information extraction and POS tagging, but any type of annotation could be used (e.g. things that were found by a previously run JAPE grammar, such as sentence annotations and document annotations for sentence and document classiﬁcation respectively).
attribute: an attribute is the value of a named feature of a particular annotation type, which can either (partially) cover the instance annotation considered or another instance annotation which is related to the instance annotation considered. The value of the attribute can refer to the current instance or to an instance either situated at a specified location relative to the current instance or having a special relation with the current instance.
class: any attribute referring to the current instance can be marked as the class attribute.

19.2 Batch Learning PR

This section describes the newest machine learning PR in GATE. The implementation focuses on the three main types of learning in NLP, namely chunk recognition (e.g. named entity recognition), text classiﬁcation and relation extraction. The implementation for chunk recognition is based on our work using support vector machines (SVM) for information extraction [Li et al. 05c]. The text classiﬁcation is based on our work on opinionated sentence classiﬁcation and patent document classiﬁcation (see [Li et al. 07c] and [Li et al. 07d], respectively). The relation extraction is based on our work on named entity relation extraction [Wang et al. 06].

The Batch Learning PR, given a set of documents, can also produce feature ﬁles, containing linguistic features and feature vectors, and labels if there are any in the documents. It can also produce document-term matrices and n-gram based language models. Feature ﬁles are in text format and can be used outside of GATE. Hence, users can use GATE-produced feature ﬁles oﬀ-line, for their own purpose, e.g. evaluating new learning algorithms.

The PR also provides facilities for active learning, based on support vector machines (SVM), mainly ranking the unlabelled documents according to the conﬁdence scores of the current SVM models for those documents.

The primary learning algorithm implemented is SVM, which has achieved state-of-the-art performance for many NLP learning tasks. The training of SVM uses a Java version of the SVM package LibSVM [Chang & Lin 11]. The application of SVM models is our own implementation. The PAUM (Perceptron Algorithm with Uneven Margins) is also included [Li et al. 02], and on our test datasets has consistently produced performance rivalling that of the SVM with much reduced training times. Moreover, the ML implementation provides an interface to the open-source machine learning package Weka [Witten & Frank 99], and can use machine learning algorithms implemented in Weka. Three widely-used learning algorithms are available in the current implementation: Naive Bayes, KNN and the C4.5 decision tree algorithm.

Access to ML implementations is provided in GATE by the ‘Batch Learning PR’ (in the ‘learning’ plugin). The PR handles training and application of an ML model, evaluation of learning on GATE documents, producing feature ﬁles and ranking documents for Active Learning. It also makes it possible to view the primal forms of a linear SVM. This PR is a Language Analyser so it can be used in all default types of GATE controllers.

In order to use the Batch Learning processing resource, the user has to do three things. First, the user has to annotate some training documents with the labels that s/he wants the learning system to annotate in new documents. Those label annotations should be GATE annotations. Secondly, the user may need to pre-process the documents to obtain linguistic features for the learning. Again, these features should be in the form of GATE annotations. GATE’s plugin ANNIE might be helpful for producing the linguistic features. Other resources such as the NP Chunker and parser may also be helpful. By providing the machine learning algorithm with more and better information on which to base learning, chances of a good result are increased, so this preprocessing stage is important. Finally the user has to create a conﬁguration ﬁle for setting the ML PR, e.g. selecting the learning algorithm and deﬁning the linguistic features used in learning. Three example conﬁguration ﬁles are presented in this section; it might be helpful to take one of them as a starting point and modify it.

19.2.1 Batch Learning PR Configuration File Settings

In order to allow for more ﬂexibility, all conﬁguration parameters for the PR are set through one external XML ﬁle, except for the learning mode, which is selected through normal PR parameterisation. The XML ﬁle contains both the conﬁguration parameters of the Batch Learning PR itself and of the linguistic data (namely the deﬁnitions of the instance and attributes) used by the Batch Learning PR. The XML ﬁle is speciﬁed when creating a new Batch Learning PR.

The parent directory of the XML conﬁguration ﬁle becomes the working directory. A subdirectory in the working directory, named ‘savedFiles’, will be created (if it does not already exist). All the ﬁles produced by the Batch Learning PR, including the NLP features ﬁles, label list ﬁle, feature vector ﬁle and learned model ﬁle, will be stored in that subdirectory. A log ﬁle recording the learning session is also created in this directory.

Below, we ﬁrst describe the parameters of the Batch Learning PR. Then we explain those settings speciﬁed in the conﬁguration ﬁle.

PR Parameters: Settings not Specified in the Configuration File

For the sake of convenience, a few settings are not speciﬁed in the conﬁguration ﬁle. Instead the user should specify them as initialization or run-time parameters of the PR, as in other PRs.

URL (or path and name) of the configuration file. The user is required to give the URL of the configuration file when creating the PR. The configuration file should be in XML format with the extension name .xml. It contains most of the learning settings and will be explained in detail in the next subsection.
Corpus. This is a run-time parameter, meaning that the user should specify it after creating the PR, and may change it between runs. The corpus contains the documents that the PR will use as learning data (training or application). For application, the documents should include all the annotations specified in the configuration file, except the class attribute. The annotations for the class attribute should be available in the documents used for training or evaluation.
inputASName is the annotation set containing the annotations for the linguistic features to be used and the class labels.
outputASName is the annotation set in which the results of applying the models will be put. Note that it should be set the same as the inputASName when doing the evaluation (i.e. setting the learningMode as ‘EVALUATION’).
learningMode is a run-time parameter. It can be set as one of the following values, ‘TRAINING’, ‘APPLICATION’, ‘EVALUATION’, ‘ProduceFeatureFilesOnly’, ‘MITRAINING’, ‘VIEWPRIMALFORMMODELS’ and ‘RankingDocsForAL’. The default learning mode is ‘TRAINING’.
- In TRAINING mode, the PR learns from the data provided and saves the models into a ﬁle called ‘learnedModels.save’ under the sub-directory ‘savedFiles’ of the working directory.
- If the user wants to apply the learned model to the data, s/he should select APPLICATION mode. In application mode, the PR reads the learned model from the ﬁle ‘learnedModels.save’ in the subdirectory ‘savedFiles’ and then applies the model to the data.
- In EVALUATION mode, the PR will do k-fold or hold-out test set evaluation on the corpus provided (the method of the evaluation is speciﬁed in the conﬁguration ﬁle, see below), and output the evaluation results to the messages window of GATE Developer, or standard out when using GATE Embedded, and into the log ﬁle. When using evaluation mode, please make sure that the outputASName is set to the same annotation set as the inputASName.
- If the user only wants to produce feature data and feature vectors but does not want to train or apply a model, s/he may select the ProduceFeatureFilesOnly mode. The feature ﬁles that the PR produces will be explained in detail in Section 19.2.4.
- In MITRAINING (mixed initiative training) mode, the training data are appended to the end of any existing feature ﬁle. In contrast, in training mode, the training data created in the current session overwrite any existing feature ﬁle. Consequently, mixed initiative training mode uses both the training data obtained in this session and the data that existed in the feature ﬁle before starting the session. Hence, training mode is for batch learning, while mixed initiative training mode can be used for on-line (or adaptive, or mixed-initiative) learning. There is one parameter for mixed initiative training mode specifying the minimal number of newly added documents before starting the learning procedure to update the learned model. The parameter can be deﬁned in the conﬁguration ﬁle.
- VIEWPRIMALFORMMODELS mode is used for displaying the most salient NLP features in the learned models. In the current implementation, the mode is only valid with the linear SVM model, in which the most salient NLP features correspond to the biggest (absolute values of) weights in the weight vector. In the conﬁguration ﬁle one can specify two parameters to determine the number of displayed NLP features for positive and negative weights. Note that if e.g. the number for negative weight is set as 0, then no NLP feature is displayed for negative weights.
- RankingDocsForAL applies the current learned SVM models (in the sub-directory ‘savedFiles’) to the feature vectors stored in the ﬁle ‘fvsDataSelecting.save’ in the sub-directory ‘savedFiles’ and ranks the documents according to the margins of the examples in one document to the SVM models. The ranked list of documents will be put into the ﬁle ‘ALRankedDocs.save’.
runProtocolDir: if speciﬁed, the URL of a directory where a protocol ﬁle for an evaluation run (EVALUATION mode) will be placed. The protocol ﬁle is an XML ﬁle that contains the results of the evaluation run plus the original conﬁg ﬁle used for the evaluation. The name of the protocol ﬁle is derived from the ﬁle name of the conﬁguration ﬁle by replacing the xml extension with _evaluation_YYYYMMDD_HHMMSS.xml where YYYYMMDD is the date and HHMMSS is the time the evaluation was run.
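The naming rule for the protocol file described under runProtocolDir can be sketched in Python as follows. The helper name `protocol_file_name` and the example configuration file name are hypothetical, and the sketch only mirrors the naming pattern stated above; it is not taken from the PR's source code.

```python
from datetime import datetime
from pathlib import PurePosixPath


def protocol_file_name(config_file, run_time):
    """Derive the evaluation protocol file name from the config file name.

    The '.xml' extension is replaced with
    '_evaluation_YYYYMMDD_HHMMSS.xml', as described in the text.
    """
    stem = PurePosixPath(config_file).stem  # file name without '.xml'
    return f"{stem}_evaluation_{run_time:%Y%m%d_%H%M%S}.xml"


# Hypothetical configuration file and evaluation time.
name = protocol_file_name("engines-svm.xml", datetime(2015, 3, 9, 14, 5, 30))
print(name)  # engines-svm_evaluation_20150309_140530.xml
```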

In most cases it is not safe to run more than one instance of the batch learning PR with the same working directory at the same time, because the PR needs to update the model (in TRAINING, MITRAINING or EVALUATION mode) or other data ﬁles. It is safe to run multiple instances at once provided they are all in APPLICATION mode¹.

Order of document processing

In the usual case, in a GATE corpus pipeline application, documents are processed one at a time, and each PR is applied in turn to the document, processing it fully, before moving on to the next document. The Batch Learning PR breaks from this rule. ML training algorithms, including SVM, typically run as a batch process over a training set, and require all the data to be fully prepared and passed to the algorithm in one go. This means that in training (or evaluation) mode, the Batch Learning PR will wait for all the documents to be processed and will then run as a single operation at the end. Therefore, the Batch Learning PR needs to be positioned last in the pipeline. Post-processing cannot be done within the pipeline after the Batch Learning PR. Where further processing needs to be done, this should take the form of a separate application, and be applied to the data afterwards.

There is an exception to the above, however. In application mode, the situation is slightly diﬀerent, since the ML model has already been created, and the PR only applies it to the data. This can be done on a document by document basis, in the manner of a normal PR. However, although it can be done document by document, there may be advantages in terms of eﬃciency to grouping documents into batches before applying the algorithm. A parameter in the conﬁguration ﬁle, BATCH-APP-INTERVAL, described later, allows the user to specify the size of such batches, and by default this is set to 1; in other words, by default, the Batch Learning PR in application mode behaves like a normal PR and processes each document separately. There may be substantial eﬃciency gains to be had through increasing this parameter (although higher values require more memory consumption), but if the Batch Learning PR is applied in application mode and the parameter BATCH-APP-INTERVAL is set to 1, the PR can be treated like any other, and other PRs may be positioned after it in a pipeline.
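The effect of BATCH-APP-INTERVAL can be pictured with a small Python sketch that groups a corpus into batches of a given size. The function name `batches` is hypothetical, and emitting a shorter final batch is an assumption of this sketch, since the text does not say how a remainder is handled.

```python
def batches(documents, interval=1):
    """Yield consecutive groups of at most `interval` documents."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    for start in range(0, len(documents), interval):
        yield documents[start:start + interval]


# Hypothetical corpus of five documents.
corpus = ["doc1", "doc2", "doc3", "doc4", "doc5"]

# Default interval of 1: each document is processed on its own,
# so the PR behaves like any other PR in the pipeline.
per_document = list(batches(corpus))

# A larger interval groups documents, trading memory for efficiency.
in_threes = list(batches(corpus, interval=3))
```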

Settings in the Batch Learning PR XML Configuration File

The root element of the XML configuration file needs to be called ‘ML-CONFIG’, and it must contain two basic elements, DATASET and ENGINE, and optionally other settings. In the following, we first describe the optional settings, then the ENGINE element, and finally the DATASET element. In the next section, some examples of the XML configuration file are given for illustration. Please also refer to the configuration files in the test directory (i.e. plugins/learning/test/ under the main gate directory) for more examples.

Optional Settings in the Configuration File

The Batch Learning PR provides a variety of optional settings, which facilitate different tasks. Every optional setting has a default value; if an optional setting is not specified in the configuration file, the Batch Learning PR will adopt its default value. Each of the following optional settings can be set as an element in the XML configuration file.

SURROUND should be set to ‘true’ if the user wants the Batch Learning PR to learn chunks by identifying the start token and the end token of the chunk. This approach to chunk learning, for example, named entity recognition, where a span of several tokens is to be identiﬁed, often produces better results than trying to learn every token in the chunk. For classiﬁcation problems and relation extraction, set its value as ‘false’. This element appears in the conﬁguration ﬁle as:
<SURROUND VALUE="X"/>
where the variable X has two possible values: ‘true’ or ‘false’. The default value is ‘false’.
FILTERING relates to SVM training. Where the ratio of positive examples to negative examples is low, i.e. the instances belonging in the class are much outweighed by instances outside of the class (e.g. when ‘one against others’ is used, see multiClassification2Binary below), SVMs can run into difficulties. The positive examples may be swamped by outlying negative examples. The ML plugin provides functionality developed through research (e.g. [Li & Bontcheva 08]) to assist in such cases. One example is the FILTERING parameter. The filtering functionality performs initial SVM training, then removes negative examples on the basis of their position relative to the separator. It then retrains on the smaller dataset. Typically, negative instances close to the boundary are removed. Note that this two-step process takes longer than simple training. However, the second training step will be quicker than the first, as it is performed on a somewhat reduced dataset. If the item dis is set as ‘near’, the PR selects and removes those negative examples which are closest to the SVM hyper-plane. If it is set as ‘far’, those negative examples that are furthest from the SVM hyper-plane are removed. The value of the item ratio determines what proportion of negative examples will be filtered out. This element appears in the configuration file as:
<FILTERING ratio="X" dis="Y"/>
where X represents a number between 0 and 1 and Y can be set as ‘near’ or ‘far’. If the ﬁltering element is not present in the conﬁguration ﬁle, or the value of ratio is set as 0.0, the PR does not perform ﬁltering. The default value of ratio is 0.0. The default value of dis is ‘far’.
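The selection step of FILTERING can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the plugin's implementation: the function name `filter_negatives` is hypothetical, the distances to the hyper-plane are assumed to come from the initial SVM training pass, and the number of removed examples is truncated to an integer.

```python
def filter_negatives(labels, distances, ratio=0.0, dis="far"):
    """Return the indices of the examples kept for the second training pass.

    labels:    +1 / -1 class labels, one per example
    distances: signed distances to the hyper-plane from the first pass
    ratio:     proportion of negative examples to remove (0.0 = no filtering)
    dis:       'near' removes the negatives closest to the hyper-plane,
               'far' removes those furthest from it
    """
    if dis not in ("near", "far"):
        raise ValueError("dis must be 'near' or 'far'")
    negatives = [i for i, label in enumerate(labels) if label < 0]
    n_remove = int(ratio * len(negatives))  # assumption: truncate
    # Order negatives so that the ones to remove come first.
    negatives.sort(key=lambda i: abs(distances[i]), reverse=(dis == "far"))
    dropped = set(negatives[:n_remove])
    return [i for i in range(len(labels)) if i not in dropped]


# Hypothetical data: one positive example and four negatives.
labels = [1, -1, -1, -1, -1]
distances = [0.9, -0.1, -0.5, -1.2, -2.0]

kept_near = filter_negatives(labels, distances, ratio=0.5, dis="near")
kept_far = filter_negatives(labels, distances, ratio=0.5, dis="far")
```

With these values, `kept_near` drops the two negatives at distances -0.1 and -0.5, while `kept_far` drops those at -1.2 and -2.0; positive examples are never removed.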
EVALUATION As outlined above, if the learning mode parameter learningMode is set to ‘EVALUATION’, the PR will perform evaluation of the ML model; it will split the documents in the corpus into two parts, the training dataset and the test dataset, learn a model from the training dataset, apply the model to the testing dataset, and ﬁnally compare the annotations assigned by the model on the test set with the true annotations and output measures of success (e.g. F-measure). The evaluation element speciﬁes the method of splitting the corpus. The item method determines which method to use for evaluation. Currently two commonly used methods are implemented, namely k-fold cross-validation and hold-out test. In k-fold cross-validation the PR segments the corpus into k partitions of equal size, and uses each of the partitions in turn as a test set, with all the remaining documents as a training set. For hold-out test, the system randomly selects some documents as testing data and uses all other documents as training data. The value of the item runs speciﬁes the number ‘k’ for k-fold cross-validation. The value of the item ratio speciﬁes the ratio of the data used for training in the hold-out test method. The element in the conﬁguration ﬁle appears as so:
<EVALUATION method="X" runs="Y" ratio="Z"/>
where the variable X has two possible values ‘kfold’ and ‘holdout’, Y is a positive integer, and Z is a ﬂoat number between 0 and 1. The default value of method is ‘holdout’. The default value of runs is ‘1’. The default value of ratio is ‘0.66’.
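The two splitting methods can be sketched in Python as follows. The function names, the fixed random seed and the way partitions are formed (every k-th document) are assumptions of this sketch; the text only states that k-fold uses partitions of equal size and that hold-out selects test documents at random.

```python
import random


def kfold_splits(documents, runs):
    """Yield (training, test) pairs; each partition is the test set once."""
    for k in range(runs):
        test = documents[k::runs]
        training = [d for i, d in enumerate(documents) if i % runs != k]
        yield training, test


def holdout_split(documents, ratio=0.66, seed=0):
    """Randomly keep a proportion `ratio` for training, the rest for testing."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratio * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical corpus of six documents.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]

folds = list(kfold_splits(docs, runs=3))        # method="kfold" runs="3"
train, test = holdout_split(docs, ratio=0.66)   # method="holdout" ratio="0.66"
```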
multiClassification2Binary. Certain machine learning algorithms, including SVM, are designed to operate on two class problems; they find a separator between two groups of instances. In order to use such algorithms to classify items into a larger number of classes, the problem has to be converted into a series of ‘binary’ (two class) problems. The ML plugin implements two common methods for converting a multi-class problem into several binary problems, namely one against others and one against another. The two methods may have slightly different names in other publications, but the principle is the same. Suppose we have a multi-class classification problem with n classes. For the one against others method, one binary classification problem is derived for each of the n classes. Examples belonging to the class in question are considered to be positive examples and all other examples in the training set are negative examples. In contrast, for the one against another method, one binary classification problem is derived for each pair (c1,c2) of the n classes. Training examples belonging to the class c1 are the positive examples and those belonging to the other class, c2, are the negative examples. The user can select one of the two methods by specifying the value of the item method of the element. The element appears as so:
<multiClassification2Binary method="X" thread-pool-size="N"/>
where the variable X has two values, ‘one-vs-others’ and ‘one-vs-another’. Note that depending on the sample size, the two methods may differ greatly in their speed of execution. The default method is the one-vs-others method. If the configuration file does not have the element or the item method is missing, then the PR will use the one-vs-others method. Since the derived binary classifiers are independent it is possible to learn several of them in parallel. The ‘thread-pool-size’ attribute gives the number of threads that will be used to learn and apply the binary classifiers. If omitted, a single thread will be used to process all the classifiers in sequence.
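To make the difference between the two conversion methods concrete, the following Python sketch enumerates the binary problems each one derives. The function names and class labels are hypothetical, and the sketch assumes that one against another treats pairs as unordered, giving n(n-1)/2 problems.

```python
from itertools import combinations


def one_vs_others(classes):
    """One binary problem per class: (positive class, negative classes)."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def one_vs_another(classes):
    """One binary problem per pair: (positive class c1, negative class c2)."""
    return list(combinations(classes, 2))


# Hypothetical label set with n = 4 classes.
classes = ["Person", "Location", "Organization", "Date"]

ovo_others = one_vs_others(classes)    # n = 4 binary problems
ovo_another = one_vs_another(classes)  # n(n-1)/2 = 6 binary problems
```

One against others trains fewer classifiers, but each sees the whole training set; one against another trains more classifiers, each on the examples of only two classes. This is one reason the two methods can differ in speed.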
thresholdProbabilityBoundary sets a confidence threshold on start and end tokens for chunk learning. It is used in post-processing the learning results. Only those boundary tokens whose confidence level is above the threshold are selected as candidates for the entities. The element in the configuration file appears as so:
<PARAMETER name="thresholdProbabilityBoundary" value="X"/>
The value X is between 0 and 1. The default value is 0.4.
thresholdProbabilityEntity sets a confidence threshold on chunks for chunk learning. The confidence of a chunk is the product of the probabilities of its start token and end token. Only those entities whose confidence level is above the threshold are selected as candidate entities. The element in the configuration file appears as so:
<PARAMETER name="thresholdProbabilityEntity" value="X"/>
The value X is between 0 and 1. The default value is 0.2.
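The interplay of the two chunk-learning thresholds can be sketched in Python as below. The function name `select_chunks` and the candidate tuples are hypothetical; the sketch assumes each candidate chunk already carries the probabilities of its start and end tokens, and applies the default thresholds of 0.4 and 0.2.

```python
def select_chunks(candidates, boundary_threshold=0.4, entity_threshold=0.2):
    """Filter candidate chunks given as (start_prob, end_prob, label).

    A chunk survives only if both boundary tokens exceed the boundary
    threshold and the product of their probabilities exceeds the
    entity threshold.
    """
    selected = []
    for start_prob, end_prob, label in candidates:
        if start_prob <= boundary_threshold or end_prob <= boundary_threshold:
            continue  # a boundary token is not confident enough
        confidence = start_prob * end_prob
        if confidence > entity_threshold:
            selected.append((label, confidence))
    return selected


# Hypothetical candidates: (start probability, end probability, label).
candidates = [
    (0.90, 0.80, "Person"),    # kept: confidence 0.72
    (0.50, 0.30, "Location"),  # dropped: end token below 0.4
    (0.45, 0.42, "Date"),      # dropped: 0.45 * 0.42 = 0.189, below 0.2
]
selected = select_chunks(candidates)
```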
thresholdProbabilityClassification is the confidence threshold for classification (e.g. text classification and relation extraction tasks). In contrast, the above two thresholds are for the chunk recognition task. The corresponding element in the configuration file appears as so:
<PARAMETER name="thresholdProbabilityClassification" value="X"/>
The value X is between 0 and 1. The default value is 0.5.
IS-LABEL-UPDATABLE is a Boolean parameter. If its value is set to ‘true’, the label list is updated from the labels in the training data. Otherwise, a pre-deﬁned label list will be used and cannot be updated from the training data. The conﬁguration element appears as so:
<IS-LABEL-UPDATABLE value="X"/>
The value X is ‘true’ or ‘false’. The default value is ‘true’.
IS-NLPFEATURELIST-UPDATABLE is a Boolean parameter. If its value is set to ‘true’, the NLP feature list is updated from the features in the training or application data. Otherwise, a pre-deﬁned NLP feature list will be used and cannot be updated. The conﬁguration element appears as so:
<IS-NLPFEATURELIST-UPDATABLE value="X"/>
The value X is ‘true’ or ‘false’. The default value is ‘true’.
The parameter VERBOSITY speciﬁes the verbosity level of the output of the system, both to the message window of GATE Developer (or standard out when using GATE Embedded) and into the log ﬁle. Currently there are four verbosity levels. MINIMUM (0) only allows the output of warning messages. NORMAL (1) outputs some important setting information and the results for evaluation mode. DEBUG (2) is used for debugging purposes. The special level NONE disables all logging, and does not even attempt to open the log ﬁle. This level should be used when running the learning PR in a production system (the PR will run much faster with logging disabled), and in particular must be used when running in a multithreaded system (such as GATECloud.net). The element in the conﬁguration ﬁle appears as so:
<VERBOSITY level="X"/>
The value X can be set as NONE, MINIMUM, NORMAL or DEBUG (for backwards compatibility the numeric equivalents -1, 0, 1 or 2 may be used instead). The default value is NORMAL.
MI-TRAINING-INTERVAL speciﬁes the minimal number of newly added documents needed to trigger retraining the model. This parameter is used in MITRAINING. The number is speciﬁed by the value of the feature ‘num’ as so:
<MI-TRAINING-INTERVAL num="X"/>
The default value of X is 1.
BATCH-APP-INTERVAL is used in application mode, and speciﬁes the number of documents to be collected and passed as a batch for classiﬁcation. Please refer to Section 19.2.1 for a detailed explanation of this option. The corresponding element in the conﬁguration ﬁle is:
<BATCH-APP-INTERVAL num="X"/>
The default value of X is 1.
DISPLAY-NLPFEATURES-LINEARSVM relates to ‘VIEWPRIMALFORMMODELS’ mode. In this mode, the most significant features are displayed for each class. For more information about this mode see Section 19.2.1. Two numbers are specified: the number of positively weighted features to display and the number of negatively weighted features to display. It has the following form in the configuration file:
<DISPLAY-NLPFEATURES-LINEARSVM numP="X" numN="Y"/>
where X and Y represent the numbers of positively and negatively weighted features to display, respectively. The default values of X and Y are 10 and 0.
ACTIVELEARNING speciﬁes the settings for active learning. Active learning ranks documents based on the average of a sample of ML annotation conﬁdence scores. A larger sample gives a more accurate ranking but takes longer to calculate. The option has the following form:
<ACTIVELEARNING numExamplesPerDoc="X"/>
where X represents the number of examples per document used to obtain the conﬁdence score with respect to the learned model. The default value of numExamplesPerDoc is 3.
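As an illustration of how several of the optional settings above fit together, the following Python sketch assembles them under an ML-CONFIG root using the standard library. The element and attribute names are those given in this section; the chosen values are examples only. The result is deliberately incomplete: a working configuration file must also contain the mandatory ENGINE and DATASET elements, which are not generated here.

```python
import xml.etree.ElementTree as ET

# Root element required by the Batch Learning PR.
root = ET.Element("ML-CONFIG")

# Optional settings, using the element and attribute names from the text.
ET.SubElement(root, "SURROUND", {"VALUE": "true"})
ET.SubElement(root, "FILTERING", {"ratio": "0.0", "dis": "far"})
ET.SubElement(root, "EVALUATION",
              {"method": "kfold", "runs": "10", "ratio": "0.66"})
ET.SubElement(root, "multiClassification2Binary", {"method": "one-vs-others"})
ET.SubElement(root, "PARAMETER",
              {"name": "thresholdProbabilityBoundary", "value": "0.4"})
ET.SubElement(root, "PARAMETER",
              {"name": "thresholdProbabilityEntity", "value": "0.2"})
ET.SubElement(root, "VERBOSITY", {"level": "NORMAL"})
ET.SubElement(root, "BATCH-APP-INTERVAL", {"num": "1"})

# NOTE: the mandatory ENGINE and DATASET elements are omitted in this sketch.
config_xml = ET.tostring(root, encoding="unicode")
print(config_xml)
```

Generating the file programmatically is optional; the same elements can simply be typed into the XML file by hand.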

The ENGINE Element

The ENGINE element specifies which ML algorithm will be used, and also allows the options to be set for that algorithm.

For SVM learning, the user can choose one of two learning engines. We will discuss the two SVM learning engines below. Note that only linear and polynomial kernels are supported. This is despite the fact that the original SVM packages implemented other types of kernel. Linear and polynomial kernels are popular in natural language learning, and other types of kernel are rarely used. However, if you want to experiment with other types of kernel, you can do so by ﬁrst running the Batch Learning PR in GATE to produce the training and testing data, then using the data with the SVM implementation outside of GATE.

The conﬁguration ﬁles in the test directory (i.e. plugins/learning/test/ under the main gate directory) contain examples for setting the learning engine.

The ENGINE element in the conﬁguration ﬁle is speciﬁed as follows:
<ENGINE nickname="X" implementationName="Y" options="Z"/>

It has three items:

nickname can be the name of the learning algorithm or whatever the user wants it to be.
implementationName refers to the implementation of the particular learning algorithm that the user wants to use. Its value should be one of the following:
- SVMLibSvmJava, the binary classiﬁcation SVM algorithm implemented in the Java version of the SVM package LibSVM.
- SVMExec, a binary SVM implementation of your choice, potentially in a language other than Java, run as a separate process outside of GATE. Currently it can use the SVMlight SVM package²; see the XML file in the GATE distribution (at gate/plugins/learning/test/chunklearning/engines-svm-svmlight.xml) for an example of how to specify the learning engine to be used. The learning engines SVMExec and SVMLibSvmJava should produce the same results in theory but may get slightly different results in practice due to implementation differences. SVMLibSvmJava tends to be faster than SVMExec for smaller training sets. There may, however, be cases where it is an advantage to run SVM as a separate process, in which case SVMExec would be preferable.
- PAUM, the Perceptron with uneven margins, a simple and fast classiﬁcation learning algorithm. (For details about the learning algorithm PAUM, see [Li et al. 02]).
- PAUMExec, a binary PAUM implementation of your choice, potentially in a language other than Java, run as a separate process outside of GATE. The relationship between PAUM and PAUMExec is similar to that between SVMLibSvmJava and SVMExec. You may download and use an implementation in C from http://www.dcs.shef.ac.uk/~yaoyong/paum/paum-learning.zip. See the XML file in the GATE distribution (at gate/plugins/learning/test/chunklearning/engines-paum-exec.xml) for an example of how to specify the learning engine to be used.
- NaiveBayesWeka, the Naive Bayes learning algorithm implemented in Weka.
- KNNWeka, the K nearest neighbour (KNN) algorithm implemented in Weka.
- C4.5Weka, the decision tree algorithm C4.5 implemented in Weka.
options: the value of this item, which depends on the particular learning algorithm, is passed verbatim to the ML engine used. Where an option is absent, the default for that engine is used.
- The options for SVMLibSvmJava are similar to those for LibSVM but with the exception that since SVMLibSvmJava implements the uneven margins SVM algorithms described in [Li & Shawe-Taylor 03], it takes the uneven margins parameter as an option. SVMLibSvmJava options are as follows:
  - -s svm_type; whether the SVM should be binary or multiclass. Default value is 0. Since only binary is supported, the option should be set to 0 or excluded.
  - -t kernel_type; 0 for a linear kernel or 1 for a polynomial kernel. Default value is 0. Note that the current implementation does not support other kernel types, such as the radial basis function and sigmoid kernels.
  - -d degree; the degree of the polynomial kernel, e.g. 2 for a quadratic kernel. Default value is 3.
  - -c cost; the cost parameter C in the SVM. Default value is 1. This parameter determines the cost associated with allowing training errors (‘soft margins’). Allowing some points to be misclassiﬁed by the SVM may produce a more generalizable result.
  - -m cachesize; the cache memory size in MB (default 100).
  - -tau value; setting the value of uneven margins parameter of the SVM. τ = 1 corresponds to the standard SVM. If the training data has just a small number of positive examples and a large number of negative examples, setting the parameter τ to a value less than 1 (e.g. τ = 0.4) often results in better F-measure than the standard SVM (see [Li & Shawe-Taylor 03]).
- The options for SVMExec, using SVMlight, are similar to those for using SVMlight directly for training. Options set the type of kernel, the parameters in the kernel function, the cost parameter, the memory used, etc. The parameter tau is also included, to set the uneven margins parameter, as explained above. The last two terms in the options are the training data file and the model file. An example of the options for SVMExec might be ‘-c 0.7 -t 0 -m 100 -v 0 -tau 0.6 /yaoyong/software/svm-light/data_svm.dat /yaoyong/software/svm-light/model_svm.dat’, meaning that the learner uses a linear kernel, the uneven margins parameter is set to 0.6, and the two files /yaoyong/software/svm-light/data_svm.dat and /yaoyong/software/svm-light/model_svm.dat are used for the training data and the model, respectively. Note that both files specified here are temporary files, which are used only by the SVMlight training program, can be anywhere on your computer, and are independent of the data files produced by the GATE learning plugin. SVMExec also takes a further argument, executableTraining, which specifies the SVM learning program svm_learn.exe in SVMlight. For example, executableTraining=‘/yaoyong/software/svm-light/svm_learn.exe’ specifies one particular svm_learn.exe obtained from the SVMlight package.
- The PAUM engine has three options: ‘-p’ for the positive margin, ‘-n’ for the negative margin, and ‘-optB’ for the modification of the bias term. For example, options=‘-p 50 -n 5 -optB 0.3’ means τ₊ = 50, τ₋ = 5 and b = b + 0.3 in the PAUM algorithm.
- The KNN algorithm has one option; the number of neighbours used. It is set via ‘-k X’. The default value is 1.
- There are no options for Naive Bayes and C4.5 algorithms.
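
As an illustration, the engine choices described above might be written as follows. This is a sketch rather than a tested configuration: the nickname values are arbitrary, the option values are those used as examples in this section, and a real configuration file would contain only one ENGINE element. Writing executableTraining as an attribute of ENGINE is an assumption; the engines-svm-svmlight.xml file mentioned above shows the form actually used.

```xml
<!-- Alternatives: a configuration file contains only one ENGINE element. -->

<!-- LibSVM in Java: quadratic polynomial kernel, cost 1, uneven margin 0.4 -->
<ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options="-t 1 -d 2 -c 1 -tau 0.4"/>

<!-- PAUM: positive margin 50, negative margin 5, bias shifted by 0.3 -->
<ENGINE nickname="PAUM" implementationName="PAUM"
        options="-p 50 -n 5 -optB 0.3"/>

<!-- Weka KNN with 5 neighbours -->
<ENGINE nickname="KNN" implementationName="KNNWeka" options="-k 5"/>

<!-- Naive Bayes takes no options, so the attribute is left out -->
<ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>

<!-- SVMlight as an external process. The executableTraining attribute
     placement is an assumption; the paths are the ones quoted in the text. -->
<ENGINE nickname="SVMlight" implementationName="SVMExec"
        options="-c 0.7 -t 0 -m 100 -v 0 -tau 0.6 /yaoyong/software/svm-light/data_svm.dat /yaoyong/software/svm-light/model_svm.dat"
        executableTraining="/yaoyong/software/svm-light/svm_learn.exe"/>
```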

The DATASET Element

The DATASET element defines the type of annotation to be used as training instance and the set of attributes that characterise the instances. The INSTANCE-TYPE sub-element is used to select the annotation type to be used for instances. There will be one training instance for every one of the instance annotations in the corpus. For example, if INSTANCE-TYPE has ‘Token’ as its value, there will be one training instance in the document per token. This also means that the positions (see below) are defined in relation to tokens. INSTANCE-TYPE can be seen as the basic unit to be taken into account for machine learning. The attributes of the instance are defined by a sequence of ATTRIBUTE, ATTRIBUTE_REL or ATTRIBUTELIST elements.

Diﬀerent NLP learning tasks may have diﬀerent instance types and use diﬀerent kinds of attribute elements. Chunking recognition often uses the token as instance type and the linguistic features of ‘Token’ and other annotations as features. Text classiﬁcation’s instance type is the text unit for classiﬁcation, e.g. the whole document, or sentence, or token. If classifying for example a sentence, n-grams (see below) are often a good feature representation for many statistical learning algorithms. For relation extraction, the instance type is a pair of terms that may be related, and the features come from not only the linguistic features of each of the two terms but also those related to both terms taken together.

The DATASET element should deﬁne an INSTANCE-TYPE sub-element, it should deﬁne an ATTRIBUTE sub-element or an ATTRIBUTE_REL sub-element as class, and it should deﬁne some linguistic feature related sub-elements (‘linguistic feature’ or ‘NLP feature’ is used here to distinguish features or attributes used for machine learning from features in the sense of a feature of a GATE annotation). All the annotation types involved in the dataset deﬁnition should be in the same annotation set. Each of the sub-elements deﬁning the linguistic features (attributes) should contain an element deﬁning the annotation TYPE to be used and an element deﬁning the FEATURE of the annotation type to use. For instance, TYPE might be ‘Person’ and FEATURE might be ‘gender’. For an ATTRIBUTE sub-element, if you do not specify FEATURE, the entire sub-element will be ignored. Therefore, if an annotation type you want to use does not have any annotation features, you should add an annotation feature to it and assign the same value to the feature for all annotations of that type. Note that if blank spaces are contained in the values of the annotation features, they will be replaced by the character ‘_’ in each occurrence. So it is advisable that the values of the annotation features used, in particular for the class label, do not contain any blank space.
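
A minimal sketch of a DATASET definition that meets these requirements is given here. The annotation types (‘Token’, ‘Mention’), the feature names and the attribute names are illustrative assumptions; any types held in the same annotation set could take their place.

```xml
<DATASET>
  <!-- one training instance per Token annotation -->
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <!-- a linguistic (NLP) feature: both TYPE and FEATURE are given -->
  <ATTRIBUTE>
    <NAME>Form</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
    <POSITION>0</POSITION>
  </ATTRIBUTE>
  <!-- the class attribute, marked with the empty CLASS element;
       its values should not contain blank spaces -->
  <ATTRIBUTE>
    <NAME>Class</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Mention</TYPE>
    <FEATURE>class</FEATURE>
    <POSITION>0</POSITION>
    <CLASS/>
  </ATTRIBUTE>
</DATASET>
```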

Below, we explain all the sub-elements one by one. Please also refer to the example configuration files presented in the next section. Note that each sub-element should have a unique name, if it requires a name, unless we explicitly state otherwise.

The INSTANCE-TYPE sub-element is defined as
<INSTANCE-TYPE>X</INSTANCE-TYPE>
where X is the annotation type used as instance unit for learning, for example ‘Token’. For relation extraction, the user should also specify the two arguments of the relation, as follows:
<INSTANCE-ARG1>A</INSTANCE-ARG1>
<INSTANCE-ARG2>B</INSTANCE-ARG2>
The values of A and B should be identiﬁers for the ﬁrst and second terms of the relation, respectively. These names will be used later in the conﬁguration ﬁle. An example can be found at /gate/plugins/learning/test/relation-learning/engines-svm.xml.
An ATTRIBUTE element has the following sub-elements:
- NAME; the name of the attribute. Its value should not end with ‘gram’, since this is reserved for n-gram features as mentioned below. This attribute name will appear in output ﬁles, so it is useful to give a descriptive name.
- SEMTYPE; type of the attribute value. It can be ‘NOMINAL’ or ‘NUMERIC’. Currently only nominal is supported.
- TYPE; the annotation type used to extract the attribute.
- FEATURE; the value of the attribute will be the value of the named feature on the annotation of the speciﬁed type.
- POSITION; the position of the instance annotation to be used for extracting the feature relative to the current instance annotation. 0 refers to the current instance annotation, -1 refers to the preceding instance annotation, 1 refers to the following one and so forth. Recall that we deﬁned INSTANCE-TYPE at the start of the DATASET element. This type might for example be ‘Token’. In the current ATTRIBUTE element we are deﬁning an annotation type to use to get the feature from, separate and possibly diﬀerent from the INSTANCE-TYPE. For example, we might be interested in the ‘majorType’ of a ‘Lookup’. By specifying -1, we would be saying, move to the preceding ‘Token’ and then try to extract the ‘majorType’ of the ‘Lookup’ on that token. The default value of the parameter is 0. Note that if our INSTANCE-TYPE were to be for example a named entity annotation comprising multiple tokens, and we wanted to extract a feature on the ‘Token’ annotation, then all the tokens within it would be considered to be in the zero position relative to the current instance annotation, and the current implementation would simply pick the ﬁrst. (Useful in this case might be the NGRAM attribute type, described later, which can be used to extract features for each member of a multi-token annotation.) In the current implementation, features are weighted according to their distance from the current instance annotation. In other words, features which are further removed from the current instance annotation are given reduced importance. The component value in the feature vector for one attribute feature is 1 if the attribute’s position p is 0. Otherwise its value is 1.0∕|p|.
- <CLASS/>: an empty element used to mark the class attribute. There can only be one attribute marked as class in a dataset deﬁnition. The attribute, as described above, has speciﬁed TYPE and FEATURE; the features of the type are the class labels. Since only one attribute can be marked as class, it may be necessary to preprocess your data to put all class labels into a feature of one type of annotation, e.g. you might create a ‘Mention’ annotation, with the feature ‘Class’, which is set to the class name.
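
The ‘Lookup’ example discussed under POSITION could be written as the following ATTRIBUTE element. This is only a sketch: the attribute name ‘PrevGaz’ is an invented, illustrative name, and the instance type is assumed to be ‘Token’.

```xml
<ATTRIBUTE>
  <NAME>PrevGaz</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Lookup</TYPE>
  <FEATURE>majorType</FEATURE>
  <!-- -1: move to the preceding instance annotation (the previous Token)
       and take the majorType of the Lookup found there -->
  <POSITION>-1</POSITION>
</ATTRIBUTE>
```

Under the weighting rule given above, this attribute has the component value 1.0/|-1| = 1.0, whereas the same attribute at position -2 would have the value 1.0/2 = 0.5.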
The ATTRIBUTELIST element is similar to ATTRIBUTE except that it has no POSITION sub-element but instead a RANGE element. It will be converted into several attributes with positions ranging from the value of ‘from’ to the value of ‘to’. It defines a ‘context window’ containing several consecutive examples. The ATTRIBUTELIST should be preferred when defining a context window for features, not only because it avoids the duplication of ATTRIBUTE elements, but also because processing is faster (see the discussion of the WINDOWSIZE element below).
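
As a sketch, a five-position context window over gazetteer lookups might be declared like this. The name ‘Gaz’ and the choice of ‘Lookup’ and ‘majorType’ are illustrative.

```xml
<ATTRIBUTELIST>
  <NAME>Gaz</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Lookup</TYPE>
  <FEATURE>majorType</FEATURE>
  <!-- converted into attributes at positions -2, -1, 0, 1 and 2 -->
  <RANGE from="-2" to="2"/>
</ATTRIBUTELIST>
```

This single element stands in for five separate ATTRIBUTE elements that differ only in their POSITION value.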
The WINDOWSIZE element speciﬁes the size of the context window. This will override the context window size deﬁned in every ATTRIBUTELIST. If the WINDOWSIZE element is not present in the conﬁguration ﬁle, the window size deﬁned in each element ATTRIBUTELIST will be used; otherwise, the window size speciﬁed by this element will be used for each ATTRIBUTELIST if it contains one ATTRIBUTE at position 0 (otherwise the ATTRIBUTELIST will be ignored). This element can be used for speeding up the process of extracting the feature vectors from the documents. The element has two features specifying the length of left and right sides of context window. It has the following form:
<WINDOWSIZE windowSizeLeft="X" windowSizeRight="Y"/>
where X and Y represent the lengths of the left and right sides of the context window, respectively. For example, if X = 2 and Y = 1, then the context window will run from position -2 to position 1 (e.g. from the second token on the left, through the current token, to the first token on the right).
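
The interaction between WINDOWSIZE and an ATTRIBUTELIST can be sketched as below. It is assumed here that both elements appear inside the same DATASET element; the attribute list itself is illustrative.

```xml
<!-- overrides the RANGE of every ATTRIBUTELIST in the dataset definition -->
<WINDOWSIZE windowSizeLeft="2" windowSizeRight="1"/>

<ATTRIBUTELIST>
  <NAME>Form</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Token</TYPE>
  <FEATURE>string</FEATURE>
  <!-- this range includes position 0, so the list is kept, but the
       window actually used runs from -2 to 1 rather than from -5 to 5 -->
  <RANGE from="-5" to="5"/>
</ATTRIBUTELIST>
```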
An NGRAM feature is used for characterising an instance annotation in terms of constituent sequences of subsumed feature annotations. It is essentially a reversal of the ATTRIBUTELIST principle; where ATTRIBUTELIST uses a sequence surrounding an instance in order to classify the instance, NGRAM uses sequences within the instance as features. It simply creates a series of attributes that constitute a sliding window across the whole of the current instance annotation. For example, INSTANCE-TYPE might be sentences, in sentence classification, and the NGRAM attribute specification could be used to create a series of unigram features for the sentence, effectively a ‘bag of words’ representation. Conventionally, one would use the string of the token, or perhaps its lemma, as the feature for the NGRAM; however, it is possible to specify multiple features of choice, as shown below. An NGRAM element has the following sub-elements:
- NAME; name of the n-gram. Its value should end with ‘gram’.
- NUMBER; the ‘n’ of the n-gram, with value 1 for unigram, and 2 for bigram, etc. Note that when using n-grams with n greater than 1, you should leave a unigram in the conﬁguration ﬁle also.
- CONSNUM; several features can be used to generate n-grams. For example, n-grams of token strings could be used as well as n-grams of lemmas. Where CONSNUM is ‘k’, the NGRAM element should have ‘k’ CONS-X sub-elements, where X = 1, ..., k. Each CONS-X element has one TYPE sub-element and one FEATURE sub-element, which define the feature to be used for that term to create n-grams.
- The WEIGHT sub-element specifies a weight for the n-gram feature. The n-gram part of the feature vector for one instance is normalised, thus having a default value of 1.0. If the user wants to adjust the contribution of the n-gram to the whole feature vector, they can do so by setting the WEIGHT parameter. For example, if the user is doing sentence classification with two features, the unigram of tokens in a sentence and the length of the sentence, then by default the whole NGRAM attribute specification is given only the same importance as the sentence length feature. To experiment with increasing the importance of the n-gram element, the user can set the WEIGHT sub-element of the NGRAM element to a number bigger than 1.0 (such as 10.0). Every component of the n-gram part of the feature vector is then multiplied by this value.
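
Putting these sub-elements together, a bigram over token strings, accompanied by the unigram that the NUMBER description recommends keeping, might be sketched as follows. The names ‘Token1gram’ and ‘Token2gram’ are illustrative (they only need to be unique and to end with ‘gram’), and the weight of 2.0 is an arbitrary example value.

```xml
<!-- unigram of token strings, kept alongside the bigram -->
<NGRAM>
  <NAME>Token1gram</NAME>
  <NUMBER>1</NUMBER>
  <CONSNUM>1</CONSNUM>
  <CONS-1>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
  </CONS-1>
</NGRAM>

<!-- bigram over the same feature; every component of this part of the
     feature vector is multiplied by 2.0 -->
<NGRAM>
  <NAME>Token2gram</NAME>
  <NUMBER>2</NUMBER>
  <CONSNUM>1</CONSNUM>
  <CONS-1>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
  </CONS-1>
  <WEIGHT>2.0</WEIGHT>
</NGRAM>
```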
The ValueTypeNgram element specifies the type of value used in the n-gram. Currently it can take one of three types: ‘binary’, ‘tf’ and ‘tf-idf’, which are explained in Section 19.2.4. The value is specified by the X in
<ValueTypeNgram>X</ValueTypeNgram>
where X = 1 for binary, 2 for tf, and 3 for tf-idf. The default value is 3.
The FEATURES-ARG1 element deﬁnes the features related to the ﬁrst argument of the relation for relation learning. It should include one ARG sub-element referring to the GATE annotation of the argument (see below for a detailed explanation). It may include other sub-elements, such as ATTRIBUTE, ATTRIBUTELIST and/or NGRAM, to deﬁne the linguistic features related to the argument. Features pertaining particularly to one or the other argument of a relation should be deﬁned in FEATURES-ARG1 or FEATURES-ARG2 as appropriate. Features relating to both arguments should be deﬁned using an ATTRIBUTE_REL.
The FEATURES-ARG2 element defines the features related to the second argument of the relation. Like the FEATURES-ARG1 element, it should include one ARG sub-element. It may also include other sub-elements. The ARG sub-element in FEATURES-ARG2 should have a unique name which is different from the name of the ARG sub-element in FEATURES-ARG1. However, other sub-elements may have the same name as the corresponding ones in FEATURES-ARG1, if they refer to the same annotation type and feature in the text.
The ARG element is used in both FEATURES-ARG1 and FEATURES-ARG2. It speciﬁes the annotation corresponding to one argument of a relation. It has four sub-elements, as follows;
- NAME; a unique name for the argument (e.g. ‘ARG1’).
- SEMTYPE; the type of the arg value. This can be ‘NOMINAL’ or ‘NUMERIC’. Currently only nominal is implemented.
- TYPE; the annotation type for the argument.
- FEATURE; the value of the named feature on the annotation of the specified type is the identifier of the argument. The annotation is regarded as an argument of the relation instance under consideration only if this value is the same as the value of the feature specified in the sub-element <INSTANCE-ARG1>A</INSTANCE-ARG1> (or <INSTANCE-ARG2>B</INSTANCE-ARG2>).
The ATTRIBUTE_REL element is similar to the ATTRIBUTE element. However, it does not have the POSITION sub-element, and it has two other sub-elements, ARG1 and ARG2, relating to the two argument features of the (relation) instance type. The feature defined in an ATTRIBUTE_REL sub-element is assigned to the instance under consideration if and only if the value X in the sub-element <ARG1>X</ARG1> is the same as the value A in the first argument instance <INSTANCE-ARG1>A</INSTANCE-ARG1>, and the value Y in the sub-element <ARG2>Y</ARG2> is the same as the value B in the second argument instance <INSTANCE-ARG2>B</INSTANCE-ARG2>. For relation learning, an ATTRIBUTE_REL is denoted as the class attribute by including <CLASS/>.

19.2.2 Case Studies for the Three Learning Types

The following are three illustrated examples of conﬁguration ﬁles for information extraction, sentence classiﬁcation and relation extraction. Note that the conﬁguration ﬁle is in the XML format, and should be stored in a ﬁle with the ‘.xml’ extension.

Information Extraction

The ﬁrst example is for information extraction. The corpus is prepared with annotations providing class information as well as the features to be used. Class information is provided in the form of a single annotation type, ‘Mention’, which contains a feature ‘class’. Within the class feature is the name of the class of the textual chunk. Other annotations in the dataset include ‘Token’ and ‘Lookup’ annotations as provided by ANNIE. All of these annotations are in the same annotation set, the name of which will be passed as a runtime parameter.

The configuration file is given below. The optional settings are in the first part. It first specifies surround mode as ‘true’; we will find the chunks that correspond to our entities by using machine learning to locate the start and end of the chunks. Then it specifies the filtering settings. Since we are going to use SVM in this problem, we can filter our data to remove some of the negative instances that can cause problems if they are too dominant. The ratio’s value is ‘0.1’ and the dis’s value is ‘near’, meaning that an initial SVM learning step will be executed and the 10% of negative examples which are closest to the learned SVM hyper-plane will be removed in the filtering stage, before the final learning is executed. The threshold probabilities for the boundary tokens and information entity are set as ‘0.4’ and ‘0.2’, respectively; boundary tokens found with a lower confidence than the threshold will be rejected. The threshold probability for classification is also set, as ‘0.5’; this parameter will be ignored in this case, however, since we are doing chunk learning with surround mode set as ‘true’. multiClassification2Binary is set as ‘one-vs-others’, meaning that the ML API will convert the multi-class classification problem into a series of binary classification problems using the one against others approach. In evaluation mode, 2-fold cross-validation will be used, dividing the corpus into two equal parts and running two training/test cycles, with each part used as the training data in turn.

The second part is the sub-element ENGINE, specifying the learning algorithm. The PR will use the LibSVM SVM implementation. The options specify a linear kernel, with the cost C set to 0.7 and the cache memory to 100 MB. Additionally, it will use uneven margins, with τ set to 0.4.

The last part is the DATASET sub-element, deﬁning the linguistic features used. It ﬁrst speciﬁes the ‘Token’ annotation as instance type. The ﬁrst ATTRIBUTELIST allows the token’s string as a feature of an instance. The range from ‘-5’ to ‘5’ means that the strings of the current token instance as well as its ﬁve preceding tokens and its ﬁve ensuing tokens will be used as features for the current token instance. The next two attribute lists deﬁne features based on the tokens’ capitalisation information and types. The ATTRIBUTELIST named ‘Gaz’ uses as attributes the values of the feature ‘majorType’ of the annotation type ‘Lookup’. The ﬁnal ATTRIBUTE feature deﬁnes the class attribute; it has the sub-element <CLASS/>. The values of the feature ‘class’ of the annotation type ‘Mention’ are the class labels.

<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="true"/>
  <FILTERING ratio="0.1" dis="near"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-others"/>
  <EVALUATION method="kfold" runs="2"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options=" -c 0.7 -t 0 -m 100 -tau 0.4  "/>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTELIST>
       <NAME>Form</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Token</TYPE>
       <FEATURE>string</FEATURE>
       <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
       <NAME>Orthography</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Token</TYPE>
       <FEATURE>orth</FEATURE>
       <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
       <NAME>Tokenkind</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Token</TYPE>
       <FEATURE>kind</FEATURE>
       <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
       <NAME>Gaz</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Lookup</TYPE>
       <FEATURE>majorType</FEATURE>
       <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTE>
       <NAME>Class</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Mention</TYPE>
       <FEATURE>class</FEATURE>
       <POSITION>0</POSITION>
       <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>

Sentence Classification

We will now consider the case of sentence classiﬁcation. The corpus in this example is annotated with ‘Sentence’ annotations, which contain the feature ‘sent_size’, as well as the class of the sentence. Furthermore, ‘Token’ annotations are applied, having features ‘category’ and ‘root’. As before, all annotations are in the same set, and the annotation set name will be passed to the PR at run time.

Below is an example configuration file. It first specifies surround mode as ‘false’, because it is a text classification problem; we are interested in classifying single instances rather than chunks of instances. Our targets of interest, sentences, have already been found (unlike in the information extraction example, where identifying the limits of the entity was part of the problem). The next two options allow the label list and the NLP feature list to be updated from the training data when retraining. It also specifies probability thresholds for entity and entity boundary. Note that these two specifications will not be used in this case. However, their presence is not problematic; they will simply be ignored. The probability threshold for classification is set as ‘0.5’. This will be used to decide which classifications to accept and which to reject as being too unlikely. (Altering this parameter can trade off precision against recall and vice versa.) The evaluation will use the hold-out test method. It will randomly select 66% of the documents from the corpus for training, and the remaining 34% of the documents will be used for testing. It will run the evaluation twice, and average the results over the two runs. Note that it does not specify the method of converting a multi-class classification problem into several binary class problems, meaning that it will adopt the default (namely one against all others).

The conﬁguration ﬁle speciﬁes KNN (K-Nearest Neighbour) as the learning algorithm. It also speciﬁes the number of neighbours used as 5. Of course other learning algorithms can be used as well. For example, the ENGINE element in the previous example, which speciﬁes SVM as learning algorithm, can be put into this conﬁguration ﬁle to replace the current one.

In the DATASET element, the annotation ‘Sentence’ is used as instance type. Two kinds of linguistic features are defined; one is NGRAM and the other is ATTRIBUTE. The n-gram is based on the annotation ‘Token’. It is a unigram, as its NUMBER element has the value 1. This means that a ‘bag of words’ feature will be formed from the tokens comprising the sentence. It is based on the two features, ‘root’ and ‘category’, of the annotation ‘Token’. This introduces a new aspect to the n-gram. The n-gram feature comprises counts of the unigrams appearing in the sentence. For example, if the sentence were ‘the man walked the dog’, the unigram feature would contain the information that ‘the’ appeared twice, and ‘man’, ‘walked’ and ‘dog’ appeared once. However, since our n-gram has two features, ‘root’ and ‘category’, two tokens will be considered the same term if and only if they have the same ‘root’ feature and the same ‘category’ feature. The weight of the n-gram is set as 10.0, meaning its contribution is ten times that of the other feature, the sentence length. The feature ‘sent_size’ of the annotation ‘Sentence’ is given as an ATTRIBUTE feature. Finally the values of the feature ‘class’ of the annotation ‘Sentence’ are nominated as the class labels.

<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="false"/>
  <IS-LABEL-UPDATABLE value="true"/>
  <IS-NLPFEATURELIST-UPDATABLE value="true"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.42"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <EVALUATION method="holdout" runs="2" ratio="0.66"/>
  <ENGINE nickname="KNN" implementationName="KNNWeka" options = " -k 5 "/>
  <DATASET>
     <INSTANCE-TYPE>Sentence</INSTANCE-TYPE>
     <NGRAM>
        <NAME>Sent1gram</NAME>
        <NUMBER>1</NUMBER>
        <CONSNUM>2</CONSNUM>
        <CONS-1>
            <TYPE>Token</TYPE>
            <FEATURE>root</FEATURE>
        </CONS-1>
        <CONS-2>
            <TYPE>Token</TYPE>
            <FEATURE>category</FEATURE>
        </CONS-2>
        <WEIGHT>10.0</WEIGHT>
     </NGRAM>
     <ATTRIBUTE>
        <NAME>Sent_size</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Sentence</TYPE>
        <FEATURE>sent_size</FEATURE>
        <POSITION>0</POSITION>
     </ATTRIBUTE>
     <ATTRIBUTE>
        <NAME>Class</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Sentence</TYPE>
        <FEATURE>class</FEATURE>
        <POSITION>0</POSITION>
        <CLASS/>
     </ATTRIBUTE>
   </DATASET>
</ML-CONFIG>
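
As noted above, the learning algorithm can be changed without touching the rest of the file. As a sketch, swapping in the SVM engine from the information extraction example would mean replacing the KNNWeka ENGINE element with the one below; the option values are simply carried over from that example and are not tuned for sentence classification.

```xml
<!-- replaces the KNNWeka ENGINE element in the configuration above -->
<ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options=" -c 0.7 -t 0 -m 100 -tau 0.4  "/>
```

Because no multiClassification2Binary element is given in this file, the default one against others conversion would apply to the SVM.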

Relation Extraction

The last example is for relation extraction. The relation extraction support in the PR is based on the work described in [Wang et al. 06].

Two concepts are key in a relation extraction corpus. Entities are the things that may be related, and relations describe the relationship between the entities if any. In our example, entities are pre-identiﬁed, and the task is to identify the relationships between them. The corpus for this example is annotated with the following:

‘ACEEntity’ annotations indicate the entities of interest in the corpus.
‘RE_INS’ annotations form the instances, and there is an instance for every pair of ‘ACEEntities’ within a sentence. ‘RE_INS’ annotations span the whole of the text between and including their ‘ACEEntity’ annotations. For example, ‘the commander of Israeli troops’ might be a potential relationship between a person, ‘the commander’, and an entity, ‘Israeli troops’. Its ‘RE_INS’ annotation covers the whole of this text. It contains ‘arg1’ and ‘arg2’ features containing the numerical identifiers of the two ‘ACEEntities’ to which it pertains. These numerical identifiers match the ‘MENTION_ID’ feature of the ‘ACEEntity’ annotation.
‘ACERelation’ annotations indicate the relations we wish to learn, and also span the whole of the text involved in the relationship. They include the features ‘MENTION_ARG1’ and ‘MENTION_ARG2’, which, again, contain the numerical identifier found in the ‘MENTION_ID’ feature of the ‘ACEEntity’ annotations, as well as ‘Relation_type’, indicating the type of the relation.
Various ANNIE-style annotations are also included.

Our task is to select the ‘RE_INS’ instances that match the ‘ACERelations’. You will see that throughout the conﬁguration ﬁle, annotation types are speciﬁed in conjunction with argument identiﬁers. This is because we need to ensure that the annotation in question pertains to the right entities. Therefore, argument identiﬁers are used to constrain the match.

The configuration file does not specify any optional settings, meaning that it uses the default values for all of them (see Section 19.2.1 for the default values of all possible settings). In particular:

it sets the surround mode as ‘false’;
both the label list and NLP feature list are updatable;
the probability threshold for classification is set as 0.5;
it uses ‘one against others’ for converting a multi-class problem into binary class problems for SVM learning;
for evaluation it uses hold-out testing with a ratio of 0.66 and only one run.

The conﬁguration ﬁle speciﬁes the learning algorithm as the Naive Bayes method implemented in Weka. However, other learning algorithms could equally well be used.

We begin by deﬁning ‘RE_INS’ as the instance type. Next, we provide the numeric identiﬁers of each argument of the relationship by specifying elements INSTANCE-ARG1 and INSTANCE-ARG2 as the feature names ‘arg1’ and ‘arg2’ respectively. This indicates that the argument identiﬁers of the instances can be found in the ‘arg1’ and ‘arg2’ features of the ‘RE_INS’ annotations.

Attributes might pertain to the entire relation or they might pertain to one or other argument within the relation. We are going to begin by defining the features specific to each argument of the relation. Recall that our ‘RE_INS’ annotations have as arguments two ‘ACEEntity’ annotations, and that these are identified by their ‘MENTION_ID’ being the same as the ‘arg1’ or ‘arg2’ features of the ‘RE_INS’. It is from these ‘ACEEntity’ annotations that we wish to obtain argument-specific features. The FEATURES-ARG1 and FEATURES-ARG2 elements begin by specifying which annotation we are referring to. We use the ARG element to explain this. We are interested in annotations of type ‘ACEEntity’, and their ‘MENTION_ID’ must match ‘arg1’ or ‘arg2’ of ‘RE_INS’ as appropriate. Having identified precisely which ‘ACEEntity’ we are interested in, we can go on to give argument-specific features; in this case, the ‘string’ feature of ‘Token’, given as an ATTRIBUTE at position 0.

We now wish to deﬁne features pertaining to the entire relation. We indicate that the ‘t12’ feature of ‘RE_INS’ annotations is to be used (this feature contains type information derived from ‘ACEEntity’). Again, rather than just specifying the ‘RE_INS’ annotation, we also indicate that the ‘arg1’ and ‘arg2’ feature values must match the argument identiﬁers of the instance, as deﬁned in the INSTANCE-ARG1 and INSTANCE-ARG2 elements at the beginning. This ensures that we are taking our features from the correct annotation.

Finally, we define the class attribute. We indicate that the class attribute is contained in the ‘Relation_type’ feature of the ‘ACERelation’ annotation. The ‘ACERelation’ annotation type has features ‘MENTION_ARG1’ and ‘MENTION_ARG2’, indicating its arguments. Again, we use the elements ARG1 and ARG2 to indicate that it is these features that must be matched to the arguments of the instance if that instance is to be considered a positive example of the class.

<?xml version="1.0"?>
<ML-CONFIG>
  <ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>
  <DATASET>
    <INSTANCE-TYPE>RE_INS</INSTANCE-TYPE>
    <INSTANCE-ARG1>arg1</INSTANCE-ARG1>
    <INSTANCE-ARG2>arg2</INSTANCE-ARG2>
    <FEATURES-ARG1>
      <ARG>
        <NAME>ARG1</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>ACEEntity</TYPE>
        <FEATURE>MENTION_ID</FEATURE>
      </ARG>
      <ATTRIBUTE>
        <NAME>Form</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <POSITION>0</POSITION>
      </ATTRIBUTE>
    </FEATURES-ARG1>
    <FEATURES-ARG2>
      <ARG>
        <NAME>ARG2</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>ACEEntity</TYPE>
        <FEATURE>MENTION_ID</FEATURE>
      </ARG>
      <ATTRIBUTE>
        <NAME>Form</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <POSITION>0</POSITION>
      </ATTRIBUTE>
    </FEATURES-ARG2>
    <ATTRIBUTE_REL>
      <NAME>EntityCom1</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>RE_INS</TYPE>
      <ARG1>arg1</ARG1>
      <ARG2>arg2</ARG2>
      <FEATURE>t12</FEATURE>
    </ATTRIBUTE_REL>
    <ATTRIBUTE_REL>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>ACERelation</TYPE>
      <ARG1>MENTION_ARG1</ARG1>
      <ARG2>MENTION_ARG2</ARG2>
      <FEATURE>Relation_type</FEATURE>
      <CLASS/>
    </ATTRIBUTE_REL>
  </DATASET>
</ML-CONFIG>

19.2.3 How to Use the Batch Learning PR in GATE Developer [#]

The Batch Learning PR implements the procedure of using supervised machine learning for NLP, which generally has two steps: training and application. The training step learns models from labelled data. The application step applies the learned models to unlabelled data in order to add labels. Therefore, in order to use supervised ML for NLP, one should have some labelled data, which can be obtained either by manually annotating documents or from other resources. One also needs to determine which linguistic features are to be used in training. (The same features should be used in application as well.) In this implementation, all machine learning attributes are GATE annotation features. Finally, one should determine which learning algorithm will be used.

Based on the general procedure outlined above, we explain how to use the Batch Learning PR step by step below:

Annotate some documents with labels that you want to learn. The labels should be represented by the values of a feature of a GATE annotation type (not the annotation type itself).
Determine the linguistic features that you want the PR to use for learning.
Annotate the documents (training and application) with the desired features. ANNIE can be useful in this regard. Other PRs such as GATE morphological analyser and the parsers may produce useful features as well. You may need to write some JAPE scripts to produce the features you want.
Create an XML conﬁguration ﬁle for your learning problem. The ﬁle should contain one DATASET element specifying the NLP features used, one ENGINE element specifying the learning algorithm, and some optional settings as necessary. (Tip: it may be easier to copy one of the conﬁguration ﬁles presented above and modify it for your problem than to write a conﬁguration ﬁle from scratch.)
Load the training documents containing the required annotations representing the linguistic features and the class label, and put them into a corpus. All linguistic features and the class feature should be in the same annotation set. (The Annotation Set Transfer PR in the ‘Tools’ plugin can be useful here.)
Load the Batch Learning PR into GATE Developer. First you need to load the plugin named ‘learning’ using the tool Manage CREOLE Plugins. Then you can create a new ‘Batch Learning PR’. You will need to provide the configuration file as an initialization parameter. After that you can put the PR into a Corpus Pipeline application to use it. Add the corpus containing the training documents to the application too. Set the inputASName to the annotation set containing the annotations for linguistic features and class labels.
Set the run-time parameter learningMode to ‘TRAINING’ to learn a model from the training data, or set learningMode to ‘EVALUATION’ to do evaluation on the training data and get ﬁgures indicating the success of the learning. When using evaluation mode, make sure that the outputASName is the same as the inputASName. (Tip: it may save time if you ﬁrst try evaluation mode on a small number of documents to make sure that the ML PR works well on your problem and outputs reasonable results before training on the large data.)
If you want to apply the learned model to new documents, load those new documents into GATE and pre-process them in the same way as the training documents, to ensure that the same features are present. (Class labels need not be present, of course.) Then set learningMode to ‘APPLICATION’ and run the PR on this corpus. The application results, namely the new annotations containing the class labels, will be added into the annotation set speciﬁed by the outputASName.
If you just want the feature ﬁles produced by the system and do not want to do any learning or application, select the learning mode ‘ProduceFeatureFilesOnly’.

19.2.4 Output of the Batch Learning PR [#]

The Batch Learning PR outputs several different kinds of information. Firstly, it outputs information about the learning settings. This information is printed in the Messages Window of GATE Developer (or to standard output if using GATE Embedded) and also written to the log file ‘logFileForNLPLearning.save’. The amount of information displayed is controlled by the VERBOSITY parameter in the configuration file. The main output of the learning system differs between usage modes. In training mode the system produces the learned models. In application mode it annotates the documents using the learned models. In evaluation mode it displays the evaluation results. Finally, in ‘ProduceFeatureFilesOnly’ mode, it produces feature files for the current corpus. Below, we explain the outputs for the different learning modes.

Note that all the ﬁles produced by the Batch Learning PR, including the log ﬁle, are placed in the sub-directory ‘savedFiles’ of the ML working directory. The ML working directory is the directory containing the conﬁguration ﬁle.

Training Results

When the Batch Learning PR is used in training mode, its main output is the learned model, stored in a file named ‘learnedModels.save’. For the SVM algorithm, the learned model file is a text file. For the learning algorithms implemented in Weka, the model file is a binary file. The output also includes the feature files described under ‘Feature Files’ later in this section.

Application Results

The main application result is the annotations added to the documents. Those annotations are the results of applying the ML model to the documents. The configuration file specifies the annotation type and feature of the class labels; class labels must be the value of a feature of an annotation type. In application mode, annotations of that type are created in the new documents, and the specified feature holds the class label. An additional feature is also included on each such annotation: ‘prob’ holds the confidence level for the annotation.

Evaluation Results

The Batch Learning PR outputs the evaluation results for each run and also the averaged results over all runs. For each run, it ﬁrst prints a message about the names of the documents in training and testing corpora respectively. Then it displays the evaluation results of this run; ﬁrst the results for each class label and then the micro-averaged results over all labels. For each label, it presents the name of the label, the number of instances belonging to the label in the training data and results on the test data; the numbers of correct, partially correct, spurious and missing instances in the testing data, and the precision, recall and F1, calculated using correct only (strict) and correct plus partial (lenient). The F-measure results are obtained using the AnnotationDiﬀ Tool which is described in Chapter 10. Finally, the system presents the means of the results of all runs for each label and the micro-averaged results.
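The strict and lenient figures can be illustrated with a small Java sketch. This is not GATE code: the class and method names are hypothetical, and the denominators (correct + partial + spurious for precision, correct + partial + missing for recall) are an assumption based on the usual annotation-diff definitions rather than something stated above.

```java
/** Illustrative strict/lenient scores from match counts (hypothetical helper, not GATE code). */
class PrfScores {
    /** Safe division: returns 0 when the denominator is 0. */
    private static double ratio(double numerator, double denominator) {
        return denominator == 0 ? 0.0 : numerator / denominator;
    }

    /** Strict counts only correct matches; lenient also counts partially correct ones. */
    static double precision(int correct, int partial, int spurious, boolean lenient) {
        double hits = lenient ? correct + partial : correct;
        return ratio(hits, correct + partial + spurious); // assumed denominator
    }

    static double recall(int correct, int partial, int missing, boolean lenient) {
        double hits = lenient ? correct + partial : correct;
        return ratio(hits, correct + partial + missing); // assumed denominator
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return ratio(2 * precision * recall, precision + recall);
    }

    public static void main(String[] args) {
        double p = precision(8, 2, 0, false);
        double r = recall(8, 2, 0, false);
        System.out.println("strict P=" + p + " R=" + r + " F1=" + f1(p, r));
    }
}
```

With no partial matches the strict and lenient figures coincide, which is why the two sets of numbers only differ for chunk learning problems where boundaries can be partly right.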

Feature Files [#]

The Batch Learning PR is able to produce several feature ﬁles. These feature ﬁles could be used for evaluating learning algorithms not implemented in this plugin. We describe the formats of those feature ﬁles below. Note that all the data ﬁles described below can be obtained by setting the run time parameter learningMode to ‘ProduceFeatureFilesOnly’, but some may be produced as part of other learning modes.

The NLP feature ﬁle, named NLPFeatureData.save, contains the NLP features of the instances deﬁned in the conﬁguration ﬁle. Below is an example of the ﬁrst few lines of an NLP feature ﬁle for information extraction:

Class(es) Form(-1) Form(0) Form(1) Ortho(-1) Ortho(0) Ortho(1)
0 ft-airlines-27-jul-2001.xml 512
1 Number_BB _NA[-1] _Form_Seven _Form_UK[1] _NA[-1] _Ortho_upperInitial
         _Ortho_allCaps[1]
1 Country_BB _Form_Seven[-1] _Form_UK _Form_airlines[1] _Ortho_upperInitial[-1]
         _Ortho_allCaps _Ortho_lowercase[1]
0 _Form_UK[-1] _Form_airlines _Form_including[1] _Ortho_allCaps[-1]
         _Ortho_lowercase _Ortho_lowercase[1]
0 _Form_airlines[-1] _Form_including _Form_British[1] _Ortho_lowercase[-1]
         _Ortho_lowercase _Ortho_upperInitial[1]
1 Airline_BB _Form_including[-1] _Form_British _Form_Airways[1]
         _Ortho_lowercase[-1] _Ortho_upperInitial _Ortho_upperInitial[1]
1 Airline _Form_British[-1] _Form_Airways _Form_,[1] _Ortho_upperInitial[-1]
         _Ortho_upperInitial _NA[1]
0 _Form_Airways[-1] _Form_, _Form_Virgin[1] _Ortho_upperInitial[-1] _NA
         _Ortho_upperInitial[1]

The first line of the NLP feature file lists the names of all features used. These names are the names the user gave to their features in the configuration file. The number in parentheses following a feature name indicates the position of the feature. For example, ‘Form(-1)’ means the Form feature of the token immediately before the current token, and ‘Form(0)’ means the Form feature of the current token. The NLP features for all instances are listed for one document before moving on to the next. For each document, the first line shows the index of the document, the document’s name and the number of instances in the document, as shown in the second line above. After that, each line corresponds to an instance in the document, in order of appearance. The first item on the line is a number n, representing the number of class labels of the instance. The following n items are the labels. If the current instance is the first instance of an entity, its corresponding label has the suffix ‘_BB’. The items following the label item(s) are the NLP features of the instance, in the order listed in the first line of the file. Each NLP feature contains the feature’s name and value, separated by ‘_’. At the end of an NLP feature there may be an integer in square brackets, which represents the position of the feature relative to the current instance. If there is no square-bracketed integer at the end of an NLP feature, the feature is at position 0.
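As a rough illustration of this layout, the following Java sketch splits one instance line into its labels and feature items. The class name and fields are hypothetical, and the sketch assumes each instance occupies a single physical line (the example above is wrapped only for display).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative reader for one instance line of an NLP feature file (not part of GATE). */
class NlpFeatureLine {
    // A trailing "[n]" gives the position relative to the current instance.
    private static final Pattern POSITION = Pattern.compile("\\[(-?\\d+)\\]$");

    final List<String> labels = new ArrayList<>();        // labels with any "_BB" suffix removed
    final List<Boolean> startsEntity = new ArrayList<>(); // true where the label carried "_BB"
    final List<String> features = new ArrayList<>();      // raw items, e.g. "_Form_UK[1]"

    static NlpFeatureLine parse(String line) {
        String[] items = line.trim().split("\\s+");
        NlpFeatureLine result = new NlpFeatureLine();
        int n = Integer.parseInt(items[0]); // number of class labels
        for (int i = 1; i <= n; i++) {
            boolean first = items[i].endsWith("_BB");
            result.startsEntity.add(first);
            result.labels.add(first ? items[i].substring(0, items[i].length() - 3) : items[i]);
        }
        for (int i = n + 1; i < items.length; i++) {
            result.features.add(items[i]);
        }
        return result;
    }

    /** Returns the bracketed position of a feature item, or 0 if none is present. */
    static int position(String feature) {
        Matcher m = POSITION.matcher(feature);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        NlpFeatureLine x = parse("1 Country_BB _Form_Seven[-1] _Form_UK _Form_airlines[1]");
        System.out.println(x.labels + " " + x.startsEntity + " " + x.features);
    }
}
```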

The Feature vector ﬁle has the ﬁle name ‘featureVectorsData.save’, and stores the feature vector in sparse format for each instance. The ﬁrst few lines of the feature vector ﬁle corresponding to the NLP feature ﬁle shown above are as follows:

0 512 ft-airlines-27-jul-2001.xml
1 2 1 2 439:1.0 761:1.0 100300:1.0 100763:1.0
2 2 3 4 300:1.0 763:1.0 50439:1.0 50761:1.0 100440:1.0 100762:1.0
3 0 440:1.0 762:1.0 50300:1.0 50763:1.0 100441:1.0 100762:1.0
4 0 441:1.0 762:1.0 50440:1.0 50762:1.0 100020:1.0 100761:1.0
5 1 5 20:1.0 761:1.0 50441:1.0 50762:1.0 100442:1.0 100761:1.0
6 1 6 442:1.0 761:1.0 50020:1.0 50761:1.0 100066:1.0
7 0 66:1.0 50442:1.0 50761:1.0 100443:1.0 100761:1.0

The feature vectors are also listed for each document in sequence. For each document, the ﬁrst line shows the index of the document, the number of instances in the document and the document’s name. Each of the following lines is for each of the instances in the document. The ﬁrst item in the line is the index of the instance in the document. The second item is a number n, representing the number of labels the instance has. The following n items are indices representing the class labels.

For text classification and relation learning, the label’s index comes directly from the label list file, described below. For chunk learning, the label’s index presented in the feature vector file is a bit more complicated. If an instance (e.g. token) is the first one of a chunk with label k, then the instance has the label index 2 ∗ k − 1, as shown in the fifth instance. If it is the last instance of the chunk, it has the label index 2 ∗ k, as shown in the sixth instance. If the instance is both the first one and the last one of the chunk (namely the chunk consists of one instance), it has two label indices, 2 ∗ k − 1 and 2 ∗ k, as shown in the first and second instances.

The items following the label(s) are the non-zero components of the feature vector. Each component is represented by two numbers separated by ‘:’. The ﬁrst number is the dimension (position) of the component in the feature vector, and the second one is the value of the component.
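A minimal Java sketch of reading one instance line of the feature vector file, together with the chunk-learning index arithmetic described above (index 2k − 1 marks the start of a chunk with label k, index 2k its end). The class and method names are hypothetical and not part of the plugin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative reader for one instance line of a feature vector file (not part of GATE). */
class FeatureVectorLine {
    int instanceIndex;
    int[] labelIndices;
    // Non-zero components: dimension -> value, in file order.
    final Map<Integer, Double> components = new LinkedHashMap<>();

    static FeatureVectorLine parse(String line) {
        String[] items = line.trim().split("\\s+");
        FeatureVectorLine v = new FeatureVectorLine();
        v.instanceIndex = Integer.parseInt(items[0]);
        int n = Integer.parseInt(items[1]); // number of labels
        v.labelIndices = new int[n];
        for (int i = 0; i < n; i++) {
            v.labelIndices[i] = Integer.parseInt(items[2 + i]);
        }
        for (int i = 2 + n; i < items.length; i++) {
            int colon = items[i].indexOf(':');
            v.components.put(Integer.parseInt(items[i].substring(0, colon)),
                             Double.parseDouble(items[i].substring(colon + 1)));
        }
        return v;
    }

    /** Chunk learning: recovers k from a label index of the form 2k-1 or 2k. */
    static int chunkLabel(int labelIndex) {
        return (labelIndex + 1) / 2;
    }

    /** Chunk learning: odd indices (2k-1) mark the first instance of a chunk. */
    static boolean isChunkStart(int labelIndex) {
        return labelIndex % 2 == 1;
    }

    public static void main(String[] args) {
        FeatureVectorLine v = parse("5 1 5 20:1.0 761:1.0 50441:1.0");
        System.out.println("label k=" + chunkLabel(v.labelIndices[0]) + " " + v.components);
    }
}
```

For the fifth instance in the example, label index 5 maps back to k = 3, which is ‘Airline’ in the label list shown in the next listing.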

The Label list ﬁle has the name ‘LabelsList.save’, and stores a list of labels and their indices. The following is a part of a label list. Each line shows one label name and its index in the label list.

Airline 3
Bank 13
CalendarMonth 11
CalendarYear 10
Company 6
Continent 8
Country 2
CountryCapital 15
Date 21
DayOfWeek 4

The NLP feature list has the name ‘NLPFeaturesList.save’, and contains a list of NLP features and their indices in the list. The following are the ﬁrst few lines of an NLP feature list ﬁle.

totalNumDocs=14915
_EntityType_Date 13 1731
_EntityType_Location 170 1081
_EntityType_Money 523 3774
_EntityType_Organization 12 2387
_EntityType_Person 191 421
_EntityType_Unknown 76 218
_Form_’ 112 775
_Form_$ 527 74
_Form_’ 508 37
_Form_’s 63 731
_Form_( 526 111

The first line of the file shows the number of instances from which the NLP features were collected. The number of instances is used for computing the idf (inverse document frequency) in document or sentence classification. The following lines are for the NLP features. Each line is for one unique feature. The first item in the line represents the NLP feature, which is a combination of the feature’s name defined in the configuration file and the value of the feature. The second item is a positive integer representing the index of the feature in the list. The last item is the number of times that the feature occurs, which is needed for computing the idf.

The N-grams (or language model) file has the name ‘NgramList.save’, and can only be produced by setting the learning mode to ‘ProduceFeatureFilesOnly’. To produce n-gram data, the user may use a very simple configuration file: it need only contain the DATASET element, and the DATASET element need only contain an NGRAM element to specify the type of n-gram and the INSTANCE-TYPE element to define the annotation type from which the n-gram data are created (e.g. sentence). The NGRAM element in the configuration file specifies what type of n-grams the PR produces (see Section 19.2.1 for the explanation of the n-gram definition). For example, if you specify a bigram based on the string form of ‘Token’, you will obtain a list of bigrams from the corpus you used. The following are the first lines of a bigram list based on the ‘string’ feature of the ‘Token’ annotation, calculated over 3 documents.

## The following 2-gram were obtained from 3 documents or examples
Aug<>, 3
Female<>; 3
Human<>; 3
2004<>Aug 3
;<>Female 3
.<>The 3
of<>a 3
)<>: 3
,<>and 3
to<>be 3
;<>Human 3

The two terms of the bigram are separated by ‘<>’. The number following one n-gram is the number of occurrences of that n-gram in the corpus. The n-gram list is ordered according to the number of occurrences of the n-gram terms. The most frequent terms in the corpus are therefore at the start of the list.
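A short Java sketch of reading one line of this file, assuming the count is the last whitespace-separated item and the terms themselves contain no spaces. The class name is hypothetical; the ‘##’ header line should be skipped by the caller.

```java
import java.util.regex.Pattern;

/** Illustrative reader for one line of an n-gram list file (not part of GATE). */
class NgramEntry {
    final String[] terms; // the n terms, in order
    final int count;      // number of occurrences in the corpus

    NgramEntry(String[] terms, int count) {
        this.terms = terms;
        this.count = count;
    }

    static NgramEntry parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.lastIndexOf(' ');
        int count = Integer.parseInt(trimmed.substring(space + 1));
        // Terms are separated by the literal string "<>".
        String[] terms = trimmed.substring(0, space).split(Pattern.quote("<>"), -1);
        return new NgramEntry(terms, count);
    }

    public static void main(String[] args) {
        NgramEntry e = parse("2004<>Aug 3");
        System.out.println(String.join(" + ", e.terms) + " occurs " + e.count + " times");
    }
}
```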

The n-gram data produced can be based on any features of annotations available in the documents. Hence it can not only produce the conventional n-gram data based on the token’s form or lemma, but also n-grams based on e.g. the token’s POS, or a combination of the token’s POS and form, or any feature of the ‘sentence’ annotation (see Section 19.2.1 for how to deﬁne diﬀerent types of n-gram).

The Document-term matrix file has the name ‘documentByTermMatrix.save’, and can only be produced by setting the learning mode to ‘ProduceFeatureFilesOnly’. The document-term matrix presents the weights of terms appearing in each document (see Section 23.17 for more explanation). Currently three types of weight are implemented: binary, term frequency (tf) and tf-idf. The binary weight is simply 1 if the term appears in the document and 0 if it does not. tf (term frequency) is the number of occurrences of a term in a document. tf-idf is popular in information retrieval and text mining; it is the product of term frequency and inverse document frequency. Inverse document frequency is calculated as follows:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}$$

where |D| is the total number of documents in the corpus, and |{d_j : t_i ∈ d_j}| is the number of documents in which the term t_i appears. The type of weight is specified by the sub-element ValueTypeNgram in the DATASET element in the configuration file (see Section 19.2.1).
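The three weight types can be written out as a small Java sketch. The names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```java
/** Illustrative term weights for a document-term matrix (not part of GATE). */
class TfIdf {
    /** Binary weight: 1 if the term occurs in the document, 0 otherwise. */
    static int binary(int termFrequency) {
        return termFrequency > 0 ? 1 : 0;
    }

    /** idf = log(|D| / number of documents containing the term); natural log assumed. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    /** tf-idf is the product of term frequency and inverse document frequency. */
    static double tfIdf(int termFrequency, int totalDocs, int docsWithTerm) {
        return termFrequency * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // A term occurring twice in a document and found in 2 of 8 documents.
        System.out.println(tfIdf(2, 8, 2));
    }
}
```

A term that appears in every document has idf 0, so its tf-idf weight vanishes however often it occurs.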

As with the n-gram data, the user may use a very simple configuration file to produce the document-term matrix: it need only contain the DATASET element, and the DATASET element need only contain two elements, namely the INSTANCE-TYPE element, to define the annotation type from which the terms are counted, and an NGRAM element to specify the type of n-gram. As mentioned previously, the element ValueTypeNgram specifies the type of value used in the matrix. If it is not present, the default type tf-idf will be used. The conventional document-term matrix can be produced using a unigram based on the token’s form or lemma and an instance type covering the whole document. In other words, INSTANCE-TYPE is set to an annotation type such as ‘body’, which covers the entire document, and the n-gram definition then specifies the ‘string’ feature of the ‘Token’ annotation type.

The following was extracted from the beginning of a document-term matrix ﬁle, produced using unigrams of the token’s form. It presents a part of the matrix of terms and their term frequency values in the document named ‘27.xml’. Each term and its term frequency are separated by ‘:’. The terms are in alphabetic order.

0 Documentname="27.xml", has 1 parts: ":2 (:6 ):6 ,:14 -:1 .:16 /:1
124:1 2004:1 22:1 29:1 330:1 54:1 8:2 ::5 ;:11 Abstract:1 Adaptation:1
Adult:1 Atopic:2 Attachment:3 Aug:1 Bindungssicherheit:1 Cross-:1
Dermatitis:2 English:1 F-SOZU:1 Female:1 Human:1 In:1 Index:1
Insecure:1 Interpersonal:1 Irrespective:1 It:1 K-:1 Lebensqualitat:1
Life:1 Male:1 NSI:2 Neurodermitis:2 OT:1 Original:1 Patients:1
Psychological:1 Psychologie:1 Psychosomatik:1 Psychotherapie:1
Quality:1 Questionnaire:1 RSQ:1 Relations:1 Relationship:1 SCORAD:1
Scales:1 Sectional:1 Securely:1 Severity:2 Skindex-:1 Social:1
Studies:1 Suffering:1 Support:1 The:1 Title:1 We:3 [:1 ]:1 a:4
absence:1 affection:1 along:2 amount:1 an:1 and:9 as:1 assessed:1
association:2 atopic:5 attached:7
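Because a term may itself be a colon (as in ‘::5’ above), a reader of this file should split each item at its last ‘:’ rather than its first. The following Java sketch shows this; the class name is hypothetical, and it handles a single ‘term:weight’ item with an integer weight, as in the term-frequency example.

```java
/** Illustrative reader for one "term:frequency" item of a document-term matrix file. */
class DocTermEntry {
    /** Everything before the last ':' is the term (the term may itself contain ':'). */
    static String term(String item) {
        return item.substring(0, item.lastIndexOf(':'));
    }

    /** Everything after the last ':' is the term frequency. */
    static int frequency(String item) {
        return Integer.parseInt(item.substring(item.lastIndexOf(':') + 1));
    }

    public static void main(String[] args) {
        for (String item : "::5 ;:11 atopic:5".split(" ")) {
            System.out.println("'" + term(item) + "' -> " + frequency(item));
        }
    }
}
```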

A list of names of documents processed can also be obtained. The file has the name ‘docsName.save’, and can only be produced by setting the learning mode to ‘ProduceFeatureFilesOnly’. It contains the names of all the documents processed. The first line shows the number of documents in the list. Then, each line lists one document’s name. The first lines of an example file are shown below:

##totalDocs=3
ft-bank-of-england-02-aug-2001.xml
ft-airtours-08-aug-2001.xml
ft-airlines-27-jul-2001.xml

A list of names of the selected documents for active learning purposes can also be produced. The ﬁle has the name ‘ALSelectedDocs.save’. It is a text ﬁle. It is produced in ‘ProduceFeatureFilesOnly’ mode. The ﬁle contains the names of documents which have been selected for annotating and training in the active learning process. It is used by the ‘RankingDocsForAL’ learning mode to exclude those selected documents from the ranked documents for active learning purposes. When one or more documents are selected for annotating and training, their names should be put into this ﬁle, one line per document.

A list of names of ranked documents for active learning purposes can also be produced. The file has the name ‘ALRankedDocs.save’, and is produced in ‘RankingDocsForAL’ mode. The file contains the names of the documents ranked for active learning, according to their usefulness for learning. Those at the front of the list are the most useful documents for learning. The first line in the file shows the total number of documents in the list. Each of the other lines lists one document and the averaged confidence score for classifying the document. An example of the file is shown below:

##numDocsRanked=3
ft-airlines-27-jul-2001.xml_000201 8.61744
ft-bank-of-england-02-aug-2001.xml_000221 8.672693
ft-airtours-08-aug-2001.xml_000211 9.82562

19.2.5 Using the Batch Learning PR from the API [#]

Using the Batch Learning PR from the API is a simple matter if you have some familiarity with GATE Embedded. Chapter 7 provides a more comprehensive introduction to programming with GATE Embedded, and should be consulted for any general points. There is also a complete example program on the code examples page.

The following snippet creates a pipeline application with a corpus, then creates a Batch Learning PR and adds it to the application. The location of the configuration file and the mode in which the PR is to be run are passed to the PR. The application is then run. ‘corpus’ is a GATE corpus that you have previously set up. (To learn more about creating a corpus from GATE Embedded, see Chapter 7 or the example on the code examples page.)

import java.io.File;
import gate.Factory;
import gate.FeatureMap;
import gate.learning.RunMode;

File configFile = new File("/home/you/ml_config.xml"); // wherever it is
RunMode mode = RunMode.EVALUATION; // or TRAINING, or APPLICATION ...

// Make a pipeline and add the corpus
FeatureMap pfm = Factory.newFeatureMap();
pfm.put("corpus", corpus);
gate.creole.SerialAnalyserController pipeline =
    (gate.creole.SerialAnalyserController)
    gate.Factory.createResource(
        "gate.creole.SerialAnalyserController", pfm);

// Set up the PR and add it to the pipeline.
// As with using the PR from GATE Developer, it needs a config file
// and a mode.
FeatureMap fm = Factory.newFeatureMap();
fm.put("configFileURL", configFile.toURI().toURL());
fm.put("learningMode", mode);
gate.learning.LearningAPIMain learner =
    (gate.learning.LearningAPIMain)
    gate.Factory.createResource("gate.learning.LearningAPIMain", fm);
pipeline.add(learner);

// Run it!
pipeline.execute();

Having run the PR in EVALUATION mode, you can access the results programmatically:

EvaluationBasedOnDocs ev = learner.getEvaluation();
System.out.println(
    ev.macroMeasuresOfResults.precision + "," +
    ev.macroMeasuresOfResults.recall + "," +
    ev.macroMeasuresOfResults.f1 + "," +
    ev.macroMeasuresOfResults.precisionLenient + "," +
    ev.macroMeasuresOfResults.recallLenient + "," +
    ev.macroMeasuresOfResults.f1Lenient + "\n");

19.3 Machine Learning PR [#]

The ‘Machine Learning PR’ is GATE’s earlier machine learning PR. It handles both the training and the application of ML models on GATE documents. This PR is a Language Analyser, so it can be used in all default types of GATE controllers. It can be found in the ‘Machine_Learning’ plugin.

In order to allow for more ﬂexibility, all the conﬁguration parameters for the Machine Learning PR are set through an external XML ﬁle and not through the normal PR parameterisation. The root element of the ﬁle needs to be called ‘ML-CONFIG’ and it contains two elements: ‘DATASET’ and ‘ENGINE’. An example XML conﬁguration ﬁle is given in Section 19.3.6.

19.3.1 The DATASET Element

The DATASET element deﬁnes the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An ‘INSTANCE-TYPE’ element is used to select the annotation type to be used for instances, and the attributes are deﬁned by a sequence of ‘ATTRIBUTE’ elements.

For example, if ‘INSTANCE-TYPE’ has the value ‘Token’, there will be one instance in the dataset per ‘Token’. This also means that the positions (see below) are defined in relation to Tokens. The ‘INSTANCE-TYPE’ can be seen as the smallest unit to be taken into account for the machine learning.

An ATTRIBUTE element has the following sub-elements:

NAME: the name of the attribute
TYPE: the annotation type used to extract the attribute.
FEATURE (optional): if present, the value of the attribute will be the value of the named feature on the annotation of speciﬁed type.
POSITION: the position of the annotation used to extract the feature relative to the current instance annotation.
VALUES (optional): includes a list of VALUE elements.
<CLASS/>: an empty element used to mark the class attribute. There can only be one attribute marked as class in a dataset deﬁnition.

Because the contents of the VALUE elements are XML character data, the characters <, > and & must be replaced by &lt;, &gt; and &amp; respectively. It is recommended to write the XML configuration file in UTF-8 so that uncommon characters are parsed correctly.
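When configuration files are generated programmatically, the escaping can be done with a few string replacements. The following Java sketch (a hypothetical helper, not part of GATE) replaces &, < and > with the predefined XML entities &amp;amp;, &amp;lt; and &amp;gt; (that is, the five-, four- and four-character sequences starting with an ampersand); the ampersand must be handled first so that the entities introduced by the other replacements are not escaped again.

```java
/** Illustrative escaping of text for use inside a VALUE element (not part of GATE). */
class XmlEscape {
    static String escape(String text) {
        // '&' first, otherwise the '&' of "&lt;" and "&gt;" would be escaped twice.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println("<VALUE>" + escape("AT&T <b>") + "</VALUE>");
    }
}
```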

Semantically, there are three types of attributes:

nominal attributes: both type and features are deﬁned and a list of allowed values is provided;
numeric: both type and features are deﬁned but no list of allowed values is provided; it is assumed that the feature can be converted to a number (a double value).
boolean: no feature or list of values is provided; the attribute will take one of the ‘true’ or ‘false’ values based on the presence (or absence) of the speciﬁed annotation type at the required position.
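The three cases can be summarised as a decision on which sub-elements are present. The Java sketch below is only an illustration with hypothetical names; the combination of a VALUES list without a FEATURE is not covered by the description above, so the sketch rejects it.

```java
/** Illustrative classification of an ATTRIBUTE by the sub-elements it defines. */
class AttributeKind {
    enum Kind { NOMINAL, NUMERIC, BOOLEAN }

    static Kind of(boolean hasFeature, boolean hasValues) {
        if (hasFeature && hasValues) {
            return Kind.NOMINAL;  // feature plus list of allowed values
        }
        if (hasFeature) {
            return Kind.NUMERIC;  // feature only; value must convert to a double
        }
        if (!hasValues) {
            return Kind.BOOLEAN;  // presence or absence of the annotation type
        }
        // Assumption: VALUES without FEATURE is treated as a configuration error here.
        throw new IllegalArgumentException("VALUES given without FEATURE");
    }

    public static void main(String[] args) {
        System.out.println(of(true, true) + " " + of(true, false) + " " + of(false, false));
    }
}
```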

Figure 19.1 gives some examples of what the values of speciﬁed attributes would be in a situation when ‘Token’ annotations are used as instances.

Figure 19.1: Sample attributes and their values

An ATTRIBUTELIST element is similar to ATTRIBUTE except that it has a RANGE sub-element instead of a POSITION sub-element. It will be converted into several ATTRIBUTE elements, with positions ranging from the value of the RANGE attribute ‘from’ to the value of the attribute ‘to’. This can be used to avoid duplicating ATTRIBUTE elements.
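The expansion can be pictured with the following Java sketch, which turns one attribute description and a range into one attribute per position. The classes are hypothetical, and treating both ends of the range as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative expansion of an ATTRIBUTELIST range into single-position attributes. */
class AttributeListExpander {
    static class Attribute {
        final String name;
        final String type;
        final String feature;
        final int position;

        Attribute(String name, String type, String feature, int position) {
            this.name = name;
            this.type = type;
            this.feature = feature;
            this.position = position;
        }
    }

    /** One Attribute per position in [from, to]; both ends assumed inclusive. */
    static List<Attribute> expand(String name, String type, String feature, int from, int to) {
        List<Attribute> attributes = new ArrayList<>();
        for (int position = from; position <= to; position++) {
            attributes.add(new Attribute(name, type, feature, position));
        }
        return attributes;
    }

    public static void main(String[] args) {
        for (Attribute a : expand("Form", "Token", "string", -2, 2)) {
            System.out.println(a.name + " at position " + a.position);
        }
    }
}
```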

19.3.2 The ENGINE Element

The ENGINE element deﬁnes which particular ML implementation will be used, and allows the setting of options for that particular implementation.

The ENGINE element has three sub-elements:

WRAPPER: deﬁnes the class name for the ML implementation (or implementation wrapper). The speciﬁed class needs to extend gate.creole.ml.MLEngine.
BATCH-MODE-CLASSIFICATION: this element is optional. If present (as an empty element <BATCH-MODE-CLASSIFICATION />), the training instances will be passed to the engine in a single batch. If absent, the instances are passed to the engine one at a time. Not every engine supports this option, but for those that do, it can greatly improve performance.
OPTIONS: the contents of the OPTIONS element will be passed verbatim to the ML engine used.

19.3.3 The WEKA Wrapper

The PR provides a wrapper for the WEKA ML Library (http://www.cs.waikato.ac.nz/ml/weka/) in the form of the gate.creole.ml.weka.Wrapper class.

Options for the WEKA Wrapper

The WEKA wrapper accepts the following options:

CLASSIFIER: the class name for the classiﬁer to be used.
CLASSIFIER-OPTIONS: the options string as required for the classiﬁer.
CONFIDENCE-THRESHOLD: a double value. If the classifier can provide a probability distribution rather than a simple classification, then all possible classifications that have a probability greater than or equal to the confidence threshold will be considered.
DATASET-FILE: location of the WEKA ARFF file. This item is not mandatory; it is also possible to specify the file using the saving option in the GUI.
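The effect of CONFIDENCE-THRESHOLD can be sketched in Java as a filter over a class probability distribution. This is an illustration with hypothetical names, not the wrapper's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative confidence-threshold filter over a class probability distribution. */
class ConfidenceFilter {
    /** Returns every class whose probability is at least the threshold. */
    static List<String> accepted(Map<String, Double> distribution, double threshold) {
        List<String> classes = new ArrayList<>();
        for (Map.Entry<String, Double> entry : distribution.entrySet()) {
            if (entry.getValue() >= threshold) {
                classes.add(entry.getKey());
            }
        }
        return classes;
    }

    public static void main(String[] args) {
        Map<String, Double> distribution = new LinkedHashMap<>();
        distribution.put("Person", 0.55);
        distribution.put("Organization", 0.40);
        distribution.put("Location", 0.05);
        System.out.println(accepted(distribution, 0.4));
    }
}
```

Note that with a low threshold more than one classification may be accepted for the same instance.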

Training an ML Model with the WEKA Wrapper

The Machine Learning PR has a Boolean runtime parameter named "training". When the value of this parameter is set to true, the PR will collect a dataset of instances from the documents on which it is run. If the classiﬁer used is an updatable classiﬁer then the ML model will be built while collecting the dataset. If the selected classiﬁer is not updatable, then the model will be built the ﬁrst time a classiﬁcation is attempted.

Training a model consists of designing a deﬁnition ﬁle for the ML PR, and creating an application containing a Machine Learning PR. When the application is run over a corpus, the dataset (and the model if possible) is built.

Applying a Learnt Model

Using the same PR, set the ‘training’ parameter to false and run your application.

Depending on the type of the attribute that is marked as class, diﬀerent actions will be performed when a classiﬁcation occurs:

if the attribute is boolean, a new annotation of the speciﬁed type will be created with no features;
if the attribute is nominal or numeric, a new annotation of the speciﬁed type will be created with the feature named in the attribute deﬁnition having the value predicted by the classiﬁer.

Once a model is learnt, it can be saved and reloaded at a later time. The WEKA wrapper also provides an operation for saving only the dataset in the ARFF format, which can be used for experiments in the WEKA interface. This could be useful for determining the best algorithm to be used and the optimal options for the selected algorithm.

19.3.4 The MAXENT Wrapper [#]

GATE also provides a wrapper for the Open NLP MAXENT library (http://maxent.sourceforge.net/about.html). The MAXENT library provides an implementation of the maximum entropy learning algorithm, and can be accessed using the gate.creole.ml.maxent.MaxentWrapper class.

The MAXENT library requires all attributes except for the class attribute to be boolean, and that the class attribute be boolean or nominal. (It should be noted that, within maximum entropy terminology, the class attribute is called the ‘outcome’.) Because the MAXENT library does not provide a speciﬁc format for data sets, there is no facility to save or load data sets separately from the model, but if there should be a need to do this, the WEKA wrapper can be used to collect the data.

Training a MAXENT model follows the same general procedure as for WEKA models, but the following diﬀerence should be noted. MAXENT models are not updateable, so the model will always be created and trained the ﬁrst time a classiﬁcation is attempted. The training of the model might take a considerable amount of time, depending on the amount of training data and the parameters of the model.

Options for the MAXENT Wrapper

CUT-OFF: MAXENT features will only be included in the model if they occur at least this many times. (The default value of this parameter is zero.)
ITERATIONS: The number of times the training procedure should iterate when finding the model’s parameters (default is 10). In general no more than about 100 iterations should be needed to train a model, and it is recommended that fewer are used during development to allow for shorter training times.
CONFIDENCE-THRESHOLD: Same as for the WEKA wrapper (see above). However, if this parameter is not set, or is set to zero, the model will not use a conﬁdence threshold, but will simply return the most likely classiﬁcation.
SMOOTHING: Use smoothing when training the model. Smoothing can improve the accuracy of the learned models, but it will result in longer training times, and training will use more memory. The size of the learned models will also be larger. Generally smoothing will only improve performance for those models trained from small data sets with a few outcomes. With larger data sets with lots of outcomes, it may make performance worse.
SMOOTHING-OBSERVATION: When using smoothing, this specifies the number of times that the trainer will imagine that it has seen features which it did not see (default value is 0.1).
VERBOSE: If selected, this will cause the classiﬁer to output more details of its operation during execution.
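
The effect of CUT-OFF can be pictured with a short sketch. This is not the MAXENT library's code. It is a minimal Python illustration, and the function name and the data layout (each training instance as a set of active boolean feature names) are assumptions made for the example.

```python
from collections import Counter


def apply_cutoff(instances, cutoff):
    """Keep only features that are active in at least `cutoff` instances.

    `instances` is a list of sets of active (true) boolean feature names;
    this layout is an assumption for illustration only.
    """
    counts = Counter(feature for active in instances for feature in active)
    kept = {feature for feature, n in counts.items() if n >= cutoff}
    return [active & kept for active in instances]


training = [{"cap", "in_gazetteer"}, {"cap"}, {"cap", "digit"}]
filtered = apply_cutoff(training, cutoff=2)   # only "cap" occurs twice or more
unfiltered = apply_cutoff(training, cutoff=0)  # the default keeps everything
```

With the default cut-off of zero no feature is discarded, which matches the description of the option above.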

19.3.5 The SVM Light Wrapper

The PR provides a wrapper for the SVM Light ML system (http://svmlight.joachims.org). SVM Light is a support vector machine implementation, written in C, which is provided as a set of command line programs. The wrapper takes care of the mundane work of converting the data structures between GATE and SVM Light formats, and calls the command line programs in the right sequence, passing the data back and forth in temporary ﬁles. The <WRAPPER> value for this engine is gate.creole.ml.svmlight.SVMLightWrapper.

The SVM Light binaries themselves are not distributed with GATE – you should download the version for your platform from http://svmlight.joachims.org and place svm_learn and svm_classify on your path.

Classifying documents using the SVMLightWrapper is a two-phase procedure. In the first phase, SVMWrapper collects training data from the pre-annotated documents. In the second phase, it builds the SVM model from the collected data and uses it to classify the unseen documents. Below we briefly describe an example of classifying the start time of a seminar in a corpus of emails announcing seminars; more details are provided later in the section.

Figure 19.2 explains, step by step, the process of collecting training data for the SVM classifier. GATE documents that are pre-annotated with annotations of type Class and feature type='stime' are used as the training data. In order to build the SVM model, we require start and end annotations for each stime annotation. We use a pre-processor JAPE transduction script to mark the sTimeStart and sTimeEnd annotations on stime annotations. Following this step, the Machine Learning PR (SVMLightWrapper) with training mode set to true collects the training data from all training documents. A GATE corpus pipeline, given a set of documents and PRs to execute on them, executes all the PRs in sequence on one document at a time. Within a single pipeline it is therefore impossible to send the training data collected from all documents to the SVMWrapper together, so the SVM model is not built at the time the training data is collected. The state of the SVMWrapper can be saved to an external file once the training data has been collected.

Figure 19.2: Flow diagram explaining the SVM training data collection

Before any unseen document can be classified, an SVM model must be available. In the absence of an up-to-date SVM model, SVMWrapper builds a new one using the command-line svm_learn utility and the training data collected from the training corpus. In other words, the first SVM model is built when the user tries to classify the first document. At this point the user has the option of saving the model. This allows the model to be reloaded before classifying other documents and avoids rebuilding it every time the user classifies a new set of documents. Once the model is available, SVMWrapper classifies the unseen documents, creating new sTimeStart and sTimeEnd annotations over the text. Finally, a post-processor JAPE transduction script is used to combine them into the sTime annotation. Figure 19.3 explains this process.

Figure 19.3: Flow diagram explaining document classifying process
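
The build-on-first-use behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the wrapper's code: the class name LazyModel, its methods, and the train_fn callback (standing in for a call to svm_learn) are all hypothetical.

```python
class LazyModel:
    """Train on the first classification; retrain only after new data."""

    def __init__(self, train_fn):
        self._train_fn = train_fn   # hypothetical stand-in for svm_learn
        self._instances = []
        self._model = None          # None means missing or out of date

    def add_training_instance(self, instance):
        self._instances.append(instance)
        self._model = None          # new data makes any old model stale

    def classify(self, instance):
        if self._model is None:
            if not self._instances:
                raise RuntimeError("no training instances collected")
            self._model = self._train_fn(list(self._instances))
        return self._model(instance)


calls = []  # records how often training actually runs


def train(instances):
    calls.append(len(instances))
    threshold = sum(instances) / len(instances)
    return lambda x: x > threshold


model = LazyModel(train)
for value in (1.0, 3.0):
    model.add_training_instance(value)  # collection phase: nothing is trained
first = model.classify(2.5)             # triggers the one and only training run
second = model.classify(1.5)            # reuses the model built above
```

Saving the model after the first classification plays the same role as keeping the trained object in memory here: later classifications skip the training step.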

The wrapper allows support vector machines to be created which do either boolean classiﬁcation or regression (estimation of numeric parameters), and so the class attribute can be boolean or numeric. Additionally, when learning a classiﬁer, SVM Light supports transduction, whereby additional examples can be presented during training which do not have the value of the class attribute marked. Presenting such examples can, in some circumstances, greatly improve the performance of the classiﬁer. To make use of this, the class attribute can be a three value nominal, in which case the ﬁrst value speciﬁed for that nominal in the conﬁguration ﬁle will be interpreted as true, the second as false and the third as unknown. Transduction will be used with any instances for which this attribute is set to the unknown value. It is also possible to use a two value nominal as the class attribute, in which case it will simply be interpreted as true or false.
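
The interpretation of a nominal class attribute can be sketched as follows. SVM Light's input format uses +1 for a positive example, -1 for a negative one and 0 for an unlabelled example to be handled by transduction. Whether the wrapper produces exactly these labels internally is an assumption, and the function name and the example values are hypothetical.

```python
def svm_light_target(value, nominal_values):
    """Map a nominal class value to an SVM Light target label.

    `nominal_values` lists the class values in configuration-file order:
    first = true, second = false, optional third = unknown.
    """
    if not 2 <= len(nominal_values) <= 3:
        raise ValueError("class nominal must have two or three values")
    index = nominal_values.index(value)
    # +1 positive, -1 negative, 0 unlabelled (used for transduction).
    return (1, -1, 0)[index]


values = ["yes", "no", "unlabelled"]  # hypothetical class values
targets = [svm_light_target(v, values) for v in ["yes", "unlabelled", "no"]]
```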

The other attributes can be boolean, numeric or nominal, or any combination of these. If an attribute is nominal, each value of that attribute maps to a separate SVM Light feature. Each of these SVM Light features will be given the value 1 when the nominal attribute has the corresponding value, and will be omitted otherwise. If the value of the nominal is not speciﬁed in the conﬁguration ﬁle or there is no value for an instance, then no feature will be added.
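
A minimal sketch of this mapping, assuming that feature indices are assigned in the order the values are listed in the configuration file (the numbering scheme and the function name are assumptions, not taken from the wrapper):

```python
def nominal_features(attribute_values, instance_value, first_index=1):
    """Return {feature_index: 1} for one nominal attribute of one instance.

    Each configured value owns one feature index. Only the index matching
    the instance's value is emitted; an unlisted or missing value emits
    no feature at all.
    """
    if instance_value not in attribute_values:
        return {}
    return {first_index + attribute_values.index(instance_value): 1}


major_types = ["address", "date", "location"]
known = nominal_features(major_types, "date")     # second value -> index 2
unlisted = nominal_features(major_types, "sport")  # not in the configuration
missing = nominal_features(major_types, None)      # instance has no value
```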

An extension to the basic functionality of SVM Light is that each attribute can receive a weighting. These weightings can be specified in the configuration file by adding <WEIGHTING> tags to the parts of the XML file specifying each attribute. The weighting for the attribute must be specified as a numeric value, placed between an opening <WEIGHTING> tag and a closing </WEIGHTING> one. Giving an attribute a greater weighting will cause it to play a greater role in learning the model and classifying data. This is achieved by multiplying the value of the attribute by the weighting before creating the training or test data that is passed to SVM Light. Any attribute left without an explicitly specified weighting is given a default weighting of one. Support for these weightings is contained in the Machine Learning PR itself, and so is available to other wrappers, though at the time of writing only the SVM Light wrapper makes use of weightings.
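
The multiplication step can be illustrated with a small sketch that renders one example in SVM Light's input format (a target followed by index:value pairs in increasing index order). The function name is hypothetical, and keying the weightings by feature index rather than by GATE attribute is a simplification made for the example.

```python
def format_example(target, features, weights=None):
    """Render one classification example as an SVM Light input line.

    `features` maps feature index -> value; `weights` maps feature index ->
    weighting, with a default weighting of 1 for any index not listed.
    """
    weights = weights or {}
    parts = [f"{target:+d}" if target else "0"]
    for index in sorted(features):  # SVM Light expects increasing indices
        value = features[index] * weights.get(index, 1)
        parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Feature 3 carries a weighting of 2.5; feature 1 keeps the default of 1.
line = format_example(1, {3: 1, 1: 0.5}, weights={3: 2.5})
```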

As with the MAXENT wrapper, SVM Light models are not updateable, so the model will be trained at the ﬁrst classiﬁcation attempt. The SVM Light wrapper supports <BATCH-MODE-CLASSIFICATION />, which should be used unless you have a very good reason not to.

The SVM Light wrapper allows both data sets and models to be loaded and saved to ﬁles in the same formats as those used by SVM Light when it is run from the command line. When a model is saved, a ﬁle will be created which contains information about the state of the SVM Light Wrapper, and which is needed to restore it when the model is loaded again. This ﬁle does not, however, contain any information about the SVM Light model itself. If an SVM Light model exists at the time of saving, and that model is up to date with respect to the current state of the training data, then it will be saved as a separate ﬁle, with the same name as the ﬁle containing information about the state of the wrapper, but with .NativePart appended to the ﬁlename. These ﬁles are in the standard SVM Light model format, and can be used with SVM Light when it is run from the command line. When a model is reloaded by GATE, both of these ﬁles must be available, and in the same directory, otherwise an error will result. However, if an up to date trained model does not exist at the time the model is saved, then only one ﬁle will be created upon saving, and only that ﬁle is required when the model is reloaded. So long as at least one training instance exists, it is possible to bring the model up to date at any point simply by classifying one or more instances (i.e. running the model with the training parameter set to false).
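
The file-naming rule can be summarised in a short sketch. The function name and the example path are hypothetical; only the .NativePart suffix comes from the description above.

```python
from pathlib import Path


def files_needed_to_reload(state_file, model_was_up_to_date):
    """List the files that must sit together when a saved model is reloaded.

    `state_file` is the path chosen when saving. The native SVM Light model,
    if one was written, has the same name with `.NativePart` appended.
    """
    state_path = Path(state_file)
    needed = [state_path]
    if model_was_up_to_date:
        needed.append(state_path.with_name(state_path.name + ".NativePart"))
    return needed


paths = files_needed_to_reload("models/stime.ser", model_was_up_to_date=True)
names = [p.name for p in paths]
```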

Options for the SVM Light Engine

Only one <OPTIONS> subelement is currently supported:

<CLASSIFIER-OPTIONS> a string of options to be passed to svm_learn on the command line. These are the same options that svm_learn accepts when it is run directly, except that the user should not specify whether regression or classification is to be used: the wrapper detects this automatically from the type of the class attribute and sets the option accordingly.
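
As a rough sketch of how such an option string might be combined with the automatically chosen mode, the Python below builds an svm_learn argument list. In svm_learn, the -z flag selects classification (c) or regression (r). How the wrapper actually assembles its command is not documented here, so the function name, the default file names and the decision to reject a user-supplied -z are all assumptions.

```python
import shlex


def svm_learn_command(classifier_options, class_is_numeric,
                      data_file="train.dat", model_file="model.dat"):
    """Build an svm_learn argument list from a CLASSIFIER-OPTIONS string."""
    user_args = shlex.split(classifier_options)
    if "-z" in user_args:
        raise ValueError("do not set -z; it is chosen from the class attribute")
    # A numeric class attribute means regression, otherwise classification.
    mode = "r" if class_is_numeric else "c"
    return ["svm_learn", "-z", mode, *user_args, data_file, model_file]


# Hypothetical option string: trade-off parameter 0.7 with a linear kernel.
command = svm_learn_command("-c 0.7 -t 0", class_is_numeric=False)
```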

19.3.6 Example Configuration File

<?xml version="1.0" encoding="UTF-8"?>
<ML-CONFIG>
  <DATASET>
  
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    
    <NAME>Lookup(0)</NAME>
    
    <TYPE>Lookup</TYPE>
    
    <POSITION>0</POSITION>
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>Lookup_MT(-1)</NAME>
    
    <TYPE>Lookup</TYPE>
    
    <FEATURE>majorType</FEATURE>

    
    <POSITION>-1</POSITION>
    
    <VALUES>
      
      <VALUE>address</VALUE>
      <VALUE>cdg</VALUE>
      <VALUE>country_adj</VALUE>
      <VALUE>currency_unit</VALUE>
      <VALUE>date</VALUE>
      <VALUE>date_key</VALUE>
      <VALUE>date_unit</VALUE>
      <VALUE>facility</VALUE>
      <VALUE>facility_key</VALUE>
      <VALUE>facility_key_ext</VALUE>
      <VALUE>govern_key</VALUE>
      <VALUE>greeting</VALUE>
      <VALUE>ident_key</VALUE>
      <VALUE>jobtitle</VALUE>
      <VALUE>loc_general_key</VALUE>
      <VALUE>loc_key</VALUE>
      <VALUE>location</VALUE>
      <VALUE>number</VALUE>
      <VALUE>org_base</VALUE>
      <VALUE>org_ending</VALUE>
      <VALUE>org_key</VALUE>
      <VALUE>org_pre</VALUE>
      <VALUE>organization</VALUE>
      <VALUE>organization_noun</VALUE>
      <VALUE>percent</VALUE>
      <VALUE>person_ending</VALUE>
      <VALUE>person_first</VALUE>
      <VALUE>person_full</VALUE>
      <VALUE>phone_prefix</VALUE>
      <VALUE>sport</VALUE>
      <VALUE>spur</VALUE>
      <VALUE>spur_ident</VALUE>
      <VALUE>stop</VALUE>
      <VALUE>surname</VALUE>
      <VALUE>time</VALUE>
      <VALUE>time_modifier</VALUE>
      <VALUE>time_unit</VALUE>
      <VALUE>title</VALUE>
      <VALUE>year</VALUE>
    </VALUES>
    
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>Lookup_MT(0)</NAME>
    
    <TYPE>Lookup</TYPE>
    
    <FEATURE>majorType</FEATURE>

    
    <POSITION>0</POSITION>
    
    <VALUES>
      
      <VALUE>address</VALUE>
      <VALUE>cdg</VALUE>
      <VALUE>country_adj</VALUE>
      <VALUE>currency_unit</VALUE>
      <VALUE>date</VALUE>
      <VALUE>date_key</VALUE>
      <VALUE>date_unit</VALUE>
      <VALUE>facility</VALUE>
      <VALUE>facility_key</VALUE>
      <VALUE>facility_key_ext</VALUE>
      <VALUE>govern_key</VALUE>
      <VALUE>greeting</VALUE>
      <VALUE>ident_key</VALUE>
      <VALUE>jobtitle</VALUE>
      <VALUE>loc_general_key</VALUE>
      <VALUE>loc_key</VALUE>
      <VALUE>location</VALUE>
      <VALUE>number</VALUE>
      <VALUE>org_base</VALUE>
      <VALUE>org_ending</VALUE>
      <VALUE>org_key</VALUE>
      <VALUE>org_pre</VALUE>
      <VALUE>organization</VALUE>
      <VALUE>organization_noun</VALUE>
      <VALUE>percent</VALUE>
      <VALUE>person_ending</VALUE>
      <VALUE>person_first</VALUE>
      <VALUE>person_full</VALUE>
      <VALUE>phone_prefix</VALUE>
      <VALUE>sport</VALUE>
      <VALUE>spur</VALUE>
      <VALUE>spur_ident</VALUE>
      <VALUE>stop</VALUE>
      <VALUE>surname</VALUE>
      <VALUE>time</VALUE>
      <VALUE>time_modifier</VALUE>
      <VALUE>time_unit</VALUE>
      <VALUE>title</VALUE>
      <VALUE>year</VALUE>
    </VALUES>
    
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>Lookup_MT(1)</NAME>
    
    <TYPE>Lookup</TYPE>
    
    <FEATURE>majorType</FEATURE>

    
    <POSITION>1</POSITION>

    
    <VALUES>
      
      <VALUE>address</VALUE>
      <VALUE>cdg</VALUE>
      <VALUE>country_adj</VALUE>
      <VALUE>currency_unit</VALUE>
      <VALUE>date</VALUE>
      <VALUE>date_key</VALUE>
      <VALUE>date_unit</VALUE>
      <VALUE>facility</VALUE>
      <VALUE>facility_key</VALUE>
      <VALUE>facility_key_ext</VALUE>
      <VALUE>govern_key</VALUE>
      <VALUE>greeting</VALUE>
      <VALUE>ident_key</VALUE>
      <VALUE>jobtitle</VALUE>
      <VALUE>loc_general_key</VALUE>
      <VALUE>loc_key</VALUE>
      <VALUE>location</VALUE>
      <VALUE>number</VALUE>
      <VALUE>org_base</VALUE>
      <VALUE>org_ending</VALUE>
      <VALUE>org_key</VALUE>
      <VALUE>org_pre</VALUE>
      <VALUE>organization</VALUE>
      <VALUE>organization_noun</VALUE>
      <VALUE>percent</VALUE>
      <VALUE>person_ending</VALUE>
      <VALUE>person_first</VALUE>
      <VALUE>person_full</VALUE>
      <VALUE>phone_prefix</VALUE>
      <VALUE>sport</VALUE>
      <VALUE>spur</VALUE>
      <VALUE>spur_ident</VALUE>
      <VALUE>stop</VALUE>
      <VALUE>surname</VALUE>
      <VALUE>time</VALUE>
      <VALUE>time_modifier</VALUE>
      <VALUE>time_unit</VALUE>
      <VALUE>title</VALUE>
      <VALUE>year</VALUE>
    </VALUES>
    
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>POS_category(-1)</NAME>
    
    <TYPE>Token</TYPE>
    
    <FEATURE>category</FEATURE>

    
    <POSITION>-1</POSITION>

    
    <VALUES>
      
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
        <VALUE>NNS</VALUE>
        <VALUE>NP</VALUE>
        <VALUE>NPS</VALUE>
        <VALUE>JJ</VALUE>
        <VALUE>JJR</VALUE>
        <VALUE>JJS</VALUE>
        <VALUE>JJSS</VALUE>
        <VALUE>RB</VALUE>
        <VALUE>RBR</VALUE>
        <VALUE>RBS</VALUE>
        <VALUE>VB</VALUE>
        <VALUE>VBD</VALUE>
        <VALUE>VBG</VALUE>
        <VALUE>VBN</VALUE>
        <VALUE>VBP</VALUE>
        <VALUE>VBZ</VALUE>
        <VALUE>FW</VALUE>
        <VALUE>CD</VALUE>
        <VALUE>CC</VALUE>
        <VALUE>DT</VALUE>
        <VALUE>EX</VALUE>
        <VALUE>IN</VALUE>
        <VALUE>LS</VALUE>
        <VALUE>MD</VALUE>
        <VALUE>PDT</VALUE>
        <VALUE>POS</VALUE>
        <VALUE>PP</VALUE>
        <VALUE>PRP</VALUE>
        <VALUE>PRP$</VALUE>
        <VALUE>PRPR$</VALUE>
        <VALUE>RP</VALUE>
        <VALUE>TO</VALUE>
        <VALUE>UH</VALUE>
        <VALUE>WDT</VALUE>
        <VALUE>WP</VALUE>
        <VALUE>WP$</VALUE>
        <VALUE>WRB</VALUE>
        <VALUE>SYM</VALUE>
        <VALUE>\"</VALUE>
        <VALUE>#</VALUE>
        <VALUE>$</VALUE>
        <VALUE>’</VALUE>
        <VALUE>(</VALUE>
        <VALUE>)</VALUE>
        <VALUE>,</VALUE>
        <VALUE>--</VALUE>
        <VALUE>-LRB-</VALUE>
        <VALUE>.</VALUE>
        <VALUE>’’</VALUE>
        <VALUE>:</VALUE>
        <VALUE>::</VALUE>
        <VALUE>‘</VALUE>
    </VALUES>
    
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>POS_category(0)</NAME>
    
    <TYPE>Token</TYPE>
    
    <FEATURE>category</FEATURE>

    
    <POSITION>0</POSITION>

    
    <VALUES>
      
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
        <VALUE>NNS</VALUE>
        <VALUE>NP</VALUE>
        <VALUE>NPS</VALUE>
        <VALUE>JJ</VALUE>
        <VALUE>JJR</VALUE>
        <VALUE>JJS</VALUE>
        <VALUE>JJSS</VALUE>
        <VALUE>RB</VALUE>
        <VALUE>RBR</VALUE>
        <VALUE>RBS</VALUE>
        <VALUE>VB</VALUE>
        <VALUE>VBD</VALUE>
        <VALUE>VBG</VALUE>
        <VALUE>VBN</VALUE>
        <VALUE>VBP</VALUE>
        <VALUE>VBZ</VALUE>
        <VALUE>FW</VALUE>
        <VALUE>CD</VALUE>
        <VALUE>CC</VALUE>
        <VALUE>DT</VALUE>
        <VALUE>EX</VALUE>
        <VALUE>IN</VALUE>
        <VALUE>LS</VALUE>
        <VALUE>MD</VALUE>
        <VALUE>PDT</VALUE>
        <VALUE>POS</VALUE>
        <VALUE>PP</VALUE>
        <VALUE>PRP</VALUE>
        <VALUE>PRP$</VALUE>
        <VALUE>PRPR$</VALUE>
        <VALUE>RP</VALUE>
        <VALUE>TO</VALUE>
        <VALUE>UH</VALUE>
        <VALUE>WDT</VALUE>
        <VALUE>WP</VALUE>
        <VALUE>WP$</VALUE>
        <VALUE>WRB</VALUE>
        <VALUE>SYM</VALUE>
        <VALUE>\"</VALUE>
        <VALUE>#</VALUE>
        <VALUE>$</VALUE>
        <VALUE>’</VALUE>
        <VALUE>(</VALUE>
        <VALUE>)</VALUE>
        <VALUE>,</VALUE>
        <VALUE>--</VALUE>
        <VALUE>-LRB-</VALUE>
        <VALUE>.</VALUE>
        <VALUE>’’</VALUE>
        <VALUE>:</VALUE>
        <VALUE>::</VALUE>
        <VALUE>‘</VALUE>
    </VALUES>
    
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>POS_category(1)</NAME>
    
    <TYPE>Token</TYPE>
    
    <FEATURE>category</FEATURE>

    
    <POSITION>1</POSITION>

    
    <VALUES>
      
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
        <VALUE>NNS</VALUE>
        <VALUE>NP</VALUE>
        <VALUE>NPS</VALUE>
        <VALUE>JJ</VALUE>
        <VALUE>JJR</VALUE>
        <VALUE>JJS</VALUE>
        <VALUE>JJSS</VALUE>
        <VALUE>RB</VALUE>
        <VALUE>RBR</VALUE>
        <VALUE>RBS</VALUE>
        <VALUE>VB</VALUE>
        <VALUE>VBD</VALUE>
        <VALUE>VBG</VALUE>
        <VALUE>VBN</VALUE>
        <VALUE>VBP</VALUE>
        <VALUE>VBZ</VALUE>
        <VALUE>FW</VALUE>
        <VALUE>CD</VALUE>
        <VALUE>CC</VALUE>
        <VALUE>DT</VALUE>
        <VALUE>EX</VALUE>
        <VALUE>IN</VALUE>
        <VALUE>LS</VALUE>
        <VALUE>MD</VALUE>
        <VALUE>PDT</VALUE>
        <VALUE>POS</VALUE>
        <VALUE>PP</VALUE>
        <VALUE>PRP</VALUE>
        <VALUE>PRP$</VALUE>
        <VALUE>PRPR$</VALUE>
        <VALUE>RP</VALUE>
        <VALUE>TO</VALUE>
        <VALUE>UH</VALUE>
        <VALUE>WDT</VALUE>
        <VALUE>WP</VALUE>
        <VALUE>WP$</VALUE>
        <VALUE>WRB</VALUE>
        <VALUE>SYM</VALUE>
        <VALUE>\"</VALUE>
        <VALUE>#</VALUE>
        <VALUE>$</VALUE>
        <VALUE>’</VALUE>
        <VALUE>(</VALUE>
        <VALUE>)</VALUE>
        <VALUE>,</VALUE>
        <VALUE>--</VALUE>
        <VALUE>-LRB-</VALUE>
        <VALUE>.</VALUE>
        <VALUE>’’</VALUE>
        <VALUE>:</VALUE>
        <VALUE>::</VALUE>
        <VALUE>‘</VALUE>
    </VALUES>
    
  </ATTRIBUTE>

  <ATTRIBUTE>
    
    <NAME>Entity(0)</NAME>
    
    <TYPE>Entity</TYPE>
    
    <POSITION>0</POSITION>

    <CLASS/>
    
  </ATTRIBUTE>

  </DATASET>

  <ENGINE>
    <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
    <OPTIONS>
        <CLASSIFIER OPTIONS="-S -C 0.25 -B -M 2">weka.classifiers.trees.J48</CLASSIFIER>
        <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
    </OPTIONS>
  </ENGINE>
</ML-CONFIG>

¹This is only true for GATE 5.2 or later; in earlier versions all modes were unsafe for multiple instances of the PR.

²The SVM package SVM Light can be downloaded from http://svmlight.joachims.org/.
