Log in Help
Homereleasesgate-5.0-build3244-ALLdoctao 〉 splitch11.html

Chapter 11
Machine Learning API [#]

This chapter describes a new machine learning layer in GATE. The current implementation are mainly for the three types of NLP learning, namely chunk recognition (e.g. named entity recognition), text classification and relation extraction, which cover many NLP learning problems. The implementation for chunk recognition is based on our work using the support vector machines (SVM) for information extraction [Li et al. 05a]. The text classification is based on our works on opinionated sentence classification and patent document classification (see [Li et al. 07b] and [Li et al. 07c], respectively). The relation extraction is based on our work on named entity relation extraction [Wang et al. 06].

The machine learning API, given a set of documents, can also produce several feature files containing linguistic features and feature vectors, respectively, and labels if there are any in the documents. It can also produce the so-called document-term matrix and n-gram based language model. Those features files are in text format and can be used outside of the GATE. Hence user can use those features off-line for her/his own purpose, e.g. evaluating the new learning algorithms.

The learning API also provides the facilities for active learning based on the learning algorithm Support Vector Machines (SVM), mainly ranking the unlabelled documents according to the confidence scores of the current SVM models for those documents.

The primary learning algorithm implemented is SVM, which has achieved state of the art performances for many NLP learning tasks. The training of the SVM uses the Java version of the SVM package LibSVM [CC001]. The application of the SVM is implemented by ourselves. Moreover, the ML implementation provides an interface to the open-source machine learning package Weka [Witten & Frank 99]. Therefore it can use the machine learning algorithms implemented in Weka. The three widely used learning algorithms, naive Bayes method, KNN and the decision tree algorithm C4.5 are available in the current implementation.

In order to use the machine learning (ML) API, user mainly has to do three things. First user has to annotate some documents with the labels that s/he wants the learning system to annotate in new documents. Those label annotations should be the GATE annotations. Secondly, user may need to pre-process the documents to obtain the linguistic features for the learning. Again these features should be in the form of the GATE annotations. The GATE’s plug-in ANNIE would be very helpful for producing the linguistic features. Other plug-ins such as NP Chunker and parser may also be very helpful. Finally user has to create a configuration file for setting the ML API, e.g. selecting the learning algorithm and defining the linguistic features used by learning. Note that user may not need to create the configuration file from scratch. Instead user can copy one of the three example files presented below and make modifications on it for one particular problem.

The rest of the chapter is organised as follows. Section 11.1 explains the ML in general and the specifications in GATE. Section 11.2 describes all the configuration settings of the ML API one by one, in particular all the elements in the configuration file for setting the ML API (e.g. the learning algorithm to be used and the options for the learning) and defining the NLP features for the user’s particular problem. Section 11.3 presents three exemplary settings respectively for the three types of NLP learning problems to illustrate the usage of this ML plug-in. Section 11.4 lists the steps of using the ML API. Finally Section 11.5 explains the outputs of the ML API for the four usage modes, namely the training, application, evaluation and producing feature files only, respectively, and in particular the format of the feature files and label list file produced by the ML API.

11.1 ML Generalities

There are two main types of ML, supervised learning and unsupervised learning. The supervised learning is more effective and much more widely used in the NLP. Classification is a particular example of supervised learning in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute the new examples into the existing classes. This is the type of ML that is used in GATE and all further references to ML actually refer to classification.

An ML algorithm “learns” about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (and unforeseen) examples of the phenomenon.

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. statistical model, a decision tree, a rule set, etc.) from a dataset of already classified instances. During application, the model built during training is used to classify new instances.

The ML API in GATE is designed particularly for NLP learning. It can be used for the three types of NLP learning, text classification, chunk recognition, and relation extraction, which cover many NLP learning tasks.

From the ML’s point of view, typically the three types of NLP learning use different linguistic features and feature representations. For example, it has been recognised that for text classification the so-called tf - idf representation of the n-grams in the text is very effective by using some learning algorithms such as SVM. For chunk recognition identifying the first token and the end token of chunk by using the linguistic features of the token itself and the surrounding tokens is effective and efficient. Relation extraction needs considering both the linguistic features from each of the two terms involved in the relation and those features combined from the two terms.

The ML API implements the suitable feature representations for the three types of NLP learning. It also implements the Java codes or wrappers for the widely used ML algorithms including SVM, KNN, Naive Bayes, and the decision tree algorithm C4.5. In addition, given some documents, it can produce the NLP features and the feature vectors, which will be stored in the files. Hence, the users can use those features for evaluating the learning algorithms of their own for some NLP learning task or for any other further processing.

The rest of this section explains some basic definitions in ML and their specification in this GATE plug-in.

11.1.1 Some definitions

11.1.2 GATE-specific interpretation of the above definitions

11.2 The Batch Learning PR in GATE

Access to ML implementations is provided in GATE by the “Batch Learning PR” that handles the four usage modes, the training and application of ML model, the evaluation of learning on GATE documents1, and producing the feature files only. This PR is a Language Analyser so it can be used in all default types of GATE controllers.

In order to allow for more flexibility, all the configuration parameters for the PR are set through one external XML file, except the three learning modes which are selected through the normal PR parameterisation. The XML file contains both the configuration parameters of the ML API itself and the linguistic data (namely the definitions of the instance and attributes) used by the ML API. The XML file would be required to be specified when loading ML API plug-in into GATE.

The parent directory of the XML configuration file is called as working directory. A subdirectory of the working directory, named as “savedFiles”, will be created (if it does not exist when loading the ML API). All the files produced by the ML API, including the NLP features files, label list file, feature vector file and learned model file, will be stored in that subdirectory. The log file recording the information of one learning session is also in this directory.

In the following we first describe a few settings which are the parameters of the ML API plug-in. Then we explain those settings specified in the configuration file.

11.2.1 The settings not specified in the configuration file

For the sake of convenience, a few settings are not specified in the configuration file. Instead the user should specify them as loading or run-time parameters of the ML API plug-in, as in many other PRs.

Please note that, if the ML API is added into a GATE application as corpus pipeline, it does not process the documents in the corpus until the last document goes through the corpus pipeline, except in the APPLICATION mode2. When the last document in the corpus is going through the pipeline, the ML PR processes all the documents in the corpus. This kind of batch processing is necessary for batch learning algorithms such as the SVM. It also makes the application of learned model much more faster than the normal behaviour of a PR in corpus pipeline, namely every PR in the pipeline process the same document in the corpus before processing the next document. One important consequence of the batch implementation is that one cannot put some post-processing PRs directly after the ML PR in the same corpus pipeline. Instead, s/he should use the ML PR as the last PR in the corpus pipeline, and put all the post-processing PRs into another GATE application which post-processes the documents processed by the ML. However, this kind of limitation can be alleviated for the APPLICATION mode (see Footnote above for more details).

11.2.2 All the settings in the XML configuration file

The root element of the XML configuration file needs to be called “ML-CONFIG” and it must contain two basic elements, “DATASET” and “ENGINE”, and optionally other optional settings. In the following we first describe the optional settings, then the “ENGINE” element, and finally the “DATASET” element. In next section some examples of the XML configuration file are given for illustration. Please also refer to the configuration files in the test directory (i.e. plugs/learning/test/ under the main gate directory) for more examples.

The optional settings in the configuration file

The ML API provides a variety of optional settings, which facilitates different tasks. Every optional setting has a default value — if one optional setting is not specified in the configuration file, the ML API will adopt its default value. Each of the following optional settings can be set as an element in the XML configuration file.

The ENGINE element

The ENGINE element specifies which particular ML algorithm will be used and also allows the setting of options for that algorithm.

Note that for the SVM learning, user can choose one of the two learning engines. We will discuss the two SVM learning engines below. Another note is that the current implementation only allows the SVM learning engine use linear and polynomial kernels but not other types of kernels, despite the fact that the original SVM packages implemented other types of kernel. The main reason for that is that linear and polynomial kernels are popular in natural language learning and other types of kernel has been used in rare case. However, if you want to experiment with other types of kernel, you can do it by first running the learning plugin in GATE to produce the training and testing data, then convert those data into the specific format that a SVM package requires for input data, and finally run the SVM package on the data.

The configuration files in the test directory (i.e. plugs/learning/test/ under the main gate directory) contain the examples for setting the learning engine.

The ENGINE element in the configuration file is as
<ENGINE nickname=”X” implementationName=”Y” options=”Z”/>
It has three items:

The DATASET element

The DATASET element defines the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An “INSTANCE-TYPE” sub-element is used to select the annotation type to be used for instances, and the attributes are defined by a sequence of attribute elements.

For example, if an “INSTANCE-TYPE” has a “Token” as its value, there will be one instance in the document per “Token”. This also means that the positions (see below) are defined in relation to Tokens. The “INSTANCE-TYPE” can be seen as the basic unit to be taken into account for machine learning.

Different NLP learning tasks may have different instance types and use different kinds of attribute elements. Chunking recognition often uses the token as instance type and the linguistic features of “Token” and other annotations as features. Text classification’s instance type is the text unit for classification, e.g. the whole document, or sentence, or token. If classifying a sequence of tokens, the n-grams representation of the tokens is often a good feature representation for many statistical learning algorithms. For relation extraction, the instance type is a pair of terms for the relation, and the features come from not only the linguistic features of each of the two terms but also those related to both terms.

A DATASET element should define an instance type sub-element, and an “ATTRIBUTE” sub-element or an “ATRRIBUTE_REL” sub-element as “Class”, and some linguistic feature related sub-elements. All the annotation types involved in the data set definition should be in the same Annotation Set. Each sub-element defining the linguistic features or class label is associated with one or more annotation types. Each annotation type used should specify one of its annotation feature — the values of that annotation feature are the linguistic features used as input to learning algorithm or the class labels for learning9. Note that if blank spaces are contained in the values of the annotation features, they will be replaced by the character “_” in each occurrence. So it is advisable that the values of the annotation features used, in particular for the class label, do not contain any blank space.

In the following we explain all the sub-elements one by one. Please also refer to the examples of configuration files presented in next section for the examples of the dataset definition. Note that each sub-element should have a unique name (which is different from other sub-element’s names) if it requires name, unless we state explicitly the other case.

11.3 Examples of configuration file for the three learning types

The following are three illustrated examples of configuration file for information extraction, sentence classification, and relation extraction, respectively. Note that the configuration file is in the XML format and should be stored in a file with name having “XML” extension.

The first example is for information extraction. The optional settings are in the first part. It first specifies the surround mode as “true”, as it is a kind of chunk learning. Then it specifies the filtering settings. The ratio’s value is “0.1” and the dis’s value is “near”, meaning that the 10% of negative examples which are closest to the learned SVM hyper-plane will be removed in the filtering stage before the learning for the information extraction. The thresholds of probabilities for the boundary tokens and information entity are set as “0.4” and “0.2”,respectively. The threshold of probability for the classification is also set as “0.5”, which, however, will not be used in the problem for chunk learning with the surround mode set as “true”. The multiClassification2Binary is set as “one-vs-others”, meaning that the ML API will convert the multi-class classification problem into some binary classification problems by using the one against all others approach. For the EVALUATION setting, the “2-fold” cross-validation will be used.

The second part is the sub-element ENGINE specifying the learning algorithm. The ML API will use the SVM learning implemented in the LibSVM. From the option settings it will use the linear kernel with the cost C as 0.7 and the cache memory as 100M. Additionally it will use the uneven margins SVM with the uneven margins parameter τ as 0.4.

The last part is the DATASET sub-element, defining the linguistic features used. It first specifies the “Token” annotation as instance type. The first ATTRIBUTELIST allows the token’s string as the feature of an instance. The range from “-5” to “5” means that the strings of the current token instance as well as its five preceding tokens and its five following tokens would be used as features for the current token instance. The next two attribute lists define the features based on the token’s capitalisation information and types, respectively. The ATTRIBUTELIST named as “Gaz” uses as features the values of the feature “majorType” of the annotation type “Lookup”. Finally the ATTRIBUTE feature defines the attribute class, because it has a sub-element <CLASS/>. The values of the feature “class” of the annotation type “Mention” are the class labels.

<?xml version="1.0"?>  
  <SURROUND value="true"/>  
  <FILTERING ratio="0.1" dis="near"/>  
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>  
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>  
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>  
  <multiClassification2Binary method="one-vs-others"/>  
  <EVALUATION method="kfold" runs="2"/>  
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"  
        options=" -c 0.7 -t 0 -m 100 -tau 0.4  "/>  
       <RANGE from="-5" to="5"/>  
       <RANGE from="-5" to="5"/>  
       <RANGE from="-5" to="5"/>  
       <RANGE from="-5" to="5"/>  

The following is a configuration file for sentence classification. It first specifies the surround mode as “false”, because it is a text classification problem. The next two options allows the label list and the NLP feature list being updated from the training data. It also specifies the thresholds for entity and boundary of entity. Note that these two specifications will not be used in the text classification problems. However, the presences of them in the configuration file is not harmful to the ML API. The threshold of probability for classification is set as “0.5”, which would be used in the application of the learned model. The evaluation will use the hold-out test method. It will randomly select 66% documents from the corpus for training and other 34% documents for testing. It will have two runs of evaluation and the results are averaged over the two runs. Note that it does not specify the method of converting a multi-class classification problem into several binary class problem, meaning that it will adopt the default one (namely one against all others) if it needs.

The configuration file specifies the KNN as learning algorithm. It also specifies the number of neighbours used as 5. Of course other learning algorithms can be used as well. For example, the ENGINE element in the previous example, which specifies the SVM as learning algorithm, can be put into this configuration file to replace the current one.

In the DATASET element, the annotation “Sentence” is used as instance type. Two kinds of linguistic features are defined. One is the NGRAM. Another is the ATTRIBUTE. The n-gram is based on the annotation “Token”. It is uni-gram as its NUMBER element has the value 1. It is based on the two features “root” and “category” of the annotation “Token”. In another words, two tokens will be considered as the same term of the uni-gram if and only if they have the same “root” feature and the same “category” feature. The weight of the ngram was set as 10.0, meaning its contribution is ten times of the contribution of another feature, the sentence length. By the ATTRIBUTE the feature “sent_size” of the annotation “Sentence” is used as another NLP feature. Finally the values of the feature “class” of the annotation “Sentence” are the class labels.

<?xml version="1.0"?>  
  <SURROUND value="false"/>  
  <IS-LABEL-UPDATABLE value="true"/>  
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>  
  <PARAMETER name="thresholdProbabilityBoundary" value="0.42"/>  
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>  
  <EVALUATION method="holdout" runs="2" ratio="0.66"/>  
  <ENGINE nickname="KNN" implementationName="KNNWeka" options = " -k 5 "/>  

The last configuration file is for relation extraction. It does not specify any optional setting, meaning that it uses all the default values of those settings (see Section 11.2.2 for the default values of all possible settings). For example,

The configuration file specifies the learning algorithm as the Naive Bayes method implemented in Weka. However, other learning algorithms can be used as well.

The linguistic features used for relation extraction are more complicated than other two types of NLP learning. For relation learning, user must specify an annotation type covering the two arguments of relation as the instance type. User may also need to specify the annotation type for one argument if s/he wants to use the feature related only to the argument. In the DATASET sub-element, the instance type is specified as the annotation type “RE_INS”. Two arguments of the instance are specified by the two features “arg1” and “arg2” of the annotation type “RE_INS”, which should refer to the same values of the argument IDs as those specified in the ARG sub-elements of the sub-element FEATURE-ARG1 and FEATURE-ARG2. The linguistic features related to the first argument of relation are defined in the sub-element FEATURE-ARG1, in which the feature “string” of the annotation type “Token” of the first argument is used as one NLP feature for relation learning. Similarly the token’s form feature related to the second argument is also used as another NLP feature, as specified in the sub-element FEATURE-ARG2. By the first ATTRIBUTE_REL the feature “t12” of the annotation type RE_INS related to the relation instance defines another feature for relation learning. The second ATTRIBUTE_REL defines the label. It specifies the class labels as the values of the feature “Relation_type” of the annotation type ACERelation. The feature “MENTION_ARG1” of the type ACERelation should refer to the identification feature (namely the values of feature “MENTION_ID” of the annotation type “ACEEntity”) of the first argument annotation. The feature “MENTION_ARG2” should refer to the second argument annotation of the relation.

<?xml version="1.0"?>  
   <ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>  

11.4 How to use the ML API

The ML API implements the procedure of using supervised machine learning for NLP, which generally has two steps, training and application. The training step learns some models from labelled data. The application step applies the learned models to the unlabelled data to add labels. Therefore, in order to use supervised ML for NLP, one should have some labelled data, which can be obtained either by manually annotating or from other resources. One also needs determine which linguistic features are used in the training (and the same features should be used in the application as well). Note that in the implementation all those features are based some GATE annotation type and one annotation feature of that type. Finally one should determine which learning algorithm will be used.

Based on the general procedure, we explain how to use the ML API step by step in the following.

  1. Annotate some documents with labels that you want ML API to learn. The labels should be represented by the values of some feature of one GATE annotation type.
  2. Determine the linguistic features that you want the ML API to use for learning.
  3. Use the PRs of the GATE to obtain those linguistic features so that the documents for training or application indeed contain the features. By using the ANNIE you can have many useful features. The other PRs such as GATE morphological analyser and the parsers may produce useful features as well. Sometimes you may need to write some Jape scripts to produce the features you want. Note that all the features are represented as values of the features of some GATE’s annotation types, which should be specified in the DATASET element of the configuration file.
  4. Create an XML configuration file for your learning problem. The file should contain one DATASET element specifying the NLP features used, one ENGINE element specifying the learning algorithm, and some optional settings as necessary. (Tip: it may be easier of copying one of the configuration files presented above and modifying them for your problem than writing a configuration file from scratch.)
  5. Load the training documents containing the required annotations (in one annotation set) representing the linguistic feautures and the label, and put them into a corpus.
  6. Load the ML API into GATE GUI. First you need load the plugin with name “learning” into GATE using the tool Manage CREOLE Plugins, if the GATE you are using haven’t done it yet. Then you can create a PR for the plugin from the “Batch Learning PR” in the existing PR list by providing a configuration file for the ML plugin. After that you can put the PR into a Corpus Pipeline application to use it. Add the corpus containing training documents into the application too. Set the inputASName as the annotation set containing the annotations for linguistic features and labels.
  7. Select the run-time parameter learningMode as “TRAINING” to learn a model from the training data. Or select the learningMode as “EVALUATION” to do evaluation on the training data and get some results. Note that when using the “EVALUATION” mode, please make sure that the outputASName is set with the same annotation set as the inputASName. (Tip: it may save your time if you first try the “EVALUATION” mode on a small number of documents to make sure that the ML API works well on your problem and outputs some reasonable results before training on the large data.)
  8. If you want to apply the learned model to the new documents, load those new document into GATE, and pre-process them using exactly the same PRs as those used for pre-processing the training documents. Then load the ML API with the same configuration file as for training into the GATE, select the learningMode as “APPLICATION”, and run the ML API on the corpus containing the new documents. The application results, namely the new annotations containing the labels, will be added into the annotation set specified by the outputASName.
  9. If you just want the feature files produced by the system and do not want to do any learning or application, select the learning mode “ProduceFeatureFilesOnly”.

11.5 The outputs of the ML API

There are several different types of the outputs of the ML API. First the ML API outputs some information about the learning settings. Those information will be printed in the Messages Window of the GATE GUI and also into the log file “logFileForNLPLearning.save”. The amount of the information displayed can be set via the VERBORSITY parameter in the XML configuration file. The main output of the learning system are the results of the ML API, which are different for different usage modes. In training mode the system produce the learned models. In applicaiton mode it annotates the documents using the learned model. In the evaluation mode it displays the evaluation results. Finally in ProduceFeatureFilesOnly mode it produces several feature files for the current corpus. In the following we explain the outputs for different learning modes, respectively.

Note that all the files produced by the ML API, including the log file, are in the sub-directory “savedFiles” of the working directory where the XML configuration file is in.

11.5.1 Training results

When the ML API is in the training mode, its main output is the learned model stored in one file named “learnedModels.save”. For the SVM, the learned model file is a text file. For the learning algorithms implemented in Weka, the model file is a binary file. The output also includes the feature files described in subsection 11.5.4.

11.5.2 Application results

The main application result is the annotations added into the documents. Those annotations are the results of applying the ML model to the documents. The type of the annotations is specified in the class element in the configuration file. The class label of one annotation is the value of the annotation feature also specified in the class element. The annotation has another feature with name as “prob” which presents the confidence level of the learned model on this annotation.

11.5.3 Evaluation results

The ML API outputs the evaluation results for each run of the evaluation and also the averaged results over all the runs. For each run, it first prints the message about the names of documents in corpus for training and testing, respectively. Then it displays the evaluation results of this run — first the results for each class label and then the micro-averaged results over all labels. For each label, it presents the name of label, the number of instances belonging to the label in the training data, and the F-measure results on the testing data — the numbers of correct, partial correct, spurious and missing instances in the testing data, and the two types of F-measures (Precision, Recall and F1): the strict and the lenient. The F-measure results are obtained by using the AnnotationDiff Tool which is described in Chapter 13. Finally the system presents the means of the results of all runs for each label and the micro-averaged results, respectively.

11.5.4 Feature files

The ML API produces several feature files from the documents with the GATE annotations. These feature files could be used by user for evaluating the learning algorithms not implemented in this plug-in. In the following we describe the formats of those feature files. Note that all the data files described in the following can be obtained by selecting the run time parameter learningMode as “ProduceFeatureFilesOnly”. But you may get some of data files when you use other learning mode.

NLP feature file. This file, named as NLPFeatureData.save, contains the NLP features of the instances, which are defined in the configuration file. The first few lines of an NLP feature file for information extraction are as

Class(es) Form(-1) Form(0) Form(1) Ortho(-1) Ortho(0) Ortho(1)  
0 ft-airlines-27-jul-2001.xml 512  
1 Number_BB _NA[-1] _Form_Seven _Form_UK[1] _NA[-1] _Ortho_upperInitial _Ortho_allCaps[1]  
1 Country_BB _Form_Seven[-1] _Form_UK _Form_airlines[1] _Ortho_upperInitial[-1]  
           _Ortho_allCaps _Ortho_lowercase[1]  
0 _Form_UK[-1] _Form_airlines _Form_including[1] _Ortho_allCaps[-1] _Ortho_lowercase  
0 _Form_airlines[-1] _Form_including _Form_British[1] _Ortho_lowercase[-1] _Ortho_lowercase  
1 Airline_BB _Form_including[-1] _Form_British _Form_Airways[1] _Ortho_lowercase[-1]  
           _Ortho_upperInitial _Ortho_upperInitial[1]  
1 Airline _Form_British[-1] _Form_Airways _Form_[1], _Ortho_upperInitial[-1]  
           _Ortho_upperInitial _NA[1]  
0 _Form_Airways[-1] _Form_, _Form_Virgin[1] _Ortho_upperInitial[-1] _NA  

The first line of the NLP feature file lists the names of all features used. The number in the parenthesis following a feature name indicates the position of the feature. For example, “Form(-1)” means the form of token which is immediately before the current token, and “Form(0)” means the form of the current token. The NLP features for all instances are listed for one document following another. For one document, the first line shows the index of the document, the document’s name and the number of instances in the document, as shown in the second line above. Then each of the following lines is for each of the instances in the document, with the order of their occurrences in the document. For each line for one instance, the first item is a number n, representing the number of class labels of the instance. Then the following n items are the labels. If the current instance is the first instance of an entity, its corresponding label has a suffix “_BB”. The other items following the label item(s) are the NLP features of the instance, in the order as listed in the first line of the file. Each NLP feature contains the feature’s name and value, separated by “_”. At the end of one NLP feature, there may be an integer in the parenthesis “[]”, which represents the position of the feature relative to the current instance. If there is no such an integer in “[]” at the end of one NLP feature, then the feature is at the position 0.

Feature vector file. It has the file name featureVectorsData.save and stores the feature vector in the sparse format for each instance. The first few lines of the feature vector file corresponding to the NLP feature file shown above are as

0 512 ft-airlines-27-jul-2001.xml  
1 2 1 2 439:1.0 761:1.0 100300:1.0 100763:1.0  
2 2 3 4 300:1.0 763:1.0 50439:1.0 50761:1.0 100440:1.0 100762:1.0  
3 0 440:1.0 762:1.0 50300:1.0 50763:1.0 100441:1.0 100762:1.0  
4 0 441:1.0 762:1.0 50440:1.0 50762:1.0 100020:1.0 100761:1.0  
5 1 5 20:1.0 761:1.0 50441:1.0 50762:1.0 100442:1.0 100761:1.0  
6 1 6 442:1.0 761:1.0 50020:1.0 50761:1.0 100066:1.0  
7 0 66:1.0 50442:1.0 50761:1.0 100443:1.0 100761:1.0

The feature vectors are also listed for one document after another. For one document, the first line shows the index of the document, the number of instances in the document, and the document’s name. Each of the following lines is for each of the instances in the document. The first item in the line of one instance is the index of the instance in the document. The second item is a number n, representing the number of labels the instance has. The following n items are the labels’ indexes of the instance.

For the text classification and relation learning, the label’s index comes directly from the label list file described below. For the chunk learning, the label’s index presented in the feature vector file is a bit complicated, as explained in the following. If an instance (e.g. token) is the first one of a chunk with a label k, then the instance has the label’s index as 2 * k - 1, as shown in the fifth instance. If it is the last instance of the chunk, it has the label’s index as 2 * k, as shown in the sixth instance. If the instance is both the first one and the last one of the chunk (namely the chunk consists of one instance), it has two label indexes, 2 * k - 1 and 2 * k, as shown in the first (and second) instance.

The items following the label item are the non-zero components of the feature vector. Each component is represented by two numbers separated by “:”. The first number is the dimension of the component in the feature vector, and the second one is the value of the component.

Label list file. It has the name LabelsList.save and stores a list of labels and their indexes. The following is a part of label list. Each line shows one label name and its index in the label list.

Airline 3  
Bank 13  
CalendarMonth 11  
CalendarYear 10  
Company 6  
Continent 8  
Country 2  
CountryCapital 15  
Date 21  
DayOfWeek 4

NLP feature list. It has the name NLPFeaturesList.save and contains a list of NLP features and their indexes in the list. The following are the first few lines of an NLP feature list file.

_EntityType_Date 13 1731  
_EntityType_Location 170 1081  
_EntityType_Money 523 3774  
_EntityType_Organization 12 2387  
_EntityType_Person 191 421  
_EntityType_Unknown 76 218  
_Form_’’ 112 775  
_Form_\$ 527 74  
_Form_’ 508 37  
_Form_’s 63 731  
_Form_( 526 111

The first line of the file shows the number of instances from which the NLP features were collected. The number of instances will be used for the computation of the so-called idf in document or sentence classification. The following lines are for the NLP features. Each line is for one unique feature. The first item in the line represents the NLP feature, which is a combination of the feature’s name defined in the configuration file and the value of the feature. The second item is a positive integer, representing the index of the feature in the list. The last item is the number of instances that the feature occurs, which is needed for computing the idf.

N-grams (or language model) file. The file has the name NgramList.save, which only can be produced by selecting the learning mode as “ProduceFeatureFilesOnly”. In order to produce the n-gram data, user may just use a very simple configuration file, e.g. it could only contain the DATASET element, and the data element could only contain an NGRAM element to specify the type of n-gram and the INSTANCE-TYPE element to define the part of document from which the n-gram data are created. The NGRAM element in configuration file specifies what type of n-grams the ML PR produces (see Section 11.2.2 for the explanation of the n-gram definition). For example, if you specify a bi-gram based on the string form of Token, you will obtain a list of bi-gram from the corpus you used. The following are the first lines of a bi-gram list based on the token’s form and was obtained from 3 documents.

## The following 2-gram were obtained from 3 documents or examples  
Aug<>, 3  
Female<>; 3  
Human<>; 3  
2004<>Aug 3  
;<>Female 3  
.<>The 3  
of<>a 3  
)<>: 3  
,<>and 3  
to<>be 3  
;<>Human 3

The two terms in one bi-gram are separated by “<>”. The number following one n-gram is the number of occurrences of that n-gram in the corpus. The n-gram list is ordered according to the numbers of occurrence of n-gram terms. The most frequent terms in the corpus are in the beginning of the list.

The n-gram data produced can be based on any features of annotations available in the documents. Hence it can not only produce the conventional n-gram data based on the token’s form or lemma, but also the n-gram based on e.g. token’s POS, or combination of token’s POS and form, or any feature of sentence annotation (see Section 11.2.2 for how to define different types of n-gram). Therefore one can regard the n-gram defined and produced by the ML PR as some kind of generalised n-gram.

Document-term matrix file. The file has the name documentByTermMatrix.save, which only can be produced by selecting the learning mode as “ProduceFeatureFilesOnly”. Document-term matrix presents the weights of terms appearing in each of documents (see Section 9.19 for more explanations). Currently three types of weight are implemented, the binary, term frequency (tf), and tf-idf. The binary weight is the simple one – it is 1 for one term appearing in document and 0 if one term is not in document. tf refers to the number of occurrences of one term in document. tf-idf is popular in information retrieval and text mining. It is a multiplication of tf and idf. idf refers to inverse document frequency, defining as

idfi = log|{d  : t ∈ d }|
            j   i   j

where |D| is the total number of documents in the corpus, and |{dj : ti ∈ dj}| is the number of documents in which the term ti appears. The type of weight is specified by the sub-element ValueTypeNgram in the DATASET element in configuration file (see Section 11.2.2).

Like the n-gram data, in order to produce the document-term matrix, user may just use a very simple configuration file, e.g. it could only contain the DATASET element, and the data element could only contain two elements, the INSTANCE-TYPE element to define the part of document from which the terms are counted, and a NGRAM element to specify the type of n-gram. As said above, the element ValueTypeNgram specifies the type of value used in the matrix. If it is not presented, the default type tf-idf will be used. The conventional document-term matrix can be produced by the uni-gram based on token’s form or lemma and the instance-type covering the whole document. Any other choices for the n-gram and the instance type will result in some generalised document-term matrix.

The following was extracted from the beginning of a document-term matrix file, produced by uni-gram based on token’s form. It presents a part of terms and their tf values in the document named “27.xml”. One term and its tf are separated by “:”. The terms are ranked in alphabetic order.

0 Documentname="27.xml", has 1 parts:  
":2 (:6 ):6 ,:14 -:1 .:16 /:1 124:1 2004:1 22:1 29:1 330:1 54:1 8:2 ::5  
;:11 Abstract:1 Adaptation:1 Adult:1 Atopic:2 Attachment:3 Aug:1  
Bindungssicherheit:1 Cross-:1 Dermatitis:2 English:1 F-SOZU:1 Female:1  
Human:1 In:1 Index:1 Insecure:1 Interpersonal:1 Irrespective:1 It:1 K-:1  
Lebensqualitat:1 Life:1 Male:1 NSI:2 Neurodermitis:2 OT:1 Original:1  
Patients:1 Psychological:1 Psychologie:1 Psychosomatik:1 Psychotherapie:1  
Quality:1 Questionnaire:1 RSQ:1 Relations:1 Relationship:1 SCORAD:1 Scales:1  
Sectional:1 Securely:1 Severity:2 Skindex-:1 Social:1 Studies:1 Suffering:1  
Support:1 The:1 Title:1 We:3 [:1 ]:1 a:4 absence:1 affection:1 along:2  
amount:1 an:1 and:9 as:1 assessed:1 association:2 atopic:5 attached:7

A list of names of documents processed. The file has the name docsName.save, which only can be produced by selecting the learning mode as “ProduceFeatureFilesOnly”. It contains the names of all the documents processed. The first line shows the number of documents in the list. Then each line list one document’s name. The first lines of one file are showed in the following.


A list of names of the selected documents for active learning purpose. The file has the name ALSelectedDocs.save. It is pure text file. It will be produced by selecting the learning mode as “ProduceFeatureFilesOnly”. The file contain the names of documents which have been selected for annotating and training in the active learning process. It is used by the “RankingDocsForAL” learning mode to exclude those selected documents from the ranked documents for active learning purpose. When one or more documents are selected for annotating and training, their names should be put into this file, one line per document’s name.

A list of names of ranked documents for active learning purpose. The file has the name ALRankedDocs.save. It can only be produced by selecting the learning mode as “RankingDocsForAL”. The file contains the list of names of the documents ranked for active learnig, according to their usefulness for learning. Those in the front of the list are the most useful documents for learning. The first line in the file shows the total number of documents in the list. Each of other lines in the fils lists one document and the averaged confidence score for classifiying the documents. An example of the file are showed in the following.

ft-airlines-27-jul-2001.xml_000201 8.61744  
ft-bank-of-england-02-aug-2001.xml_000221 8.672693  
ft-airtours-08-aug-2001.xml_000211 9.82562

1For the evaluation mode the system divides the corpus into two parts, uses one part to learn a model, applies the model to another part, and finally output the results on the testing part (see Sub-section 11.2.2 for the evaluation methods and Section 11.5 for an explanation of the evaluation results

2In the APPLICATION mode the number of documents processed by one time of application can be specified by one parameter in the configuration file and the default value of the parameter is 1. If the parameter is set as 1, then the ML API applies the learned model to the documents in the corpus one by one. In other word, in this case the ML API has the normal behaviour of a GATE PR and can be followed by other PRs in a pipeline. On the other hand, if assigning the parameter a larger value, usually the application would be faster but may probably consume more computer memory.

3For the k-fold cross-validation the system first gets a ranked list of documents in alphabetic order of the documents’ names, then segments the document list into k partitions of equal size, and finally uses each of the partitions as testing set and other documents as training set. For hold-out test, the system randomly selected some documents as testing data and uses all other documents as training data.

4Probably the two methods have different names in some publications. But the methods are the same. Suppose we have a multi-class classification problem with n classes. For the one against others method, one binary classification problem is derived for each of the n classes, which has the examples belonging to the class considered as positive examples and all other examples in training set as negative examples. In contrast, for the one against another one method, one binary classification problem is derived for each pair (c1,c2) of the n classes, in which the training examples belonging to the class c1 are the positive examples and those belonging to another class c2 are the negative examples.

5The SVM package SV Mlight can be downloaded from http://svmlight.joachims.org/.

6Because SVMLibSvmJava runs the SVM learning inside the GATE, while SVMExec runs the SVM learning outside of the GATE as a separated process.

7An example of the options for SVMExec is ”-c 0.7 -t 0 -m 100 -v 0 -tau 0.6 /yaoyong/software/svm-light/data_svm.dat /yaoyong/software/svm-light/model_svm.dat”, meaning that the svm learning uses linear kernel, the uneven margins parameter is set as 0.6, and two data files /yaoyong/software/svm-light/data_svm.dat and /yaoyong/software/svm-light/model_svm.dat for writing and reading data. Note that both the data files specified here are tempory files which are used only by the svm-light training program, can be in anywhere in your computer, and are independent of the data files produced by the GATE learning plugin.

8For example, executableTraining=”/yaoyong/software/svm-light/svm_learn.exe ” specifies one particular svm_learn.exe obtained from the package SV Mlight.

9For an “‘ATTRIBUTE” sub-element, if you do not specify the annotation feature, this sub-element will be ignored by the ML API. Therefore, if one annotation type you want to use does not have any annotation feature, you should add one annotation feature to it and assign one and the same value to the feature for all annotations of that type.

10Given the current instance annotation A, the position information is used to determine another instance annotation B. The linguistic feature for the instance A obtained by this element is the feature from the annotation C which is defined by this element and overlaps with the annotation B. The position 0 means that the instance annotation B is the A itself, -1 means that B is the first preceding annotation of A, and 1 means that B is the first annotation following A.
Please note that, if the annotation A contains more than one annotation C, e.g. C refers to token and A refers to named entity which contains more than one tokens, then there are more than one annotations C in the position 0 relative to the annotation A, in which case the current implementation just picks the first annotation of C. Therefore, if there is a possibility that the annotation A contains more than one annotations C, then it would be better of using e.g. a NGRAM type feature to characterise the feature than using an ATTRIBUTE feature.

11For example, if a user does sentence classification and s/he uses two features, the uni-gram of tokens in one sentence and the length of sentence and thinks that the uni-gram feature is more important than the length of sentence, then s/he can set the weight sub-element of the n-gram element with a number bigger than 1.0 (like 10.0).

12If a feature of a relation instance is determined only by one argument of the relation but not the two arguments, then the feature could be defined in the element FEATURES-ARG1 or the element FEATURES-ARG2. On the other hand, if a feature is related to both arguments, then it should be defined in the element ATTRIBUTE_REL.