Chapter 19
Machine Learning [#]
Machine learning in GATE is provided by the Learning Framework plugin, which is available through the plugin manager.
This section gives a brief introduction. Much more extensive documentation, including a step-by-step tutorial, can be found at:
https://gatenlp.github.io/gateplugin-LearningFramework/
19.1 Brief introduction to machine learning in GATE [#]
There are two main types of ML: supervised learning and unsupervised learning. Classification is a particular example of supervised learning, in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to assign new examples to the existing classes. This is the type of ML that is used in GATE.
An ML algorithm ‘learns’ about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (unseen) examples of the phenomenon.
An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. a statistical model, a decision tree, a rule set, etc.) from a dataset of already classified instances. During application, the model built during training is used to classify new instances.
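The two modes can be pictured with a deliberately tiny sketch in plain Java (this is not the Learning Framework API; the word-lookup "model" is a stand-in for a real statistical model): training builds a model from already-classified instances, and application uses the frozen model to classify unseen ones.

import java.util.*;

/** Illustrative only: a classifier with an explicit training phase that
 *  builds a model, and an application phase that reuses the frozen model. */
public class TrainThenApply {
    public static void main(String[] args) {
        // Training dataset: already-classified instances (text -> class).
        Map<String, String> training = Map.of(
            "great product", "positive",
            "awful service", "negative");

        // Training phase: build a model. Here the "model" is just a
        // word -> class lookup learned from the examples.
        Map<String, String> model = new HashMap<>();
        training.forEach((text, cls) -> {
            for (String word : text.split("\\s+")) model.put(word, cls);
        });

        // Application phase: the model is fixed; new instances are classified.
        String unseen = "great stuff";
        Map<String, Integer> votes = new HashMap<>();
        for (String word : unseen.split("\\s+")) {
            String cls = model.get(word);
            if (cls != null) votes.merge(cls, 1, Integer::sum);
        }
        System.out.println(unseen + " -> " + votes); // great stuff -> {positive=1}
    }
}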
The Learning Framework offers two main task types:
- Text classification classifies text into pre-defined categories. The process can be applied equally well at the document, sentence or token level. Typical examples of text classification include document classification, opinionated sentence recognition, POS tagging of tokens and word sense disambiguation.
- Chunk recognition assigns a label or labels to chunks of text. Chunks of one or several types may be recognized simultaneously (for example, Persons and Organizations). Examples of chunk recognition include named entity recognition (and, more generally, information extraction), NP chunking and word segmentation. A minimal sketch of how chunking reduces to classification follows this list.
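Chunk recognition is usually reduced to token classification: each token is labelled as beginning a chunk, inside a chunk, or outside any chunk (the beginning/inside/outside scheme also mentioned under "instance" below). The following plain-Java sketch is illustrative only, not Learning Framework code; it shows how per-token labels are decoded back into typed chunks.

import java.util.*;

/** Sketch: chunk recognition as token classification. Each token carries a
 *  B (begins a chunk), I (inside a chunk) or O (outside) label, so a
 *  standard classifier can learn chunk boundaries. Names are illustrative. */
public class BioEncoding {
    public static void main(String[] args) {
        String[] tokens = {"John", "Smith", "works", "for", "Acme", "Corp", "."};
        // Labels a trained token classifier might have predicted:
        String[] labels = {"B-Person", "I-Person", "O", "O",
                           "B-Organization", "I-Organization", "O"};

        // Decoding: turn per-token labels back into chunk annotations.
        for (int i = 0; i < tokens.length; i++) {
            if (labels[i].startsWith("B-")) {
                int start = i;
                String type = labels[i].substring(2);
                while (i + 1 < tokens.length && labels[i + 1].equals("I-" + type)) i++;
                System.out.println(type + ": "
                        + String.join(" ", Arrays.copyOfRange(tokens, start, i + 1)));
            }
        }
    }
}

Because the chunk type is folded into the token label, a single classifier can learn several chunk types at once, which is why Persons and Organizations can be recognized simultaneously.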
Typically, the different learning tasks (including relation learning, discussed below) use different linguistic features and feature representations. For example, it has been recognised that for text classification the so-called tf-idf representation of n-grams is very effective (e.g. with SVM). For chunk recognition, identifying the start token and the end token of the chunk by using the linguistic features of the token itself and the surrounding tokens is effective and efficient.
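As a concrete illustration of tf-idf, here is a sketch with a toy corpus (the corpus and the natural-log weighting variant are assumptions for illustration, not Learning Framework defaults): a term's weight in a document is its frequency in that document, scaled down by how many documents contain the term.

import java.util.*;

/** Sketch of the tf-idf representation: each document becomes a vector in
 *  which frequent-but-ubiquitous terms are down-weighted. */
public class TfIdf {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("good", "movie", "good", "plot"),
            List.of("bad", "movie"),
            List.of("good", "acting"));

        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc)) df.merge(term, 1, Integer::sum);

        // tf-idf for document 0: tf(t, d) * log(N / df(t)).
        List<String> d0 = docs.get(0);
        Map<String, Long> tf = new HashMap<>();
        for (String t : d0) tf.merge(t, 1L, Long::sum);
        tf.forEach((t, f) -> System.out.printf("%s: %.3f%n",
                t, f * Math.log((double) docs.size() / df.get(t))));
    }
}

Here "good" occurs twice in document 0 but also appears in another document, so it ends up weighted lower (0.811) than the rarer "plot" (1.099).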
Relation learning can be implemented using classification: first the entities involved in the relation are learned, then a new instance annotation is created for every possible pair of entities, and finally the pairs are classified. The sketch below illustrates the pairing step.
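A minimal sketch of this pair-generation step (plain Java with illustrative names; the hard-coded "worksFor" rule stands in for a trained classifier, which would use features of both entities and the text between them):

import java.util.*;

/** Sketch of relation learning by classification: every candidate entity
 *  pair becomes a new instance, which is then classified as holding the
 *  relation or not. */
public class RelationPairs {
    record Entity(String text, String type) {}

    public static void main(String[] args) {
        List<Entity> entities = List.of(
            new Entity("John Smith", "Person"),
            new Entity("Acme Corp", "Organization"),
            new Entity("London", "Location"));

        // Create an instance for every possible (ordered) pair ...
        for (Entity a : entities)
            for (Entity b : entities) {
                if (a == b) continue;
                // ... then classify the pair. This stand-in rule just checks
                // the type signature of a hypothetical "worksFor" relation.
                boolean worksFor = a.type().equals("Person")
                        && b.type().equals("Organization");
                System.out.println(a.text() + " -> " + b.text()
                        + " : " + (worksFor ? "worksFor" : "none"));
            }
    }
}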
Some important concepts to be familiar with are:
- instance: an example of the studied phenomenon. An ML algorithm learns a model from a set of known instances, called a (training) dataset, and can then apply the learned model to another (application) dataset. In order to use ML in GATE, annotations are used to indicate the instances. For example, for chunking tasks, tokens are normally used as instances, and are classified as beginning, inside or outside of the entity. For classification of tweets into positive or negative, the instance annotation might be “tweet”.
- attribute: a characteristic of the instances. Each instance is defined by the values of its attributes. The set of possible attributes is well defined and is the same for all instances in the training and application datasets. The term ‘feature’ is also often used for this concept, and should not be confused with GATE annotation features. An attribute must be the value of a named feature of a particular annotation type, which might be co-located with the instance, or be before or after it.
- class: the classification to be learned, such as “positive” or “negative” for a review, a score, or whether an entity is a person or an organization. ML is used to find the value of this attribute in the application dataset. Any attribute referring to the current instance can be marked as the class attribute. The exception is chunking tasks, where the class is specified as an annotation type, and turning that into a classification task is done for you behind the scenes. The sketch following this list shows how instances, attributes and classes fit together.
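To make the three concepts concrete, here is a plain-Java sketch (the feature names "string" and "category" mirror GATE's usual Token features, but the offset notation and everything else is invented for illustration, not Learning Framework syntax): each token is an instance, its attributes are the values of the "category" feature at the instance itself and at the annotations before and after it, and the class is the B/I/O label to be learned.

import java.util.*;

/** Sketch: instances (tokens) described by attributes (feature values at
 *  relative offsets) and a class label to be learned. */
public class InstanceAttributes {
    public static void main(String[] args) {
        String[] string   = {"John", "works", "for", "Acme"};
        String[] category = {"NNP", "VBZ", "IN", "NNP"};      // POS feature
        String[] clazz    = {"B-Person", "O", "O", "B-Organization"}; // class

        for (int i = 0; i < string.length; i++) {
            Map<String, String> attrs = new LinkedHashMap<>();
            // Attribute from the instance's own annotation features ...
            attrs.put("category(0)", category[i]);
            // ... and from the annotations one position before and after.
            attrs.put("category(-1)", i > 0 ? category[i - 1] : "<START>");
            attrs.put("category(+1)", i < string.length - 1 ? category[i + 1] : "<END>");
            System.out.println(string[i] + " " + attrs + " => class=" + clazz[i]);
        }
    }
}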
In the usual case, in a GATE corpus pipeline application, documents are processed one at a time, and each PR is applied in turn to the document, processing it fully, before moving on to the next document. Machine learning PRs break from this rule. ML training algorithms typically run as a batch process over a training set, and require all the data to be fully prepared and passed to the algorithm in one go. This means that in training (or evaluation) mode, the PR will wait for all the documents to be processed and will then run as a single operation at the end. Therefore, learning PRs need to be positioned last in the pipeline. In application mode, the situation is slightly different, since the ML model has already been created, and the PR only applies it to the data, so the application PR can be positioned anywhere in the pipeline.
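The batch behaviour described above can be pictured with a small sketch (illustrative plain Java, not the GATE PR interface): while the corpus is being processed, the training PR only accumulates data, and the single training step happens after the last document.

import java.util.*;

/** Sketch of why a training PR must come last in the pipeline: it only
 *  collects data while documents stream past, and the actual training runs
 *  once, as a batch, after the whole corpus has been processed. */
public class TrainingPrPattern {
    private final List<String> collected = new ArrayList<>();

    // Called once per document, like a PR being applied over a corpus.
    void processDocument(String doc) {
        collected.add(doc);            // just gather training instances
    }

    // Called after the last document: the single batch training step.
    void finishCorpus() {
        System.out.println("Training one model on " + collected.size() + " documents");
    }

    public static void main(String[] args) {
        TrainingPrPattern pr = new TrainingPrPattern();
        for (String doc : List.of("doc1", "doc2", "doc3")) pr.processDocument(doc);
        pr.finishCorpus();             // model built in one go at the end
    }
}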