Chapter 19
Machine Learning

Machine learning in GATE is provided by the Learning Framework plugin, which is available through the plugin manager.

This section gives only a brief introduction. Much more extensive documentation, including a step-by-step tutorial, is available in the Learning Framework plugin's own documentation.


19.1 Brief introduction to machine learning in GATE

There are two main types of ML: supervised learning and unsupervised learning. Classification is a particular example of supervised learning, in which the training examples are divided into multiple subsets (classes) and the algorithm attempts to assign new examples to those existing classes. This is the type of ML that is used in GATE.

An ML algorithm ‘learns’ about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (unseen) examples of the phenomenon.

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. a statistical model, a decision tree, a rule set, etc.) from a dataset of already classified instances. During application, the model built during training is used to classify new instances.
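The two modes can be illustrated with a minimal sketch in plain Python. This is not the Learning Framework's API: the function names `train` and `apply_model` and the simple feature-count model are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def train(instances):
    """Training mode: build a model from already classified instances.

    `instances` is a list of (features, label) pairs, where `features`
    is a list of strings. The "model" here is simply a per-class count
    of how often each feature was seen (an illustrative choice).
    """
    counts = defaultdict(Counter)
    for features, label in instances:
        counts[label].update(features)
    return dict(counts)


def apply_model(model, features):
    """Application mode: classify a new instance with the trained model.

    Each class is scored by how often it has seen the instance's
    features; ties are broken by alphabetical order of the class name.
    """
    def score(label):
        return sum(model[label][f] for f in features)

    return max(sorted(model), key=score)


training_set = [
    (["good", "great"], "pos"),
    (["bad", "awful"], "neg"),
]
model = train(training_set)
prediction = apply_model(model, ["great", "film"])
```

The point of the sketch is the separation: `train` sees labelled data once and returns a model, while `apply_model` only reads that model and can be called on any number of unseen instances.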

The Learning Framework offers two main task types: text classification and chunk recognition.

Typically, different types of NLP learning use different linguistic features and feature representations. For example, it has been recognised that for text classification the so-called tf-idf representation of n-grams is very effective (e.g. with SVM). For chunk recognition, identifying the start token and the end token of the chunk by using the linguistic features of the token itself and the surrounding tokens is effective and efficient.
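As a rough illustration of the tf-idf representation of n-grams, the following sketch computes one common variant by hand: raw term frequency multiplied by log(N / document frequency). Other weighting variants exist, and the function names `ngrams` and `tfidf` are assumptions for this example, not part of GATE.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Compute tf-idf vectors (as dicts) for a list of tokenised documents.

    tf  = raw count of the n-gram in the document
    idf = log(N / df), where df is the number of documents that
          contain the n-gram and N is the number of documents
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per n-gram
    total = len(docs)
    return [
        {gram: tf * math.log(total / df[gram]) for gram, tf in c.items()}
        for c in counts
    ]


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
unigram_vectors = tfidf(docs, n=1)
bigram_vectors = tfidf(docs, n=2)
```

With this variant, an n-gram that occurs in every document (such as "the" above) receives weight zero, while n-grams specific to one document receive positive weight, which is what makes the representation useful for separating classes.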

Relation learning can be implemented using classification by first learning the entities involved in the relationship, then creating a new instance annotation for every possible pair, then classifying the pairs.
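The pair-generation step can be sketched as follows. The representation of an entity as a `(start, end, type)` tuple, the function name `candidate_pairs`, and the `gap` feature are assumptions for illustration; this sketch also generates unordered pairs, whereas a directed relation would need both orderings (for example via `itertools.permutations`).

```python
from itertools import combinations


def candidate_pairs(entities):
    """Create one candidate instance for every pair of entities.

    `entities` is a list of (start_offset, end_offset, type) tuples,
    assumed to have been found by an earlier learning or rule-based
    step. Each returned dict is an instance that a classifier could
    then label as "related" or "not related".
    """
    ordered = sorted(entities)  # order by start offset
    return [
        {
            "arg1": first,
            "arg2": second,
            # Simple illustrative feature: characters between the two
            "gap": second[0] - first[1],
        }
        for first, second in combinations(ordered, 2)
    ]


entities = [(0, 5, "Person"), (20, 26, "Organization"), (10, 15, "Location")]
pairs = candidate_pairs(entities)
```

Note that the number of candidates grows quadratically with the number of entities, so in practice pairs are often restricted, for instance to entities within the same sentence.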

Some important concepts to be familiar with are:

In the usual case, in a GATE corpus pipeline application, documents are processed one at a time, and each PR is applied in turn to the document, processing it fully, before moving on to the next document. Machine learning PRs break from this rule. ML training algorithms typically run as a batch process over a training set, and require all the data to be fully prepared and passed to the algorithm in one go. This means that in training (or evaluation) mode, the PR will wait for all the documents to be processed and will then run as a single operation at the end. Therefore, learning PRs need to be positioned last in the pipeline. In application mode, the situation is slightly different, since the ML model has already been created, and the PR only applies it to the data, so the application PR can be positioned anywhere in the pipeline.
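The difference between per-document processing and batch training can be sketched with a toy pipeline. None of the names below are GATE classes or methods: `InstanceBuilder`, `TrainingStep`, `execute`, `corpus_finished` and `run_pipeline` are hypothetical, and the "model" is just the majority class, chosen to keep the example short.

```python
from collections import Counter


class InstanceBuilder:
    """An ordinary step: fully processes each document as it arrives."""

    def execute(self, document):
        document["instances"] = [
            (word.lower(), document["label"])
            for word in document["text"].split()
        ]

    def corpus_finished(self):
        pass  # nothing left to do at the end of the corpus


class TrainingStep:
    """A learning step in training mode: collects now, trains at the end."""

    def __init__(self):
        self.instances = []
        self.model = None

    def execute(self, document):
        # Only accumulate; relies on earlier steps having prepared the data
        self.instances.extend(document["instances"])

    def corpus_finished(self):
        # Batch training over all collected instances in one go
        labels = Counter(label for _feature, label in self.instances)
        self.model = labels.most_common(1)[0][0]


def run_pipeline(corpus, steps):
    """Apply every step to each document in turn, then signal the end."""
    for document in corpus:
        for step in steps:
            step.execute(document)
    for step in steps:
        step.corpus_finished()


corpus = [
    {"text": "Great film", "label": "pos"},
    {"text": "Dull", "label": "neg"},
]
trainer = TrainingStep()
run_pipeline(corpus, [InstanceBuilder(), trainer])
```

In this sketch `trainer.model` stays `None` until the whole corpus has been seen, and `TrainingStep` must come after `InstanceBuilder` because it depends on the instances that step creates.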