Chapter 19
More (CREOLE) Plugins

For the previous reader was none other than myself. I had already read this book long ago.

The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf.

But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, because as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.

Three Stories and a Reflection, Patrick Suskind, 1995 (pp. 82, 86).

This chapter describes additional CREOLE resources which do not form part of ANNIE, and have not been covered in previous chapters.

19.1 Language Plugins

There are plugins available for processing the following languages: French, German, Spanish, Italian, Chinese, Arabic, Romanian, Hindi and Cebuano. Some of the applications are quite basic and just contain some useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are more like toy systems built as part of an exercise in language portability.

Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager in GATE Developer, and the relevant resources will be available in the Processing Resources set.

Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.

In most cases you will also find a directory in the relevant plugin directory called data which contains some sample texts (in some cases, these are annotated with NEs).

19.1.1 French Plugin

The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp), and one which does not (french.gapp). Simply load the application required from the plugins/Lang_French directory. You do not need to load the plugin itself from the GATE Developer’s Plugin Management Console. Note that the TreeTagger must first be installed and set up correctly (see Section 17.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results; they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/Lang_French/data directory.

19.1.2 German Plugin

The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp), and one which does not (german.gapp). Simply load the application required from the plugins/Lang_German/resources directory. You do not need to load the plugin itself from the GATE Developer’s Plugin Management Console. Note that the TreeTagger must first be installed and set up correctly (see Section 17.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/Lang_German/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.

19.1.3 Romanian Plugin

The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/Lang_Romanian/resources directory. You do not need to load the plugin itself from the GATE Developer’s Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.

19.1.4 Arabic Plugin

The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/Lang_Arabic/resources directory. You do not need to load the plugin itself from the GATE Developer’s Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that there are two types of gazetteer used in this application: one which was derived automatically from training data (Arabic inferred gazetteer), and one which was created manually. Note that there are some other applications included which perform quite specific tasks (but can generally be ignored). For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 13.7.

19.1.5 Chinese Plugin

The Chinese plugin contains two components: a simple application for Chinese NE recognition (chinese.gapp) and a component called “Chinese Segmenter”.

In order to use the former, simply load the application from the plugins/Lang_Chinese/resources directory. You do not need to load the plugin itself from the GATE Developer’s Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other applications to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).

For details on the Chinese Segmenter please see Section 19.12.

19.1.6 Hindi Plugin

The Hindi plugin (‘Lang_Hindi’) contains a set of resources for basic Hindi NE recognition which mirror the ANNIE resources but are customised to the Hindi language. You need to have the ANNIE plugin loaded first in order to load any of these PRs. With the Hindi plugin, you can create an application similar to ANNIE but replacing the ANNIE PRs with the default PRs from the plugin.

19.2 Flexible Exporter

The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be taken, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.

At load time, the following parameters can be set for the flexible exporter:

The following runtime parameters can also be set (after the file has been selected for the application):

19.3 Annotation Set Transfer

The Annotation Set Transfer allows copying or moving annotations to a new annotation set if they lie between the beginning and the end of an annotation of a particular type (the covering annotation). For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.
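As a rough sketch of the selection rule just described, the following plain Java program keeps only those annotations that lie between the start and end of a covering annotation. It does not use the GATE API and is not the PR's implementation: the Ann class, its fields, the offsets and the choice to skip annotations that straddle the covering span are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a stand-in for annotations, not the GATE API. */
final class TransferSketch {

  /** Hypothetical minimal annotation: a type plus start and end offsets. */
  static final class Ann {
    final String type;
    final long start;
    final long end;

    Ann(String type, long start, long end) {
      this.type = type;
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return type + "[" + start + "," + end + "]";
    }
  }

  /** Returns the annotations from source that lie entirely within covering. */
  static List<Ann> within(List<Ann> source, Ann covering) {
    List<Ann> out = new ArrayList<>();
    for (Ann a : source) {
      if (a.start >= covering.start && a.end <= covering.end) {
        out.add(a);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Ann body = new Ann("BODY", 20, 100);      // the covering annotation
    List<Ann> defaults = new ArrayList<>();
    defaults.add(new Ann("Token", 0, 5));     // in the header: skipped
    defaults.add(new Ann("Token", 25, 30));   // inside BODY: kept
    defaults.add(new Ann("Lookup", 25, 30));  // inside BODY: kept
    defaults.add(new Ann("Token", 95, 105));  // straddles the end: skipped here
    List<Ann> filtered = within(defaults, body);
    System.out.println(filtered);
  }
}
```

The returned list plays the role of the new annotation set over which the remaining NE resources would then be run.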

The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:

For example, suppose we wish to perform named entity recognition on only the text covered by the BODY annotation from the Original Markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document because these resources do not depend on any other annotations, so we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokeniser and gazetteer are initially placed in the Default annotation set).

19.4 Information Retrieval in GATE

GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for ‘Bush’ will return documents with higher relevance, compared to a search in the content for the string ‘bush’. The current implementation is based on the most popular open source full-text search engine - Lucene (available at http://jakarta.apache.org/lucene/) but other implementations may be added in the future.

An Information Retrieval system is most often considered a system that accepts as input a set of documents (corpus) and a query (combination of search terms) and returns as output only those documents from the corpus which are considered relevant according to the query. Usually, in addition to the documents, a relevance measure (score) is returned for each document. There exist many relevance metrics, but usually documents which are considered more relevant, according to the query, are scored higher.

Figure 19.1 shows the results from running a query against an indexed corpus in GATE.


Figure 19.1: Documents with scores, returned from a search over a corpus


        term1   term2   ...   termk
doc1    w1,1    w1,2    ...   w1,k
doc2    w2,1    w2,2    ...   w2,k
...     ...     ...     ...   ...
docn    wn,1    wn,2    ...   wn,k

Table 19.1: An information retrieval document-term matrix

Information Retrieval systems usually perform some preprocessing on the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as in Table 19.1, where doci is a document from the corpus, termj is a word that is considered important and representative for the document and wi,j is the weight assigned to the term in the document. There are many ways to define the term weight function, but most often it depends on the term frequency in the document and in the whole corpus (i.e. the local and the global frequency). Note that the machine learning plugin described in Chapter 15 can produce such a document-term matrix (for a detailed description of the matrix produced, see Section 15.2.4).

Note that not all of the words appearing in the document are considered terms. There are many words (called ‘stop-words’) which are ignored, since they are observed too often and are not representative enough. Such words are articles, conjunctions, etc. During the preprocessing phase which identifies such words, usually a form of stemming is performed in order to minimize the number of terms and to improve the retrieval recall. Various forms of the same word (e.g. ‘play’, ‘playing’ and ‘played’) are considered identical and multiple occurrences of the same term (probably ‘play’) will be observed.
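To make the preprocessing above more concrete, here is a minimal, self-contained Java sketch that drops stop-words and builds a small document-term matrix. It does not use GATE or Lucene and is not how either computes its weights: the class name, the tiny stop-word list and the weighting formula (term frequency multiplied by ln(N / document frequency)) are assumptions chosen for illustration, and no stemming is performed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: not part of GATE or Lucene. */
final class DocTermMatrixSketch {

  // A tiny, hypothetical stop-word list.
  static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("the", "a", "an", "and", "or", "of", "in"));

  /** Lower-cases the text, splits on non-letters and drops stop-words. */
  static List<String> terms(String text) {
    List<String> result = new ArrayList<>();
    for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
      if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
        result.add(w);
      }
    }
    return result;
  }

  /** The sorted set of all terms in the corpus (the matrix columns). */
  static List<String> vocabulary(List<String> docs) {
    Set<String> vocab = new TreeSet<>();
    for (String d : docs) {
      vocab.addAll(terms(d));
    }
    return new ArrayList<>(vocab);
  }

  /** w[i][j] = tf(term j in doc i) * ln(N / df(term j)). */
  static double[][] weights(List<String> docs) {
    List<String> vocab = vocabulary(docs);
    List<Map<String, Integer>> counts = new ArrayList<>();
    for (String d : docs) {
      Map<String, Integer> tf = new HashMap<>();  // local frequencies
      for (String t : terms(d)) {
        tf.merge(t, 1, Integer::sum);
      }
      counts.add(tf);
    }
    double[][] w = new double[docs.size()][vocab.size()];
    for (int j = 0; j < vocab.size(); j++) {
      String term = vocab.get(j);
      int df = 0;  // global frequency: number of documents containing the term
      for (Map<String, Integer> tf : counts) {
        if (tf.containsKey(term)) {
          df++;
        }
      }
      for (int i = 0; i < docs.size(); i++) {
        int tf = counts.get(i).getOrDefault(term, 0);
        w[i][j] = tf * Math.log((double) docs.size() / df);
      }
    }
    return w;
  }

  public static void main(String[] args) {
    List<String> docs =
        Arrays.asList("The government and the bank", "A bank in the city");
    System.out.println(vocabulary(docs));
    for (double[] row : weights(docs)) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

With this weighting, a term that occurs in every document gets weight zero, which illustrates why the global frequency matters as well as the local one.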

It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.

IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 10.1 for more details).
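For readers unfamiliar with these measures, the following self-contained Java sketch computes them for a single query, given the set of retrieved documents and the set of documents judged relevant. The class name, the document identifiers and the convention of returning 0 for an empty set are assumptions made for illustration; this is not GATE's evaluation code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: not part of GATE. */
final class PrecisionRecallSketch {

  /** Number of retrieved documents that are also relevant. */
  static int hits(Set<String> retrieved, Set<String> relevant) {
    Set<String> both = new HashSet<>(retrieved);
    both.retainAll(relevant);
    return both.size();
  }

  /** Fraction of retrieved documents that are relevant (0 if none retrieved). */
  static double precision(Set<String> retrieved, Set<String> relevant) {
    if (retrieved.isEmpty()) {
      return 0.0;
    }
    return (double) hits(retrieved, relevant) / retrieved.size();
  }

  /** Fraction of relevant documents that were retrieved (0 if none relevant). */
  static double recall(Set<String> retrieved, Set<String> relevant) {
    if (relevant.isEmpty()) {
      return 0.0;
    }
    return (double) hits(retrieved, relevant) / relevant.size();
  }

  public static void main(String[] args) {
    Set<String> retrieved =
        new HashSet<>(Arrays.asList("doc1", "doc2", "doc3", "doc4"));
    Set<String> relevant =
        new HashSet<>(Arrays.asList("doc1", "doc3", "doc5"));
    System.out.println("precision = " + precision(retrieved, relevant));
    System.out.println("recall    = " + recall(retrieved, relevant));
  }
}
```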

19.4.1 Using the IR Functionality in GATE

In order to run queries against a corpus, the latter should be ‘indexed’. The indexing process first processes the documents in order to identify the terms and their weights (stemming is performed too) and then creates the proper structures on the local file system. These file structures contain indexes that will be used by Lucene (the underlying IR engine) for the retrieval.

Once the corpus is indexed, queries may be run against it. Subsequently the index may be removed and then the structures on the local file system are removed too. Once the index is removed, queries cannot be run against the corpus.

Indexing the Corpus

In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in an RDBMS datastore (though support for the latter may be added in the future).

To index the corpus, follow these steps:


Figure 19.2: Indexing a corpus by specifying the index location and indexed features (and content)

Querying the Corpus

To query the corpus, follow these steps:

Removing the Index

An index for a corpus may be removed at any time from the ‘Remove Index’ option of the context menu for the indexed corpus (right button click).

19.4.2 Using the IR API

The IR API within GATE Embedded makes it possible for corpora to be indexed, queried and results returned from any Java application, without using GATE Developer. The following sample indexes a corpus, runs a query against it and then removes the index.

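// imports needed by this fragment
import gate.*;
import gate.creole.ir.*;
import gate.creole.ir.lucene.LuceneSearch;
import gate.persist.SerialDataStore;
import java.net.URL;
import java.util.Iterator;

// open a serial datastore (the location is a placeholder; use your own)
SerialDataStore sds =
  (SerialDataStore) Factory.openDataStore("gate.persist.SerialDataStore",
  "file:///tmp/datastore1");

// set an AUTHOR feature for the test document
// (a URL needs a protocol, hence the file: prefix)
Document doc0 = Factory.newDocument(new URL("file:/tmp/documents/doc0.html"));
doc0.getFeatures().put("author", "John Smith");

Corpus corp0 = Factory.newCorpus("TestCorpus");
corp0.add(doc0);

// store the corpus in the serial datastore
Corpus serialCorpus = (Corpus) sds.adopt(corp0, null);
sds.sync(serialCorpus);

// index the corpus - the content and the AUTHOR feature
IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;

DefaultIndexDefinition did = new DefaultIndexDefinition();
did.setIrEngineClassName(
  gate.creole.ir.lucene.LuceneIREngine.class.getName());
// the index location is a placeholder; use your own
did.setIndexLocation("/tmp/index1");
did.addIndexField(new IndexField("content",
  new DocumentContentReader(), false));
did.addIndexField(new IndexField("author", null, false));
indexedCorpus.setIndexDefinition(did);
indexedCorpus.getIndexManager().createIndex();
// the corpus is now indexed

// search the corpus
Search search = new LuceneSearch();
search.setCorpus(indexedCorpus);
QueryResultList res = search.search("+content:government +author:John");

// get the results
Iterator it = res.getQueryResults();
while (it.hasNext()) {
  QueryResult qr = (QueryResult) it.next();
  System.out.println("DOCUMENT_ID=" + qr.getDocumentID()
    + ",   score=" + qr.getScore());
}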

19.5 Websphinx Web Crawler

The plugin ‘Web_Crawler_Websphinx’ enables GATE to build a corpus from a web crawl. The crawler itself is Websphinx, a Java-based multi-threaded web crawler that can be customized for any application.

N.B. If you are using this plugin via an IDE, you may need to make sure that the websphinx.jar file is on the IDE’s classpath, or add it to the IDE’s lib directory.

The basic idea is to be able to specify a source URL and a depth to build the initial corpus upon which further processing could be done. The PR itself provides a number of helpful features to set various parameters of the crawl.

19.5.1 Using the Crawler PR

In order to use the processing resource you first need to load the plugin using the plugin manager. Then load the crawler from the list of processing resources. You also need to create a corpus in which to store the crawled documents. In order to use the crawler, create a simple pipeline (note: do not create a corpus pipeline) and add the crawl PR to the pipeline.

Once the crawl PR is created there will be a number of parameters that can be set based on the PR required (see also Figure 19.3).


Figure 19.3: Crawler parameters

Once the parameters are set, the crawl can be run and the documents fetched are added to the specified corpus. Figure 19.4 shows the crawled pages added to the corpus.


Figure 19.4: Crawled pages added to the corpus

N.B. You must use a simple Pipeline, and not a Corpus Pipeline. If you wish to process the crawled documents, you must build a second Corpus Pipeline. From GATE Version 5.1, you can combine the two pipelines as follows. Build a simple Pipeline containing your Web Crawler. Let’s call this Pipeline Ps. Let’s say you set the Corpus on your crawler to C. Now build your processing corpus pipeline, which we will call Pc. Put the original pipeline Ps as its first PR. Set its corpus to be the same corpus as before, C.

19.6 Google Plugin

This plugin is no longer operational because the functionality, provided by Google, on which it depends, is no longer available.

19.7 Yahoo Plugin

The Yahoo API is now integrated with GATE, and can be used as a PR-based plugin. This plugin, ‘Web_Search_Yahoo’, allows the user to query Yahoo and build a document corpus that contains the search results returned by Yahoo for the query. For more information about the Yahoo API please refer to http://developer.yahoo.com/search/. In order to use the Yahoo PR, you need to obtain an application ID.

The Yahoo PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find the different named entities that can be associated with a particular individual. In this example, the user could build a collection of documents by querying Yahoo with the individual’s name and then running ANNIE over the collection. This would annotate the results and show the different Organization, Location and other entities that are associated with the query.

19.7.1 Using the YahooPR

In order to use the PR, you first need to load the plugin using the GATE Developer plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Yahoo Application ID. Please use the license key assigned to you by registering with Yahoo.

Once the Yahoo PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Yahoo PR just initialized as above. There are a number of parameters to be set at runtime:

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus.

19.8 WordNet in GATE


Figure 19.5: WordNet in GATE – results for ‘bank’


Figure 19.6: WordNet in GATE

At present GATE supports only WordNet 1.6, so in order to use WordNet in GATE, you must first install WordNet 1.6 on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing one special xml file that is used internally by JWNL. This file describes the location of your local copy of the WordNet 1.6 index files. An example of this wn-config.xml file is shown below:

<?xml version="1.0" encoding="UTF-8"?>  
<jwnl_properties language="en">  
  <version publisher="Princeton" number="1.6" language="en"/>  
  <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">  
  <resource class="PrincetonResource"/>  

All you have to do is change the value of the dictionary_path parameter to point to your local installation of WordNet 1.6.

After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE Developer, load the WordNet plugin via the Plugin Management Console. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).

Once WordNet is loaded in GATE Developer, the well-known interface of WordNet will appear. You can search WordNet by typing a word in the box next to the label ‘SearchWord’ and then pressing ‘Search’. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word ‘play’, the buttons ‘Noun’, ‘Verb’ and ‘Adjective’ are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms, meronyms for nouns or verb groups, and cause for verbs, etc. Selecting an item from the menu will display the results in the window below.

More information about WordNet can be found at http://wordnet.princeton.edu/

More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet

An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/GateExamples/doc/index.html.

19.8.1 The WordNet API

GATE Embedded offers a set of classes that can be used to access the WordNet 1.6 Lexical Base. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 19.7. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:


Figure 19.7: The Wordnet API

19.9 Kea - Automatic Keyphrase Detection

Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.

This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.

In order to use Kea in GATE Developer, the ‘Keyphrase_Extraction_Algorithm’ plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the ‘KEA Keyphrase Extractor’ (a processing resource) and the ‘KEA Corpus Importer’ (a visual resource associated with the PR).

19.9.1 Using the ‘KEA Keyphrase Extractor’ PR

Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the ‘KEA Corpus Importer’ tool. The usage of this tool is presented in a subsection below.

Once an annotated corpus is obtained, the ‘KEA Keyphrase Extractor’ PR can be used to build a model:

  1. Load a ‘KEA Keyphrase Extractor’.
  2. Create a new ‘Corpus Pipeline’ controller.
  3. Set the corpus for the controller.
  4. Set the ‘trainingMode’ parameter for the PR to ‘true’.
  5. Run the application.

After these steps, the Kea PR contains a trained model. This can be used immediately by switching the ‘trainingMode’ parameter to ‘false’ and running the PR over the documents that need to be annotated with keyphrases. Another possibility is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the ‘Save model’ option.

When a previously built model is available, the training procedure does not need to be repeated; the existing model can be loaded into memory by selecting the ‘Load model’ option in the PR’s context menu.


Figure 19.8: Parameters used by the Kea PR

The Kea PR uses several parameters as seen in Figure 19.8:

The document to be processed.
The input annotation set. This parameter is only relevant when the PR is running in training mode and it specifies the annotation set containing the keyphrase annotations.
The output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the ‘trainingMode’ parameter is set to false) and it specifies the annotation set where the generated keyphrase annotations will be saved.
The minimum length (in number of words) for a keyphrase.
The minimum number of occurrences of a phrase for it to be a keyphrase.
The maximum length of a keyphrase.
The number of different keyphrases to be generated.
The type of annotations used for keyphrases.
Whether internal periods should be disallowed.
Whether the PR runs in training mode: if ‘true’ the PR is running in training mode; otherwise it is running in application mode.
Whether the K-frequency should be used.

19.9.2 Using Kea Corpora

The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the ‘KEA Corpus Importer’ tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR’s name in the resources tree and then selecting the ‘KEA Corpus Importer’ tab, see Figure 19.9.


Figure 19.9: Options for the ‘KEA Corpus Importer’

The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.

The user needs to specify a few values:

Source Directory
the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.
Extension for text files
the extension used for text files (by default .txt).
Extension for keyphrase files
the extension for the files listing keyphrases.
Encoding for input files
the encoding to be used when reading the files.
Corpus name
the name for the GATE corpus that will be created.
Output annotation set
the name for the annotation set that will contain the keyphrases read from the input files.
Keyphrase annotation type
the type for the generated annotations.

19.10 Ontotext JapeC Compiler

Note: the JapeC compiler does not currently support the new JAPE language features introduced in July–September 2008. If you need to use negation, the @length and @string accessors, the contextual operators within and contains, or any comparison operators other than ==, then you will need to use the standard JAPE transducer instead of JapeC.

JapeC is an alternative implementation of the JAPE language which works by compiling JAPE grammars into Java code. Compared to the standard implementation, these compiled grammars can be several times faster to run. At Ontotext, a modified version of the ANNIE sentence splitter using compiled grammars has been found to run up to five times as fast as the standard version. The compiler can be invoked manually from the command line, or used through the ‘Ontotext Japec Compiler’ PR in the Jape_Compiler plugin.

The ‘Ontotext Japec Transducer’ (com.ontotext.gate.japec.JapecTransducer) is a processing resource that is designed to be an alternative to the original Jape Transducer. You can simply replace gate.creole.Transducer with com.ontotext.gate.japec.JapecTransducer in your GATE application and it should work as expected.

The Japec transducer takes the same parameters as the standard JAPE transducer:

the URL from which the grammar is to be loaded. Note that the Japec Transducer will only work on file: URLs. Also, the alternative binaryGrammarURL parameter of the standard transducer is not supported.
the character encoding used to load the grammars.
the ontology used for ontology-aware transduction.

Its runtime parameters are likewise the same as those of the standard transducer:

the document to process.
name of the AnnotationSet from which input annotations to the transducer are read.
name of the AnnotationSet to which output annotations from the transducer are written.

The Japec compiler itself is written in Haskell. Compiled binaries are provided for Windows, Linux (x86) and Mac OS X (PowerPC), so no Haskell interpreter is required to run Japec on these platforms. For other platforms, or if you make changes to the compiler source code, you can build the compiler yourself using the Ant build file in the Jape_Compiler plugin directory. You will need to install the latest version of the Glasgow Haskell Compiler and associated libraries. The Japec compiler can then be built by running:

../../bin/ant japec.clean japec

from the Jape_Compiler plugin directory.

19.11 Annotation Merging Plugin

If we have annotations about the same subject on the same document from different annotators, we may need to merge the annotations.

This plugin implements two approaches for annotation merging.

MajorityVoting takes a parameter numMinK and selects the annotation on which at least numMinK annotators agree. If two or more merged annotations have the same span, then the annotation with the most supporters is kept and other annotations with the same span are discarded.

MergingByAnnotatorNum selects one annotation from those annotations with the same span, which the majority of the annotators support. Note that if one annotator did not create the annotation with the particular span, we count it as one non-support of the annotation with the span. If it turns out that the majority of the annotators did not support the annotation with that span, then no annotation with the span would be put into the merged annotations.
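The two approaches can be sketched in plain Java as follows. This is only one reading of the descriptions above, not the plugin's implementation, and it does not use the GATE API. It assumes that each annotator is a map from a span (written here as a hypothetical "start-end" string) to a single label, that ties are broken in favour of the label seen first, and that "majority" means more than half of all annotators.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: not the GATE Annotation Merging implementation. */
final class MergingSketch {

  /** Builds one annotator's annotations from span, label, span, label, ... */
  static Map<String, String> annotator(String... spanLabelPairs) {
    Map<String, String> m = new LinkedHashMap<>();
    for (int i = 0; i + 1 < spanLabelPairs.length; i += 2) {
      m.put(spanLabelPairs[i], spanLabelPairs[i + 1]);
    }
    return m;
  }

  /** Keeps, per span, the best-supported label if it has >= minSupport votes. */
  static Map<String, String> keepBest(List<Map<String, String>> annotators,
                                      int minSupport) {
    // span -> (label -> number of annotators who chose that label)
    Map<String, Map<String, Integer>> votes = new LinkedHashMap<>();
    for (Map<String, String> annotator : annotators) {
      for (Map.Entry<String, String> e : annotator.entrySet()) {
        votes.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>())
             .merge(e.getValue(), 1, Integer::sum);
      }
    }
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, Integer>> span : votes.entrySet()) {
      String best = null;
      int bestCount = 0;
      for (Map.Entry<String, Integer> label : span.getValue().entrySet()) {
        if (label.getValue() > bestCount) {  // strict: first label wins ties
          best = label.getKey();
          bestCount = label.getValue();
        }
      }
      if (bestCount >= minSupport) {
        merged.put(span.getKey(), best);
      }
    }
    return merged;
  }

  /** At least numMinK annotators must agree on the annotation. */
  static Map<String, String> majorityVoting(
      List<Map<String, String>> annotators, int numMinK) {
    return keepBest(annotators, numMinK);
  }

  /** More than half of all annotators must agree; a missing span is a non-support. */
  static Map<String, String> mergingByAnnotatorNum(
      List<Map<String, String>> annotators) {
    return keepBest(annotators, annotators.size() / 2 + 1);
  }

  public static void main(String[] args) {
    List<Map<String, String>> annotators = Arrays.asList(
        annotator("0-5", "Person", "10-20", "Organization"),
        annotator("0-5", "Person", "10-20", "Location"),
        annotator("0-5", "Location", "30-35", "Date"));
    System.out.println(majorityVoting(annotators, 2));
    System.out.println(majorityVoting(annotators, 1));
    System.out.println(mergingByAnnotatorNum(annotators));
  }
}
```

In this example only the span 0-5 has a label supported by two of the three annotators, so it is the only span that survives when agreement between at least two annotators is required.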

The annotation merging methods are available via the Annotation Merging plugin. The plugin can be used as a PR in a pipeline or corpus pipeline. To use the PR, each document in the pipeline or the corpus pipeline should have the annotation sets for merging. The annotation merging PR has no loading parameters but has several run-time parameters, explained further below.

The annotation merging methods are implemented in the GATE API, and are available in GATE Embedded as described in Section 7.17.


19.12 Chinese Word Segmentation

Unlike English, Chinese text does not have a symbol (or delimiter) such as blank space to explicitly separate a word from the surrounding words. Therefore, for automatic Chinese text processing, we may need a system to recognise the words in Chinese text, a problem known as Chinese word segmentation. The plugin described in this section performs the task of Chinese word segmentation. It is based on our work using the Perceptron learning algorithm for the Chinese word segmentation task of Sighan 2005 [Li et al. 05c]. Our Perceptron-based system achieved very good performance in the Sighan-05 task.

The plugin is called Lang_Chinese and is available in the GATE distribution. The corresponding processing resource’s name is Chinese Segmenter PR. Once you load the PR into GATE, you may put it into a Pipeline application. Note that it does not process a corpus of documents, but a directory of documents provided as a parameter (see description of parameters below). The plugin can be used to learn a model from segmented Chinese text as training data. It can also use the learned model to segment Chinese text. The plugin can use different learning algorithms to learn different models. It can deal with different character encodings for Chinese text, such as UTF-8, GB2312 or BIG5. These options can be selected by setting the run-time parameters of the plugin.

The plugin has five run-time parameters, which are described in the following.

The following PAUM models are distributed with the plugin and are available as compressed zip files under the plugins/Lang_Chinese/resources/models directory. Please unzip them before use. These models were learned using the PAUM learning algorithm from the corpora provided by the Sighan-05 bakeoff task.

As you can see, those models were learned using different training data and different Chinese text encodings of the same training data. The PKU training data are news articles published in mainland China and use simplified Chinese, while the AS training data are news articles published in Taiwan and use traditional Chinese. If your text is in simplified Chinese, you can use the models trained on the PKU data. If your text is in traditional Chinese, you need to use the models trained on the AS data. If your data are in GB2312 encoding or any compatible encoding, you need to use the model trained on the corpus in GB2312 encoding.

Note that the segmented Chinese text (either used as training data or produced by this plugin) uses the blank space to separate a word from its surrounding words. Hence, if your data are in Unicode such as UTF-8, you can use the GATE Unicode Tokeniser to process the segmented text to add the Token annotations into your text to represent the Chinese words. Once you get the annotations for all the Chinese words, you can perform further processing such as POS tagging and named entity recognition.
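To make the role of the blank space concrete, here is a minimal Java sketch that turns a piece of segmented text into word spans with character offsets. It is only a stand-in for what a tokeniser does with such text and is not the GATE Unicode Tokeniser. The class SegmentedTextTokens and its output format are hypothetical.

```java
import java.util.*;

final class SegmentedTextTokens {
    /** Splits space-separated segmented text into "start-end:word" entries (offsets are Java char indices). */
    static List<String> tokens(String segmented) {
        List<String> out = new ArrayList<>();
        int i = 0, n = segmented.length();
        while (i < n) {
            // Skip the whitespace that separates words.
            while (i < n && Character.isWhitespace(segmented.charAt(i))) i++;
            int start = i;
            // Consume one word: a maximal run of non-whitespace characters.
            while (i < n && !Character.isWhitespace(segmented.charAt(i))) i++;
            if (i > start) out.add(start + "-" + i + ":" + segmented.substring(start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three segmented words written with Unicode escapes so the source stays ASCII-only.
        System.out.println(tokens("\u6211 \u7231 \u5317\u4eac"));
    }
}
```

Each resulting span corresponds to one Chinese word, which is the unit that later steps such as POS tagging would work on.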

19.13 Copying Annotations between Documents

Sometimes a document has two copies, each of which was annotated by different annotators for the same task. We may want to copy the annotations in one copy to the other copy of the document. This could be in order to use fewer resources, or so that we can process them with some other plugin, such as annotation merging or IAA. The Copy_Annots_Between_Docs plugin does exactly this.

The plugin is available with the GATE distribution. When loading the plugin into GATE, it is represented as a processing resource, Copy Anns to Another Doc PR. You need to put the PR into a Corpus Pipeline to use it. The plugin does not have any initialisation parameters. It has several run-time parameters, which specify the annotations to be copied, the source documents and target documents. In detail, the run-time parameters are:

The Corpus parameter of the Corpus Pipeline application containing the plugin specifies a corpus which contains the target documents. Given one (target) document in the corpus, the plugin tries to find a source document in the source directory specified by the parameter sourceFilesURL, according to the similarity of the names of the source and target documents. The similarity of two file names is calculated by comparing the two strings of names from the start to the end of the strings. Two names have greater similarity if they share more characters from the beginning of the strings. For example, suppose two target documents have the names aabcc.xml and abcab.xml and three source files have names abacc.xml, abcbb.xml and aacc.xml, respectively. Then the target document aabcc.xml has the corresponding source document aacc.xml, and abcab.xml has the corresponding source document abcbb.xml.
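The name matching described above can be sketched as a longest-common-prefix comparison. The Java below is an illustrative reconstruction, not the plugin's source: the class PrefixNameMatcher is hypothetical, and the tie-breaking rule (the first candidate with the longest shared prefix wins) is an assumption.

```java
import java.util.*;

final class PrefixNameMatcher {
    /** Number of leading characters that the two names share. */
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Returns the source name sharing the longest prefix with the target, or null if there are no sources. */
    static String bestSource(String target, List<String> sources) {
        String best = null;
        int bestLen = -1;
        for (String s : sources) {
            int len = commonPrefixLength(target, s);
            if (len > bestLen) { best = s; bestLen = len; }  // strict '>' keeps the first candidate on ties
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList("abacc.xml", "abcbb.xml", "aacc.xml");
        System.out.println(bestSource("aabcc.xml", sources));
        System.out.println(bestSource("abcab.xml", sources));
    }
}
```

Applied to the file names from the example, aabcc.xml shares two leading characters with aacc.xml and only one with the other sources, while abcab.xml shares three with abcbb.xml, which reproduces the pairing given in the text.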

19.14 OpenCalais Plugin

OpenCalais provides a web service for semantic annotation of text. The user submits a document to the web service, which returns entity and relation annotations in RDF, JSON or some other format. Typically, users integrate OpenCalais annotation of their web pages to provide additional links and ‘semantic functionality’. OpenCalais can be found at http://www.opencalais.com

The GATE OpenCalais PR submits a GATE document to the OpenCalais web service, and adds the annotations from the OpenCalais response as GATE annotations in the GATE document. It therefore provides OpenCalais semantic annotation functionality within GATE, for use by other PRs.

The PR only supports OpenCalais entities, not relations - although this should be straightforward for a competent Java programmer to add. Each OpenCalais entity is represented in GATE as an OpenCalais annotation, with features as given in the OpenCalais documentation.

The PR can be loaded with the CREOLE plugin manager dialog, from the creole directory in the GATE distribution, gate/plugins/Tagger_OpenCalais. In order to use the PR, you will need to have an OpenCalais account, and request an OpenCalais service key. You can do this from the OpenCalais web site at http://www.opencalais.com. Provide your service key as an initialisation parameter when you create a new OpenCalais PR in GATE. OpenCalais restricts the number of requests you can make to their web service. See the OpenCalais web page for details.

Initialisation parameters are:

Various runtime parameters are available from the OpenCalais API, and are named the same as in that API. See the OpenCalais documentation for further details.

19.15 LingPipe Plugin

LingPipe is a suite of Java libraries for the linguistic analysis of human language (see footnote 3). We have provided a plugin called ‘LingPipe’ with wrappers for some of the resources available in the LingPipe library. In order to use these resources, please load the ‘LingPipe’ plugin. Currently, we have integrated the following five processing resources.

Please note that most of the resources in the LingPipe library allow learning of new models. However, in this version of the GATE plugin for LingPipe, we have only integrated the application functionality. You will need to learn new models with LingPipe outside of GATE. We have provided some example models under the ‘resources’ folder which were downloaded from LingPipe’s website. For more information on licensing issues related to the use of these models, please refer to the licensing terms under the LingPipe plugin directory.

The LingPipe system can be loaded from the GATE GUI by simply selecting the ‘Load LingPipe System’ menu item under the ‘File’ menu. This is similar to loading the ANNIE application with default values.

19.15.1 LingPipe Tokenizer PR

As the name suggests, this PR tokenizes document text and identifies the boundaries of tokens. Each token is annotated with an annotation of type ‘Token’. Every annotation has a feature called ‘length’ that gives the length of the token in characters. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Token annotations.

19.15.2 LingPipe Sentence Splitter PR

As the name suggests, this PR splits document text into sentences. It identifies sentence boundaries and annotates each sentence with an annotation of type ‘Sentence’. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Sentence annotations.

19.15.3 LingPipe POS Tagger PR

The LingPipe POS Tagger PR is useful for tagging individual tokens with their respective part of speech tags. This PR requires a model which it then uses to tag the tokens. An example model is provided under the ‘resources’ folder of this plugin. It must be provided at initialization time. It is a prerequisite of this PR that the document has been processed with a Tokenizer and a Sentence Splitter. In other words, it expects annotations of type ‘Token’ and ‘Sentence’ to be available in the document. The Tokenizer and Sentence Splitter can be any such PR from GATE, and need not be the LingPipe variants. This PR adds a feature called ‘category’ to each token.

Below we list the runtime parameters for this PR.

19.15.4 LingPipe NER PR

The LingPipe NER PR is used for named entity recognition. The PR recognizes entities such as Persons, Organizations and Locations in the text. This PR requires a model which it then uses to classify text as different entity types. An example model is provided under the ‘resources’ folder of this plugin. It must be provided at initialization time. Like the other PRs, this PR expects users to provide the name of the annotation set where the PR should output annotations.

19.15.5 LingPipe Language Identifier PR

As the name suggests, this PR is useful for identifying the language of a document. This PR requires a model file which it then uses to identify the language of the document. An example model is provided under the ‘resources’ folder of this plugin. It must be provided at initialization time. Unlike other PRs which produce annotations, this PR adds a document feature. The name of the document feature can be specified as a runtime parameter. More information on the languages supported by the PR can be found at the following URL: http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

19.16 OpenNLP Plugin

OpenNLP provides Java-based tools for sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference. See the OpenNLP website for details: http://opennlp.sourceforge.net/. The tools use the Maxent machine learning package. See http://maxent.sourceforge.net/ for details.

In order to use these tools as GATE processing resources, load the ‘OpenNLP’ plugin via the Plugin Management Console. Alternatively, the OpenNLP system can be loaded from the GATE GUI by simply selecting the ‘Load OpenNLP System’ menu item under the ‘File’ menu. The OpenNLP PRs will be loaded, together with a pre-configured corpus pipeline application containing these PRs. This is similar to loading the ANNIE application with default values.

We have integrated the following five processing resources:

Note that in general, these PRs can be mixed with other PRs of similar types. For example, you could create a pipeline that uses the OpenNLP Tokenizer, and the ANNIE POS Tagger. You may occasionally have problems with some combinations. Notes on compatibility and PR prerequisites are given for each PR in Section 19.16.2.

Note also that some of the OpenNLP tools use quite large machine learning models, which the PRs need to load into memory. You may find that you have to give additional memory to GATE in order to use the OpenNLP PRs comfortably. See Section 2.7.5 for an example of how to do this.

Below, we describe the parameters common to all of the OpenNLP PRs. This is followed by a section which gives brief details of each PR. For more details on each, see the OpenNLP website, http://opennlp.sourceforge.net/.

19.16.1 Parameters common to all PRs

Load-time Parameters

All OpenNLP PRs have a ‘model’ parameter, which takes a URL. The URL should reference a valid Maxent model, or in the case of the Name Finder a directory containing a set of models for the different types of name sought. Default models can be found in the ‘models/english’ directory. In addition, the OpenNlpPOS POS Tagger PR has a ‘dictionary’ parameter, which also takes a URL, and a ‘dictionaryEncoding’ parameter giving the character encoding of the dictionary file. The default can be found in the ‘models/english’ directory.

For details of training new models (outside of the GATE framework), see Section 19.16.3.

Run-time Parameters

The OpenNLP PRs have runtime parameters to specify the annotation sets they should use for input and/or output. These are detailed below in the description of each PR, but all PRs will use the default unnamed annotation set unless told otherwise.

19.16.2 OpenNLP PRs

OpenNlpTokenizer - Tokenizer PR

This PR adds Token annotations to the annotation set specified by the annotationSetName parameter.

This PR does not require any other PR to be run beforehand. It creates annotations of type Token, with a feature and value ‘source=openNLP’ and a string feature that takes the underlying string as its value.

OpenNlpSentenceSplit - Sentence Splitter PR

This PR adds Sentence annotations to the annotation set specified by the annotationSetName parameter.

This PR does not require any other PR to be run beforehand. It creates annotations of type Sentence, with a feature and value ‘source=openNLP’.

OpenNlpPOS - POS Tagger PR

This PR adds a feature for Part Of Speech to Token annotations.

This PR requires Sentence and Token annotations to be present in the annotation set specified by its annotationSetName parameter before it will work. These Sentence and Token annotations do not have to be from another OpenNLP PR. They could, for example, be from the ANNIE PRs. This PR adds a ‘category’ feature to each Token, with the predicted Part Of Speech as value.

OpenNlpNameFinder - NER PR

This PR finds standard Named Entities, adding them as Name annotations.

This PR requires Sentence and Token annotations to be present in the annotation set specified by its inputASName parameter before it will work. These Sentence and Token annotations do not have to be from another OpenNLP PR. They could, for example, be from the ANNIE PRs. You may find, however, that not all pairings of Tokenizer and Sentence Splitter will work successfully.

The Token annotations do not need to have a ‘category’ feature. In other words, you do not need to run a POS Tagger before using this PR.

This PR creates annotations of type ‘Name’ for each named entity in the annotation set specified by the outputASName parameter, with a feature and value ‘source=openNLP’. It also adds a feature of ‘type’ to each one, with values such as:

For full details of all types, see the OpenNLP website, http://opennlp.sourceforge.net/.

OpenNlpChunker - Chunker PR

This PR finds noun, verb, and other chunks, adding their position as features on Token annotations.

This PR requires Sentence and Token annotations to be present in the annotation set specified by the annotationSetName parameter before it will work. The Token annotations need to have a ‘category’ feature. In other words, you also need to run a POS Tagger before using this PR. The Sentence and Token annotations (and ‘category’ POS tag features) do not have to be from another OpenNLP PR. They could, for example, be from the ANNIE PRs.

This PR adds a ‘chunk’ feature to each token. The value of this feature defines whether the token is at the beginning of a chunk, inside a chunk, or outside a chunk, using the standard BIO model. Example values and their interpretations are:

For full details of all chunk values, see the OpenNLP website, http://opennlp.sourceforge.net/.
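To show how per-token BIO values can be read back as chunks, the following Java sketch groups tokens by their ‘chunk’ value. It assumes the conventional BIO tag shapes (‘B-NP’ begins a noun chunk, ‘I-NP’ continues it, ‘O’ is outside any chunk). The class BioChunkDecoder and the lenient handling of an ‘I-’ tag that has no preceding ‘B-’ tag are hypothetical choices for this illustration, and the code does not use the GATE or OpenNLP APIs.

```java
import java.util.*;

final class BioChunkDecoder {
    /** Groups tokens into "TYPE:token token ..." chunks according to their BIO tags. */
    static List<String> decode(List<String> tokens, List<String> tags) {
        List<String> chunks = new ArrayList<>();
        String type = null;                       // type of the chunk currently being built, if any
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            boolean continues = tag.startsWith("I-") && tag.substring(2).equals(type);
            if (!continues) {
                if (type != null) chunks.add(type + ":" + text);   // close the open chunk
                type = null;
                text.setLength(0);
                // "B-X" opens a chunk; an "I-X" that continues nothing is treated as an opening tag too.
                if (tag.startsWith("B-") || tag.startsWith("I-")) type = tag.substring(2);
            }
            if (type != null) {
                if (text.length() > 0) text.append(' ');
                text.append(tokens.get(i));
            }
        }
        if (type != null) chunks.add(type + ":" + text);           // close a chunk that runs to the end
        return chunks;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "sat", "on", "the", "mat", ".");
        List<String> tags = Arrays.asList("B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "O");
        System.out.println(decode(tokens, tags));
    }
}
```

Tokens tagged ‘O’, such as the final full stop in the sample, do not appear in any chunk.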

19.16.3 Training new models

Within the OpenNLP framework, new models can be trained for each of the tools. By default, the GATE PRs use the standard Maxent models for English which can be found in the plugin’s ‘models/english’ directory. The models are copies of those in the ‘models’ module in the OpenNLP CVS repository at http://opennlp.cvs.sourceforge.net. If you need to train a different model for an OpenNLP PR, you will have to do this outside of GATE, and then use the file URL of your new model as a value for the ‘model’ parameter of the PR. For details on how to train models, see the Maxent website http://maxent.sourceforge.net/.

19.17 Inter Annotator Agreement

The IAA plugin, “Inter_Annotator_Agreement”, computes interannotator agreement measures for various tasks. For named entity annotations, it computes the F-measures, namely Precision, Recall and F1, for two or more annotation sets. For text classification tasks, it computes Cohen’s kappa and some other IAA measures which are more suitable than the F-measures for the task. This plugin is fully documented in Section 10.5. Chapter 10 introduces various measures of interannotator agreement and describes a range of tools provided in GATE for calculating them.
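For orientation, the two kinds of measure mentioned above can be sketched in a few lines of Java. This is a simplified illustration, not the plugin's implementation: the class AgreementSketch is hypothetical, annotations are reduced to ‘start-end:type’ strings that must match exactly, and Cohen's kappa is computed for two annotators only.

```java
import java.util.*;

final class AgreementSketch {
    /** Returns {precision, recall, F1} of the response set measured against the key set (exact match). */
    static double[] prf(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(response);
        correct.retainAll(key);                               // annotations found in both sets
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] { p, r, f1 };
    }

    /** Cohen's kappa for two annotators who each labelled the same n items. */
    static double cohenKappa(String[] a, String[] b) {
        int n = a.length, agree = 0;
        Map<String, Integer> countA = new HashMap<>(), countB = new HashMap<>();
        for (int i = 0; i < n; i++) {
            if (a[i].equals(b[i])) agree++;
            countA.merge(a[i], 1, Integer::sum);
            countB.merge(b[i], 1, Integer::sum);
        }
        double observed = (double) agree / n;
        double expected = 0.0;                                // agreement expected by chance
        for (Map.Entry<String, Integer> e : countA.entrySet()) {
            expected += ((double) e.getValue() / n) * ((double) countB.getOrDefault(e.getKey(), 0) / n);
        }
        // Undefined (NaN) when expected agreement is 1, e.g. both annotators used a single label throughout.
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(Arrays.asList("0-5:Person", "10-15:Org"));
        Set<String> response = new HashSet<>(Arrays.asList("0-5:Person", "20-25:Date"));
        System.out.println(Arrays.toString(prf(key, response)));
        System.out.println(cohenKappa(
            new String[] { "pos", "pos", "neg", "neg" },
            new String[] { "pos", "neg", "neg", "neg" }));
    }
}
```

The F-measures depend on which annotation set is treated as the key, whereas kappa is symmetric and corrects the observed agreement for the agreement expected by chance.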

19.18 Balanced Distance Metric Computation

The BDM (balanced distance metric) measures the closeness of two concepts in an ontology or taxonomy [Maynard 05, Maynard et al. 06]. It is a real number between 0 and 1. The closer the two concepts are in an ontology, the greater their BDM score is. The plugin, “Ontology_BDM_Computation”, is described more fully in Section 10.6.

19.19 Schema Annotation Editor

The plugin ‘Schema_Annotation_Editor’ constrains the annotation editor to permitted types. See Section 3.4.6 for more information.

1 GHC version 6.4.1 was used to build the supplied binaries for Windows, Linux and Mac.

2 See http://www.sighan.org/bakeoff2005/ for the Sighan-05 task.

3 See http://alias-i.com/lingpipe/