Chapter 19
More (CREOLE) Plugins [#]
For the previous reader was none other than myself. I had already read this book long ago.
The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf.
…
But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, because as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.
Three Stories and a Reflection, Patrick Suskind, 1995 (pp. 82, 86).
This chapter describes additional CREOLE resources which do not form part of ANNIE, and have not been covered in previous chapters.
19.1 Language Plugins [#]
There are plugins available for processing the following languages: French, German, Spanish, Italian, Chinese, Arabic, Romanian, Hindi and Cebuano. Some of the applications are quite basic and just contain some useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are more like toy systems built as part of an exercise in language portability.
Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager in GATE Developer, and the relevant resources will be available in the Processing Resources set.
Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.
In most cases you will also find a directory in the relevant plugin directory called data which contains some sample texts (in some cases, these are annotated with NEs).
19.1.1 French Plugin [#]
The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp), and one which does not (french.gapp). Simply load the application required from the plugins/Lang_French directory. You do not need to load the plugin itself from the GATE Developer plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 17.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results; they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/Lang_French/data directory.
19.1.2 German Plugin [#]
The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp), and one which does not (german.gapp). Simply load the application required from the plugins/Lang_German/resources directory. You do not need to load the plugin itself from the GATE Developer plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 17.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/Lang_German/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.
19.1.3 Romanian Plugin [#]
The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/Lang_Romanian/resources directory. You do not need to load the plugin itself from the GATE Developer plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.
19.1.4 Arabic Plugin [#]
The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/Lang_Arabic/resources directory. You do not need to load the plugin itself from the GATE Developer plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that there are two types of gazetteer used in this application: one which was derived automatically from training data (the Arabic inferred gazetteer), and one which was created manually. There are also some other applications included which perform quite specific tasks (and can generally be ignored). For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 13.7.
19.1.5 Chinese Plugin [#]
The Chinese plugin contains a simple application for Chinese NE recognition (chinese.gapp). Simply load the application from the plugins/Lang_Chinese/resources directory. You do not need to load the plugin itself from the GATE Developer plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other applications to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).
19.1.6 Hindi Plugin [#]
The Hindi plugin (‘Lang_Hindi’) contains a set of resources for basic Hindi NE recognition which mirror the ANNIE resources but are customised to the Hindi language. You need to have the ANNIE plugin loaded first in order to load any of these PRs. With the Hindi plugin, you can create an application similar to ANNIE but replacing the ANNIE PRs with the default PRs from the plugin.
19.2 Flexible Exporter [#]
The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be found, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.
At load time, the following parameters can be set for the flexible exporter:
- includeFeatures - if set to true, features are included with the annotations exported; if false (the default status), they are not.
- useSuffixForDumpFiles - if set to true (the default status), the output files have the suffix defined in suffixForDumpFiles; if false, no suffix is defined, and the output file simply overwrites the existing file (but see the outputFileUrl runtime parameter for an alternative).
- suffixForDumpFiles - this defines the suffix if useSuffixForDumpFiles is set to true. By default the suffix is .gate.
The following runtime parameters can also be set (after the file has been selected for the application):
- annotationSetName - this enables the user to specify the name of the annotation set which contains the annotations to be exported. If no annotation set is defined, it will use the Default annotation set.
- annotationTypes - this contains a list of the annotations to be exported. By default it is set to Person, Location and Date.
- dumpTypes - this contains a list of names for the exported annotations. If the annotation name is to remain the same, this list should be identical to the list in annotationTypes. The list of annotation names must be in the same order as the corresponding annotation types in annotationTypes.
- outputDirectoryUrl - this enables the user to specify the export directory where the file is exported with its original name and an extension (provided as a parameter) appended at the end of the filename. Note that you can also save a whole corpus in one go.
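The way the naming parameters above combine can be sketched as follows. This is a hypothetical helper, not part of the GATE API; the method and parameter names are illustrative only:

```java
import java.nio.file.Paths;

public class DumpFileNamer {
    /** Sketches how the exporter might derive an output location from the
     *  parameters described above: the suffix is appended only when
     *  useSuffixForDumpFiles is true, and the result is placed in the
     *  chosen output directory (hypothetical helper, not the real code). */
    static String dumpLocation(String outputDir, String originalName,
                               boolean useSuffix, String suffix) {
        String name = useSuffix ? originalName + suffix : originalName;
        return Paths.get(outputDir, name).toString();
    }

    public static void main(String[] args) {
        // default behaviour: append the ".gate" suffix
        System.out.println(dumpLocation("/tmp/export", "news01.html", true, ".gate"));
        // suffix disabled: the output name matches the original
        System.out.println(dumpLocation("/tmp/export", "news01.html", false, ".gate"));
    }
}
```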
19.3 Annotation Set Transfer [#]
The Annotation Set Transfer allows copying or moving annotations to a new annotation set if they lie between the beginning and the end of an annotation of a particular type (the covering annotation). For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.
The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:
- inputASName - this defines the annotation set from which annotations will be transferred (copied or moved). If nothing is specified, the Default annotation set will be used.
- outputASName - this defines the annotation set to which the annotations will be transferred. The default value for this parameter is ‘Filtered’. If it is left blank the Default annotation set will be used.
- tagASName - this defines the annotation set which contains the annotation covering the relevant part of the document to be transferred. The default value for this parameter is ‘Original markups’. If it is left blank the Default annotation set will be used.
- textTagName - this defines the type of the annotation covering the annotations to be transferred. The default value for this parameter is ‘BODY’. If this is left blank, then all annotations from the inputASName annotation set will be transferred. If more than one covering annotation is found, the annotations covered by each of them will be transferred. If no covering annotation is found, the processing depends on the transferAllUnlessFound parameter (see below).
- copyAnnotations - this specifies whether the annotations should be moved or copied. The default value false will move annotations, removing them from the inputASName annotation set. If set to true the annotations will be copied.
- transferAllUnlessFound - this specifies what should happen if no covering annotation is found. The default value is true. In this case, all annotations will be copied or moved (depending on the setting of parameter copyAnnotations) if no covering annotation is found. If set to false, no annotation will be copied or moved.
For example, suppose we wish to perform named entity recognition on only the text covered by the BODY annotation from the Original Markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document, because these resources do not depend on any other annotations and we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokeniser and gazetteer are initially placed in the Default annotation set):
- inputASName: Default
- outputASName: Filtered
- tagASName: Original markups
- textTagName: BODY
- copyAnnotations: true or false (depending on whether we want to keep the Token and Lookup annotations in the Default annotation set)
- transferAllUnlessFound: true
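The core selection step the Annotation Set Transfer performs can be sketched in a few lines. This is a simplified stand-in (the real PR also handles features, multiple covering annotations and the copy/move distinction); the `Ann` record and `covered` method are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class TransferSketch {
    // Minimal stand-in for a GATE annotation: a type plus start/end offsets.
    record Ann(String type, long start, long end) {}

    /** Returns the annotations from 'input' whose span falls entirely
     *  within the covering annotation's span - the selection the
     *  Annotation Set Transfer performs before copying or moving. */
    static List<Ann> covered(List<Ann> input, Ann covering) {
        List<Ann> out = new ArrayList<>();
        for (Ann a : input) {
            if (a.start() >= covering.start() && a.end() <= covering.end()) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Ann body = new Ann("BODY", 100, 500);
        List<Ann> anns = List.of(
            new Ann("Token", 10, 15),     // in the header - left behind
            new Ann("Token", 120, 125),   // inside BODY - transferred
            new Ann("Lookup", 130, 140)); // inside BODY - transferred
        System.out.println(covered(anns, body).size()); // prints 2
    }
}
```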
19.4 Information Retrieval in GATE [#]
GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for ‘Bush’ will return documents with higher relevance, compared to a search in the content for the string ‘bush’. The current implementation is based on Lucene (available at http://jakarta.apache.org/lucene/), the most popular open source full-text search engine, but other implementations may be added in the future.
An Information Retrieval system is most often considered a system that accepts as input a set of documents (a corpus) and a query (a combination of search terms), and returns as output only those documents from the corpus which are considered relevant according to the query. Usually, in addition to the documents, a relevance measure (score) is returned for each document. There exist many relevance metrics, but in general documents which are considered more relevant, according to the query, are scored higher.
Figure 19.1 shows the results from running a query against an indexed corpus in GATE.
Information Retrieval systems usually perform some preprocessing on the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as in Table 19.1, where doc_i is a document from the corpus, term_j is a word that is considered important and representative for the document, and w_i,j is the weight assigned to the term in the document. There are many ways to define the term weight function, but it most often depends on the term's frequency in the document and in the whole corpus (i.e. the local and the global frequency). Note that the machine learning plugin described in Chapter 15 can produce such a document-term matrix (for a detailed description of the matrix produced, see Section 15.2.4).
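One common weighting scheme combining local and global frequency is tf.idf. The sketch below illustrates the idea of the weight w_i,j; it is one of many possible schemes and is not the formula any particular engine uses:

```java
import java.util.List;

public class TfIdfSketch {
    /** tf.idf weight of a term in one document: the local frequency (tf)
     *  scaled by a global rarity factor (idf = log of corpus size over
     *  the number of documents containing the term). Illustrative only. */
    static double tfIdf(String term, List<String> doc,
                        List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("government", "budget", "vote"),
            List.of("football", "match", "vote"),
            List.of("government", "government", "policy"));
        // "government" occurs twice in the third document (tf = 2)
        // and in 2 of the 3 documents (df = 2).
        System.out.println(tfIdf("government", corpus.get(2), corpus));
    }
}
```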
Note that not all of the words appearing in a document are considered terms. There are many words (called ‘stop words’) which are ignored, since they occur too often and are not representative enough; typical examples are articles and conjunctions. During the preprocessing phase which identifies such words, a form of stemming is usually also performed, in order to minimise the number of terms and to improve the retrieval recall. Various forms of the same word (e.g. ‘play’, ‘playing’ and ‘played’) are considered identical and conflated into a single term (here, probably ‘play’), whose occurrences are then counted together.
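The idea of conflating word forms can be sketched with a crude suffix stripper. This is a toy for illustration only; real IR systems use a proper algorithm such as the Porter stemmer:

```java
import java.util.List;

public class StemSketch {
    /** A crude suffix-stripping stemmer: removes a common inflectional
     *  suffix when enough of the word remains. Toy illustration of how
     *  'play', 'playing' and 'played' collapse to one term. */
    static String stem(String word) {
        for (String suffix : List.of("ing", "ed", "s")) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("playing")); // play
        System.out.println(stem("played"));  // play
        System.out.println(stem("play"));    // play
    }
}
```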
It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.
IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 10.1 for more details).
19.4.1 Using the IR Functionality in GATE
In order to run queries against a corpus, the latter should be ‘indexed’. The indexing process first processes the documents in order to identify the terms and their weights (stemming is performed too) and then creates the proper structures on the local filesystem. These file structures contain indexes that will be used by Lucene (the underlying IR engine) for the retrieval.
Once the corpus is indexed, queries may be run against it. The index may subsequently be removed, which also removes the corresponding structures from the local filesystem. Once the index is removed, queries cannot be run against the corpus.
Indexing the Corpus
In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in an RDBMS datastore (though support for the latter may be added in the future).
To index the corpus, follow these steps:
- Select the corpus from the resource tree (top-left pane) and from the context menu (right button click) choose ‘Index Corpus’. A dialogue appears that allows you to specify the index properties.
- In the index properties dialogue, specify the underlying IR system to be used (only Lucene is supported at present), the directory that will contain the index structures, and the set of properties that will be indexed such as document features, content, etc (the same properties will be indexed for each document in the corpus).
- Once the corpus is indexed, you may start running queries against it. Note that the directory specified for the index data should exist and be empty; otherwise an error will occur during index creation.
Querying the Corpus
To query the corpus, follow these steps:
- Create a SearchPR processing resource. All the parameters of SearchPR are runtime parameters, so they are set later.
- Create a pipeline application containing the SearchPR.
- Set the following SearchPR parameters:
- The corpus that will be queried.
- The query that will be executed.
- The maximum number of documents returned.
A query looks like the following:
{+/-}field1:term1 {+/-}field2:term2 … {+/-}fieldN:termN

where field is the name of an index field, such as one specified at index creation (the document content field is body), and term is a term that should appear in the field.
For example the query:
+body:government +author:CNN

will inspect the document content for the term ‘government’ (together with variations such as ‘governments’, etc.) and the index field named ‘author’ for the term ‘CNN’. The ‘author’ field is specified at index creation time, and is either a document feature or another document property.
- After the SearchPR is initialized, running the application executes the specified query over the specified corpus.
- Finally, the results are displayed (see Figure 19.1) after a double-click on the SearchPR processing resource.
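The simple query syntax above can be parsed into clauses as sketched below. This is illustrative only (the real parsing is done by Lucene's query parser) and assumes every clause carries a +/- prefix:

```java
import java.util.ArrayList;
import java.util.List;

public class QuerySketch {
    // One clause of the "{+/-}field:term" syntax.
    record Clause(boolean required, String field, String term) {}

    /** Splits a query of the form "+field1:term1 -field2:term2 ..." into
     *  clauses: a leading '+' marks a required term, '-' a prohibited one. */
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        for (String part : query.trim().split("\\s+")) {
            boolean required = part.startsWith("+");
            String body = part.substring(1);     // strip the +/- prefix
            int colon = body.indexOf(':');
            clauses.add(new Clause(required,
                                   body.substring(0, colon),
                                   body.substring(colon + 1)));
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<Clause> c = parse("+body:government +author:CNN");
        System.out.println(c.get(0).field() + "=" + c.get(0).term()); // body=government
    }
}
```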
Removing the Index
An index for a corpus may be removed at any time from the ‘Remove Index’ option of the context menu for the indexed corpus (right button click).
19.4.2 Using the IR API
The IR API within GATE Embedded makes it possible for corpora to be indexed, queried and results returned from any Java application, without using GATE Developer. The following sample indexes a corpus and runs a query against it.
// open a serial data store
SerialDataStore sds = (SerialDataStore)
    Factory.openDataStore("gate.persist.SerialDataStore",
                          "/tmp/datastore1");
sds.open();

// set an AUTHOR feature for the test document
Document doc0 =
    Factory.newDocument(new URL("file:/tmp/documents/doc0.html"));
doc0.getFeatures().put("author", "John Smith");

Corpus corp0 = Factory.newCorpus("TestCorpus");
corp0.add(doc0);

// store the corpus in the serial datastore
Corpus serialCorpus = (Corpus) sds.adopt(corp0, null);
sds.sync(serialCorpus);

// index the corpus - the content and the AUTHOR feature
IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;

DefaultIndexDefinition did = new DefaultIndexDefinition();
did.setIrEngineClassName(
    gate.creole.ir.lucene.LuceneIREngine.class.getName());
did.setIndexLocation("/tmp/index1");
did.addIndexField(new IndexField("content",
    new DocumentContentReader(), false));
did.addIndexField(new IndexField("author", null, false));
indexedCorpus.setIndexDefinition(did);

indexedCorpus.getIndexManager().createIndex();
// the corpus is now indexed

// search the corpus
Search search = new LuceneSearch();
search.setCorpus(indexedCorpus);

QueryResultList res = search.search("+content:government +author:John");

// iterate over the results
Iterator it = res.getQueryResults();
while (it.hasNext()) {
  QueryResult qr = (QueryResult) it.next();
  System.out.println("DOCUMENT_ID=" + qr.getDocumentID()
                     + ", score=" + qr.getScore());
}
19.5 Websphinx Web Crawler [#]
The plugin ‘Web_Crawler_Websphinx’ enables GATE to build a corpus from a web crawl. The crawler itself is Websphinx, a Java-based, multi-threaded web crawler that can be customised for any application.
N.B. If you are using this plugin via an IDE, you may need to make sure that the websphinx.jar file is on the IDE’s classpath, or add it to the IDE’s lib directory.
The basic idea is to be able to specify a source URL and a depth to build the initial corpus upon which further processing could be done. The PR itself provides a number of helpful features to set various parameters of the crawl.
19.5.1 Using the Crawler PR
In order to use the processing resource you first need to load the plugin using the plugin manager, then load the crawler from the list of processing resources. You also need to create a corpus in which to store the crawled documents. To run the crawler, create a simple pipeline (note: do not create a corpus pipeline) and add the crawl PR to the pipeline.
Once the crawl PR is created, there are a number of parameters that can be set (see also Figure 19.3).
- depth: the depth to which the crawl should proceed.
- dfs / bfs: if true, the crawler uses a depth-first strategy, visiting the nodes in depth-first order until the specified depth limit is reached; if false, it uses a breadth-first strategy, visiting the nodes in breadth-first order until the specified depth limit is reached.
- domain:
- SUBTREE: the crawler visits only the descendants of the page specified as the root for the crawl.
- WEB: the crawler visits all the pages on the web.
- SERVER: the crawler visits only the pages that are present on the server where the root page is located.
- max number of pages to be fetched
- outputCorpus: an instance of Corpus to be used for storing the crawled web pages.
- root: the starting URL from which the crawl begins.
- source: the corpus containing the documents from which the crawl must begin. This is useful when documents are first fetched by the Google PR and then need to be crawled to expand the web graph further. At any time, either the source or the root needs to be set.
Once the parameters are set, the crawl can be run and the documents fetched are added to the specified corpus. Figure 19.4 shows the crawled pages added to the corpus.
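The difference between the bfs and dfs settings can be sketched on a tiny link graph. This is an illustration of the visiting order only; Websphinx itself handles the actual fetching, threading and politeness:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class CrawlOrderSketch {
    // A tiny link graph standing in for the web: page -> outgoing links.
    static final Map<String, List<String>> LINKS = Map.of(
        "root", List.of("a", "b"),
        "a", List.of("a1", "a2"),
        "b", List.of("b1"),
        "a1", List.of(), "a2", List.of(), "b1", List.of());

    /** Visits pages up to 'depth' links from the root, taking new pages
     *  from the front of the frontier (breadth-first) or the back
     *  (depth-first), mirroring the crawler's bfs/dfs switch. */
    static List<String> crawl(String root, int depth, boolean bfs) {
        List<String> visited = new ArrayList<>();
        Deque<Map.Entry<String, Integer>> frontier = new ArrayDeque<>();
        frontier.add(Map.entry(root, 0));
        while (!frontier.isEmpty()) {
            var e = bfs ? frontier.pollFirst() : frontier.pollLast();
            visited.add(e.getKey());
            if (e.getValue() < depth) {
                for (String link : LINKS.get(e.getKey())) {
                    frontier.addLast(Map.entry(link, e.getValue() + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("root", 2, true));  // breadth-first order
        System.out.println(crawl("root", 2, false)); // depth-first order
    }
}
```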
N.B. You must use a simple Pipeline, not a Corpus Pipeline. If you wish to process the crawled documents, you must build a second, corpus pipeline. From GATE version 5.1 onwards, you can combine the two pipelines as follows. Build a simple Pipeline containing your web crawler; call this Pipeline Ps, and say you set the corpus on your crawler to C. Now build your processing corpus pipeline, which we will call Pc. Put the original pipeline Ps as its first PR, and set its corpus to be the same corpus as before, C.
19.6 Google Plugin [#]
The Google API is now integrated with GATE, and can be used as a PR-based plugin. This plugin (‘Web_Search_Google’) allows the user to query Google and build a document corpus that contains the search results returned by Google for the query. There is a limit of 1000 queries per day as set by Google. For more information about the Google API please refer to http://www.google.com/apis/. In order to use the Google PR, you need to register with Google to obtain a license key.
The Google PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find the different named entities that can be associated with a particular individual. In this example, the user could build the collection of documents by querying Google with the individual’s name, and then running ANNIE over the collection. This would annotate the results and show the different Organization, Location and other entities associated with the query.
19.6.1 Using the GooglePR
In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Google API License key. Please use the license key assigned to you by registering with Google.
Once the Google PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Google PR just initialized as above. There are a number of parameters to be set at runtime:
- corpus: The corpus used by the plugin to add or append documents from the Web.
- corpusAppendMode: If set to true, documents will be appended to the corpus. If set to false, pre-existing documents will be removed from the corpus before the documents newly fetched by the PR are added.
- limit: A limit on the results returned by the search. Default set to 10.
- pagesToExclude: This is an optional parameter. It is a list of URLs to be excluded from the search.
- query: The query sent to Google. It is in the format accepted by Google.
Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus as shown in Figure 19.5.
19.7 Yahoo Plugin [#]
The Yahoo API is now integrated with GATE, and can be used as a PR-based plugin. This plugin, ‘Web_Search_Yahoo’, allows the user to query Yahoo and build a document corpus that contains the search results returned by Yahoo for the query. For more information about the Yahoo API please refer to http://developer.yahoo.com/search/. In order to use the Yahoo PR, you need to obtain an application ID.
The Yahoo PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find the different named entities that can be associated with a particular individual. In this example, the user could build a collection of documents by querying Yahoo with the individual’s name and then running ANNIE over the collection. This would annotate the results and show the different Organization, Location and other entities that are associated with the query.
19.7.1 Using the YahooPR
In order to use the PR, you first need to load the plugin using the GATE Developer plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Yahoo Application ID. Please use the license key assigned to you by registering with Yahoo.
Once the Yahoo PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Yahoo PR just initialized as above. There are a number of parameters to be set at runtime:
- corpus: The corpus used by the plugin to add or append documents from the Web.
- corpusAppendMode: If set to true, documents will be appended to the corpus. If set to false, pre-existing documents will be removed from the corpus before the documents newly fetched by the PR are added.
- limit: A limit on the results returned by the search. Default set to 10.
- pagesToExclude: This is an optional parameter. It is a list of URLs to be excluded from the search.
- query: The query sent to Yahoo. It is in the format accepted by Yahoo.
Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus.
19.8 WordNet in GATE [#]
At present GATE supports only WordNet 1.6, so in order to use WordNet in GATE, you must first install WordNet 1.6 on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing a special XML file that is used internally by JWNL. This file describes the location of your local copy of the WordNet 1.6 index files. An example of this wn-config.xml file is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
  <version publisher="Princeton" number="1.6" language="en"/>
  <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">
    <param name="morphological_processor"
           value="net.didion.jwnl.dictionary.DefaultMorphologicalProcessor"/>
    <param name="file_manager"
           value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">
      <param name="file_type"
             value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>
      <param name="dictionary_path" value="e:\wn16\dict"/>
    </param>
  </dictionary>
  <dictionary_element_factory
      class="net.didion.jwnl.princeton.data.PrincetonWN16DictionaryElementFactory"/>
  <resource class="PrincetonResource"/>
</jwnl_properties>
All you have to do is to replace the value of the dictionary_path parameter to point to your local installation of WordNet 1.6.
After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE Developer, load the WordNet plugin via the plugins menu. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).
Once WordNet is loaded in GATE Developer, the well-known interface of WordNet will appear. You can search WordNet by typing a word in the box next to the label ‘Search Word’ and then pressing ‘Search’. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word ‘play’, the buttons ‘Noun’, ‘Verb’ and ‘Adjective’ are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms, meronyms for nouns or verb groups, and cause for verbs, etc. Selecting an item from the menu will display the results in the window below.
More information about WordNet can be found at http://wordnet.princeton.edu/
More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet
An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/GateExamples/doc/index.html.
19.8.1 The WordNet API
GATE Embedded offers a set of classes that can be used to access the WordNet 1.6 Lexical Base. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 19.8. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:
- WordNet: the main WordNet class. Provides methods for getting the synsets of a lemma, for accessing the unique beginners, etc.
- Word: offers access to the word’s lemma and senses
- WordSense: gives access to the synset, the word, POS and lexical relations.
- Synset: gives access to the word senses (synonyms) in the synset, the semantic relations, POS, etc.
- Verb: gives access to the verb frames (not working properly at present)
- Adjective: gives access to the adj. position (attributive, predicative, etc.).
- Relation: abstract relation, giving access to its type, symbol, inverse relation, and the set of POS tags to which it is applicable.
- LexicalRelation
- SemanticRelation
- VerbFrame
19.9 Kea - Automatic Keyphrase Detection [#]
Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.
This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.
In order to use Kea in GATE Developer, the ‘Keyphrase_Extraction_Algorithm’ plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the ‘KEA Keyphrase Extractor’ (a processing resource) and the ‘KEA Corpus Importer’ (a visual resource associated with the PR).
19.9.1 Using the ‘KEA Keyphrase Extractor’ PR
Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the ‘KEA Corpus Importer’ tool. The usage of this tool is presented in a subsection below.
Once an annotated corpus is obtained, the ‘KEA Keyphrase Extractor’ PR can be used to build a model:
- load a ‘KEA Keyphrase Extractor’
- create a new ‘Corpus Pipeline’ controller.
- set the corpus for the controller
- set the ‘trainingMode’ parameter for the PR to ‘true’
- run the application.
After these steps, the Kea PR contains a trained model. This can be used immediately by switching the ‘trainingMode’ parameter to ‘false’ and running the PR over the documents that need to be annotated with keyphrases. Another possibility is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the ‘Save model’ option.
When a previously built model is available, the training procedure does not need to be repeated; the existing model can be loaded into memory by selecting the ‘Load model’ option in the PR’s pop-up menu.
The Kea PR uses several parameters as seen in Figure 19.9:
- document
- The document to be processed.
- inputAS
- The input annotation set. This parameter is only relevant when the PR is running in training mode and it specifies the annotation set containing the keyphrase annotations.
- outputAS
- The output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the ‘trainingMode’ parameter is set to false) and it specifies the annotation set where the generated keyphrase annotations will be saved.
- minPhraseLength
- the minimum length (in number of words) for a keyphrase.
- minNumOccur
- the minimum number of occurrences of a phrase for it to be a keyphrase.
- maxPhraseLength
- the maximum length of a keyphrase.
- phrasesToExtract
- how many different keyphrases should be generated.
- keyphraseAnnotationType
- the type of annotations used for keyphrases.
- dissallowInternalPeriods
- whether internal periods should be disallowed.
- trainingMode
- if ‘true’ the PR is running in training mode; otherwise it is running in application mode.
- useKFrequency
- should the K-frequency be used.
19.9.2 Using Kea Corpora
The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the ‘KEA Corpus Importer’ tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR’s name in the resources tree and then selecting the ‘KEA Corpus Importer’ tab, see Figure 19.10.
The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.
The user needs to specify a few values:
- Source Directory
- the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.
- Extension for text files
- the extension used for text files (by default .txt).
- Extension for keyphrase files
- the extension for the files listing keyphrases.
- Encoding for input files
- the encoding to be used when reading the files.
- Corpus name
- the name for the GATE corpus that will be created.
- Output annotation set
- the name for the annotation set that will contain the keyphrases read from the input files.
- Keyphrase annotation type
- the type for the generated annotations.
19.10 Ontotext JapeC Compiler [#]
Note: the JapeC compiler does not currently support the new JAPE language features introduced in July–September 2008. If you need to use negation, the @length and @string accessors, the contextual operators within and contains, or any comparison operators other than ==, then you will need to use the standard JAPE transducer instead of JapeC.
JapeC is an alternative implementation of the JAPE language which works by compiling JAPE grammars into Java code. Compared to the standard implementation, these compiled grammars can be several times faster to run. At Ontotext, a modified version of the ANNIE sentence splitter using compiled grammars has been found to run up to five times as fast as the standard version. The compiler can be invoked manually from the command line, or used through the ‘Ontotext Japec Compiler’ PR in the Jape_Compiler plugin.
The ‘Ontotext Japec Transducer’ (com.ontotext.gate.japec.JapecTransducer) is a processing resource designed as an alternative to the original JAPE transducer. You can simply replace gate.creole.Transducer with com.ontotext.gate.japec.JapecTransducer in your GATE application and it should work as expected.
The Japec transducer takes the same parameters as the standard JAPE transducer:
- grammarURL
- the URL from which the grammar is to be loaded. Note that the Japec Transducer will only work on file: URLs. Also, the alternative binaryGrammarURL parameter of the standard transducer is not supported.
- encoding
- the character encoding used to load the grammars.
- ontology
- the ontology used for ontology-aware transduction.
Its runtime parameters are likewise the same as those of the standard transducer:
- document
- the document to process.
- inputASName
- name of the AnnotationSet from which input annotations to the transducer are read.
- outputASName
- name of the AnnotationSet to which output annotations from the transducer are written.
The Japec compiler itself is written in Haskell. Compiled binaries are provided for Windows, Linux (x86) and Mac OS X (PowerPC), so no Haskell interpreter is required to run Japec on these platforms. For other platforms, or if you make changes to the compiler source code, you can build the compiler yourself using the Ant build file in the Jape_Compiler plugin directory. You will need to install the latest version of the Glasgow Haskell Compiler1 and associated libraries. The japec compiler can then be built by running:
../../bin/ant japec.clean japec
from the Jape_Compiler plugin directory.
19.11 Annotation Merging Plugin [#]
If we have annotations about the same subject on the same document from different annotators, we may need to merge the annotations.
This plugin implements two approaches for annotation merging.
MajorityVoting takes a parameter numMinK and selects the annotation on which at least numMinK annotators agree. If two or more merged annotations have the same span, then the annotation with the most supporters is kept and other annotations with the same span are discarded.
MergingByAnnotatorNum selects, from the annotations with the same span, the one which the majority of the annotators support. Note that if one annotator did not create an annotation with the particular span, we count that as one non-support for the annotation with that span. If it turns out that the majority of the annotators did not support the annotation with that span, then no annotation with that span is put into the merged annotations.
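The MajorityVoting rule can be illustrated with a small self-contained sketch. This is not the plugin’s actual code: spans are encoded here as hypothetical ‘start-end’ strings, and each annotator contributes a map from span to label.

```java
import java.util.*;

// Illustrative sketch of majority-voting annotation merging.
// A (span, label) pair survives only if at least numMinK annotators
// produced it; among competing labels on the same span, the label
// with the most supporters wins.
class MajorityVotingSketch {

    public static Map<String, String> merge(List<Map<String, String>> annotators, int numMinK) {
        // count supporters for each (span, label) pair
        Map<String, Map<String, Integer>> votes = new HashMap<>();
        for (Map<String, String> ann : annotators) {
            for (Map.Entry<String, String> e : ann.entrySet()) {
                votes.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                     .merge(e.getValue(), 1, Integer::sum);
            }
        }
        // keep, for each span, the best-supported label if it has enough votes
        Map<String, String> merged = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> spanVotes : votes.entrySet()) {
            String bestLabel = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> lv : spanVotes.getValue().entrySet()) {
                if (lv.getValue() > bestCount) {
                    bestCount = lv.getValue();
                    bestLabel = lv.getKey();
                }
            }
            if (bestCount >= numMinK) {
                merged.put(spanVotes.getKey(), bestLabel);
            }
        }
        return merged;
    }
}
```

For example, with numMinK set to 2, a span annotated ‘PER’ by two annotators and ‘ORG’ by a third is merged as ‘PER’, while a span supported by only one annotator is discarded.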
The annotation merging methods are available via the Annotation Merging plugin. The plugin can be used as a PR in a pipeline or corpus pipeline. To use the PR, each document in the pipeline or corpus pipeline should have the annotation sets for merging. The annotation merging PR has no loading parameters, but has several run-time parameters, explained further below.
The annotation merging methods are implemented in the GATE API, and are available in GATE Embedded as described in Section 7.17.
- annSetOutput: the annotation set in the current document for storing the merged annotations. You should not use an existing annotation set, as the contents may be deleted or overwritten.
- annSetsForMerging: the annotation sets in the document for merging. It is an optional parameter. If it is not assigned a value, the annotation sets for merging are all the annotation sets in the document except the default annotation set. If specified, it is a sequence of the names of the annotation sets for merging, separated by ‘;’. For example, the value ‘a-1;a-2;a-3’ represents three annotation sets, ‘a-1’, ‘a-2’ and ‘a-3’.
- annTypeAndFeats: the annotation types in the annotation sets for merging. It is an optional parameter. For each type specified, it may also specify an annotation feature of that type, whose values define the labels of the annotation type. If the parameter is not given a value, the annotation types for merging are all the types in the annotation sets for merging, and no annotation feature is specified for any type. If the parameter is specified, it is a sequence of annotation type names separated by ‘;’. If an annotation type has a particular feature indicating the label of the annotation, the feature’s name immediately follows the type’s name, separated from it by ‘->’. For example, the value ‘SENT->senRel;OPINION_OPR;OPINION_SRC->type’ specifies three annotation types, ‘SENT’, ‘OPINION_OPR’ and ‘OPINION_SRC’, with the annotation features ‘senRel’ and ‘type’ for the types SENT and OPINION_SRC respectively, and no feature for the type OPINION_OPR.
- keepSourceForMergedAnnotations: should source annotations be kept in the annSetsForMerging annotation sets when merged? True by default.
- mergingMethod: specifies the method used for merging. Currently it has two values MajorityVoting and MergingByAnnotatorNum, referring to the two merging methods described above, respectively.
- minimalAnnNum: specifies the minimal number of annotators who must agree on an annotation in order for it to be put into the merged set; it is needed by the merging method MergingByAnnotatorNum. If the value of the parameter is smaller than 1, it is set to 1. If the value is bigger than the total number of annotation sets for merging, it is set to that total number. If no value is assigned in the GUI, the default value 1 is used. Note that the parameter does not have any effect on the other merging method, MajorityVoting.
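To make the ‘;’-separated parameter formats above concrete, here is a small parsing sketch. The helper names are hypothetical; the plugin performs this kind of parsing internally.

```java
import java.util.*;

// Hypothetical helpers showing how the annSetsForMerging and
// annTypeAndFeats parameter values decompose; not the plugin's code.
class MergingParamsSketch {

    // 'a-1;a-2;a-3' -> [a-1, a-2, a-3]
    public static List<String> parseAnnSets(String value) {
        return Arrays.asList(value.split(";"));
    }

    // 'SENT->senRel;OPINION_OPR' -> {SENT=senRel, OPINION_OPR=null}
    // A type without an '->' part has no label feature (null here).
    public static Map<String, String> parseTypeAndFeats(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String item : value.split(";")) {
            int arrow = item.indexOf("->");
            if (arrow >= 0) {
                result.put(item.substring(0, arrow), item.substring(arrow + 2));
            } else {
                result.put(item, null);
            }
        }
        return result;
    }
}
```

Applied to the example value ‘SENT->senRel;OPINION_OPR;OPINION_SRC->type’, this yields the three types with their label features ‘senRel’ and ‘type’ for SENT and OPINION_SRC, and no feature for OPINION_OPR.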
19.12 Chinese Word Segmentation [#]
Unlike English, Chinese text does not have a symbol (or delimiter) such as a blank space to explicitly separate a word from the surrounding words. Therefore, for automatic Chinese text processing, we may need a system to recognise the words in Chinese text, a problem known as Chinese word segmentation. The plugin described in this section performs the task of Chinese word segmentation. It is based on our work using the Perceptron learning algorithm for the Chinese word segmentation task of the Sighan 2005 bakeoff2 [Li et al. 05c]. Our Perceptron-based system achieved very good performance in the Sighan-05 task.
The plugin is called Segmenter_Chinese and is available in the GATE distribution. The corresponding processing resource’s name is Chinese Segmenter PR. Once you load the PR into GATE, you may put it into a Pipeline application. Note that it does not process a corpus of documents, but a directory of documents provided as a parameter (see the description of parameters below). The plugin can be used to learn a model from segmented Chinese text as training data. It can also use the learned model to segment Chinese text. The plugin can use different learning algorithms to learn different models, and it can deal with different character encodings for Chinese text, such as UTF-8, GB2312 or BIG5. These options can be selected by setting the run-time parameters of the plugin.
The plugin has five run-time parameters, which are described in the following.
- learningAlg is a String variable, which specifies the learning algorithm used for producing the model. Currently it has two values, PAUM and SVM, representing the two popular learning algorithms Perceptron and SVM, respectively. The default value is PAUM.
Generally speaking, SVM may perform better than Perceptron, in particular on small training sets. On the other hand, Perceptron’s learning is much faster than SVM’s. Hence, if you have a small training set, you may want to use SVM to obtain a better model. However, if you have a big training set, which is typical for the Chinese word segmentation task, you may want to use Perceptron for learning, because SVM’s learning may take too long. In addition, with a big training set, the performance of the Perceptron model is quite similar to that of the SVM model. See [Li et al. 05c] for an experimental comparison of SVM and Perceptron on Chinese word segmentation.
- learningMode determines the mode in which the plugin is used: either learning a model from training data, or applying a learned model to segment Chinese text. Accordingly it has two values, LEARNING and SEGMENTING. The default value is SEGMENTING, meaning segmenting the Chinese text.
Note that you first need to learn a model before you can use it to segment text. Several models trained on the Sighan-05 bakeoff data are available for this plugin, which you can use to segment your Chinese text. More details about the provided models are given below.
- modelURL specifies a URL referring to a directory containing the model. If the plugin is in the LEARNING runmode, the learned model will be put into this directory. If it is in the SEGMENTING runmode, the plugin will use the model stored in the directory to segment the text. The models learned from the Sighan-05 bakeoff training data are discussed below.
- textCode specifies the encoding of the text used. For example it can be UTF-8, BIG5, GB2312 or any other encoding for Chinese text. Note that, when you segment some Chinese text using a learned model, the Chinese text should use the same encoding as the training text used to obtain the model.
- textFilesURL specifies a URL referring to a directory containing the Chinese documents. All the documents contained in this directory (but not those in any sub-directories) will be used as input data. In the LEARNING runmode, those documents contain the segmented Chinese text used as training data. In the SEGMENTING runmode, the text in those documents will be segmented. The segmented text will be stored in the corresponding documents in a sub-directory called segmented.
The following PAUM models are available for the plugin and can be downloaded from the links below. They were learned using the PAUM learning algorithm from the corpora provided for the Sighan-05 bakeoff task.
- the PAUM model learned from PKU training data, using the PAUM learning algorithm and the UTF-8 encoding, can be downloaded from http://www.gate.ac.uk/resources/chineseSegmentation/model-paum-pku-utf8.zip.
- the PAUM model learned from PKU training data, using the PAUM learning algorithm and the GB2312 encoding, can be downloaded from http://www.gate.ac.uk/resources/chineseSegmentation/model-paum-pku-gb.zip.
- the PAUM model learned from AS training data, using the PAUM learning algorithm and the UTF-8 encoding, can be downloaded from http://www.gate.ac.uk/resources/chineseSegmentation/model-as-utf8.zip.
- the PAUM model learned from AS training data, using the PAUM learning algorithm and the BIG5 encoding, can be downloaded from http://www.gate.ac.uk/resources/chineseSegmentation/model-as-big5.zip.
As you can see, those models were learned using different training data, and different Chinese text encodings of the same training data. The PKU training data are news articles published in mainland China and use simplified Chinese, while the AS training data are news articles published in Taiwan and use traditional Chinese. If your text is in simplified Chinese, you can use the models trained on the PKU data. If your text is in traditional Chinese, you need to use the models trained on the AS data. If your data are in GB2312 or a compatible encoding, you need to use the model trained on the corpus in GB2312 encoding.
Note that the segmented Chinese text (either used as training data or produced by this plugin) uses blank spaces to separate a word from its surrounding words. Hence, if your data are in Unicode such as UTF-8, you can use the GATE Unicode Tokeniser to process the segmented text and add Token annotations to your text to represent the Chinese words. Once you have the annotations for all the Chinese words, you can perform further processing such as POS tagging and named entity recognition.
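As a rough illustration of what a tokeniser does with the segmented output (illustrative only; in practice you would use the GATE Unicode Tokeniser), the following sketch derives the character offsets of each space-separated word, which is the information a Token annotation carries:

```java
import java.util.*;

// Illustrative only: derive "start-end:word" strings for each word in
// whitespace-segmented text, as a tokeniser would when creating Token
// annotations over the segmented output.
class SegmentedTextSketch {

    public static List<String> tokenSpans(String segmented) {
        List<String> spans = new ArrayList<>();
        int i = 0;
        while (i < segmented.length()) {
            // skip the separating whitespace
            if (Character.isWhitespace(segmented.charAt(i))) { i++; continue; }
            int start = i;
            // consume one word
            while (i < segmented.length() && !Character.isWhitespace(segmented.charAt(i))) i++;
            spans.add(start + "-" + i + ":" + segmented.substring(start, i));
        }
        return spans;
    }
}
```

For instance, the segmented text ‘我们 喜欢 北京’ yields words at character offsets 0-2, 3-5 and 6-8.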
19.13 Copying Annotations between Documents [#]
Sometimes a document has two copies, each of which was annotated by different annotators for the same task. We may want to copy the annotations from one copy into the other, either to save resources, or so that we can process them with some other plugin, such as annotation merging or IAA. The Copy_Annots_Between_Docs plugin does exactly this.
The plugin is available with the GATE distribution. When loading the plugin into GATE, it is represented as a processing resource, Copy Anns to Another Doc PR. You need to put the PR into a Corpus Pipeline to use it. The plugin does not have any initialisation parameters. It has several run-time parameters, which specify the annotations to be copied, the source documents and target documents. In detail, the run-time parameters are:
- sourceFilesURL specifies the directory containing the source documents. The source documents must be GATE XML documents. The plugin copies the annotations from these source documents to the target documents.
- inputASName specifies the name of the annotation set in the source documents. Whole annotations or parts of annotations in the annotation set will be copied.
- annotationTypes specifies one or more annotation types in the annotation set inputASName which will be copied into target documents. If no value is given, the plugin will copy all annotations in the annotation set.
- outputASName specifies the name of the annotation set in the target documents, into which the annotations will be copied. If there is no such annotation set in the target documents, the annotation set will be created automatically.
The Corpus parameter of the Corpus Pipeline application containing the plugin specifies a corpus which contains the target documents. Given one (target) document in the corpus, the plugin tries to find a source document in the source directory specified by the parameter sourceFilesURL, according to the similarity of the names of the source and target documents. The similarity of two file names is calculated by comparing the two strings of names from the start to the end of the strings. Two names have greater similarity if they share more characters from the beginning of the strings. For example, suppose two target documents have the names aabcc.xml and abcab.xml and three source files have names abacc.xml, abcbb.xml and aacc.xml, respectively. Then the target document aabcc.xml has the corresponding source document aacc.xml, and abcab.xml has the corresponding source document abcbb.xml.
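The name-matching rule can be sketched as follows (a simplified, hypothetical helper, not the plugin’s actual code): similarity is measured as the length of the common prefix, and each target name is paired with the source name sharing the longest prefix.

```java
// Illustrative sketch of the name-similarity rule used to pair target
// documents with source documents: similarity is the length of the
// common prefix of the two file names.
class NameMatchSketch {

    public static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // returns the source name most similar to the given target name
    public static String bestSource(String target, String[] sources) {
        String best = null;
        int bestLen = -1;
        for (String s : sources) {
            int len = commonPrefixLength(target, s);
            if (len > bestLen) { bestLen = len; best = s; }
        }
        return best;
    }
}
```

On the example above, the target aabcc.xml shares a two-character prefix with aacc.xml but only one character with the other sources, so aacc.xml is chosen; likewise abcab.xml shares a three-character prefix with abcbb.xml.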
19.14 OpenCalais Plugin [#]
OpenCalais provides a web service for semantic annotation of text. The user submits a document to the web service, which returns entity and relations annotations in RDF, JSON or some other format. Typically, users integrate OpenCalais annotation of their web pages to provide additional links and ‘semantic functionality’. OpenCalais can be found at http://www.opencalais.com
The GATE OpenCalais PR submits a GATE document to the OpenCalais web service, and adds the annotations from the OpenCalais response as GATE annotations in the GATE document. It therefore provides OpenCalais semantic annotation functionality within GATE, for use by other PRs.
The PR only supports OpenCalais entities, not relations - although this should be straightforward for a competent Java programmer to add. Each OpenCalais entity is represented in GATE as an OpenCalais annotation, with features as given in the OpenCalais documentation.
The PR can be loaded with the CREOLE plugin manager dialog, from the creole directory in the GATE distribution, gate/plugins/Tagger_OpenCalais. In order to use the PR, you will need to have an OpenCalais account and request an OpenCalais service key. You can do this from the OpenCalais web site at http://www.opencalais.com. Provide your service key as an initialisation parameter when you create a new OpenCalais PR in GATE. OpenCalais places restrictions on the number of requests you can make to their web service. See the OpenCalais web page for details.
Initialisation parameters are:
- openCalaisURL This is the URL of the OpenCalais REST service, and should not need to be changed - unless OpenCalais moves it!
- licenseID Your OpenCalais service key. This has to be requested from OpenCalais and is specific to you.
Various runtime parameters are available from the OpenCalais API, and are named the same as in that API. See the OpenCalais documentation for further details.
19.15 LingPipe Plugin [#]
LingPipe is a suite of Java libraries for the linguistic analysis of human language3. We have provided a plugin called ‘LingPipe’ with wrappers for some of the resources available in the LingPipe library. In order to use these resources, please load the ‘LingPipe’ plugin. Currently, we have integrated the following five processing resources.
- LingPipe Tokenizer PR
- LingPipe Sentence Splitter PR
- LingPipe POS Tagger PR
- LingPipe NER PR
- LingPipe Language Identifier PR
Please note that most of the resources in the LingPipe library allow learning of new models. However, in this version of the GATE plugin for LingPipe, we have only integrated the application functionality; you will need to learn new models with LingPipe outside of GATE. We have provided some example models under the ‘resources’ folder which were downloaded from LingPipe’s website. For more information on licensing issues related to the use of these models, please refer to the licensing terms under the LingPipe plugin directory.
The LingPipe system can be loaded from the GATE GUI by simply selecting the ‘Load LingPipe system’ menu item under the ‘File’ menu. This is similar to loading the ANNIE application with default values.
19.15.1 LingPipe Tokenizer PR [#]
As the name suggests, this PR tokenizes document text and identifies the boundaries of tokens. Each token is annotated with an annotation of type ‘Token’. Every annotation has a feature called ‘length’ that gives the length of the word in characters. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Token annotations.
19.15.2 LingPipe Sentence Splitter PR [#]
As the name suggests, this PR splits document text into sentences. It identifies sentence boundaries and annotates each sentence with an annotation of type ‘Sentence’. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Sentence annotations.
19.15.3 LingPipe POS Tagger PR [#]
The LingPipe POS Tagger PR tags individual tokens with their respective parts of speech. This PR requires a model, which it then uses to tag the tokens; an example model is provided under the ‘resources’ folder of this plugin, and it must be provided at initialization time. It is a prerequisite for this PR that the document has been processed with a tokeniser and a sentence splitter; in other words, it expects annotations of type ‘Token’ and ‘Sentence’ to be available in the document. The tokeniser and sentence splitter can be any such PRs from GATE, and need not be the LingPipe variants. This PR adds a feature called ‘category’ to each token.
Below we list the runtime parameters for this PR.
- inputASName: This is the name of the annotation set with ‘Token’ and ‘Sentence’ annotations in it.
- applicationMode: The POS tagger can be applied to the text in three different modes.
- FIRSTBEST: In this case, the POS tagger suggests one tag for each token that is best according to its calculations.
- CONFIDENCE: This is the same as FIRSTBEST, except that it also adds a feature ‘score’ with the actual calculated score for the tag that is assigned to the token.
- NBEST: In this case, the POS tagger suggests n best tags for each token. The default value is set to 5. In other words, it suggests five best tags for each token.
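The difference between the modes can be sketched over a hypothetical table of tag scores (illustrative only; the real scores come from the LingPipe model): FIRSTBEST keeps the single highest-scoring tag, while NBEST keeps the n highest-scoring tags in descending score order.

```java
import java.util.*;
import java.util.stream.Collectors;

// Illustrative only: given a token's per-tag scores, select tags the way
// the FIRSTBEST and NBEST application modes do.
class TaggerModesSketch {

    // the single tag with the highest score
    public static String firstBest(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    // the n highest-scoring tags, best first
    public static List<String> nBest(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

CONFIDENCE mode corresponds to firstBest plus recording the winning score as the ‘score’ feature.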
19.15.4 LingPipe NER PR [#]
The LingPipe NER PR is used for named entity recognition. The PR recognizes entities such as Persons, Organizations and Locations in the text. This PR requires a model, which it then uses to classify text as different entity types; an example model is provided under the ‘resources’ folder of this plugin, and it must be provided at initialization time. As with other PRs, the user needs to provide the name of the annotation set where the PR should output annotations.
19.15.5 LingPipe Language Identifier PR [#]
As the name suggests, this PR is useful for identifying the language of a document. This PR requires a model file, which it then uses to identify the language of the document. An example model is provided under the ‘resources’ folder of this plugin; it must be provided at initialization time. Unlike other PRs, which produce annotations, this PR adds a document feature. The name of the document feature can be specified as a runtime parameter. More information on how many languages are supported by the PR can be found at the following URL: http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
19.16 OpenNLP Plugin [#]
The OpenNLP system can be loaded from the GATE GUI by simply selecting the ‘Load OpenNLP system’ menu item under the ‘File’ menu. This is similar to loading the ANNIE application with default values.
19.17 Inter Annotator Agreement
The IAA plugin, “Inter_Annotator_Agreement”, computes interannotator agreement measures for various tasks. For named entity annotations, it computes the F-measures, namely Precision, Recall and F1, for two or more annotation sets. For text classification tasks, it computes Cohen’s kappa and some other IAA measures which are more suitable than the F-measures for the task. This plugin is fully documented in Section 10.5. Chapter 10 introduces various measures of interannotator agreement and describes a range of tools provided in GATE for calculating them.
19.18 Balanced Distance Metric Computation
The BDM (balanced distance metric) measures the closeness of two concepts in an ontology or taxonomy [Maynard 05, Maynard et al. 06]. It is a real number between 0 and 1. The closer the two concepts are in an ontology, the greater their BDM score is. The plugin, “Ontology_BDM_Computation”, is described more fully in Section 10.6.
19.19 Schema Annotation Editor
The plugin ‘Schema_Annotation_Editor’ constrains the annotation editor to permitted types. See Section 3.4.6 for more information.
1GHC version 6.4.1 was used to build the supplied binaries for Windows, Linux and Mac
2See http://www.sighan.org/bakeoff2005/ for the Sighan-05 task