Log in Help
Print
Homereleasesgate-5.0-build3244-ALLdoctao 〉 splitch9.html
 

Chapter 9
(More CREOLE) Plugins [#]

For the previous reader was none other than myself. I had already read this book long ago.

The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf.

But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, necause as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.

Three Stories and a Reflection, Patrick Suskind, 1995 (pp. 82, 86).

This chapter describes additional CREOLE resources which do not form part of ANNIE.

9.1 Document Reset [#]

The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the one containing the document format analysis (Original Markups). An optional parameter, keepOriginalMarkupsAS, allows users to decide whether to keep the Original Markups AS or not while reseting the document. This resource is normally added to the beginning of an application, so that a document is reset before an application is rerun on that document.

9.2 Verb Group Chunker [#]

The rule-based verb chunker is based on a number of grammars of English [Cobuild 99Azar 89]. We have developed 68 rules for the identification of non recursive verb groups. The rules cover finite (’is investigating’), non-finite (’to investigate’), participles (’investigated’), and special verb constructs (’is going to investigate’). All the forms may include adverbials and negatives. The rules have been implemented in JAPE. The finite state analyser produces an annotation of type ’VG’ with features and values that encode syntactic information (’type’, ’tense’, ’voice’, ’neg’, etc.). The rules use the output of the POS tagger as well as information about the identity of the tokens (e.g. the token ’might’ is used to identify modals).

The grammar for verb group identification can be loaded as a Jape grammar into the GATE architecture and can be used in any application: the module is domain independent.

9.3 Noun Phrase Chunker [#]

The NP Chunker application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution) which attempts to insert brackets marking noun phrases in text which have been marked with POS tags in the same format as the output of Eric Brill’s transformational tagger. The output from this version should be identical to the output of the oringinal C++/Perl version released by Ramshaw and Marcus.

For more information about baseNP structures and the use of tranformation-based learning to derive them, see [Ramshaw & Marcus 95].

9.3.1 Differences from the Original

The major difference is the assumption is made that if a POS tag is not in the mapping file then it is tagged as ’I’. The original version simply failed if an unknown POS tag was encountered. When using the GATE wrapper the chunk tag can be changed from ’I’ to any other legal tag (B or O) by setting the unknownTag parameter.

9.3.2 Using the Chunker

The Chunker requires the Creole plugin ”NP_Chunking” to be loaded. The two loadtime parameters are simply urls pointing at the POS tag dictionary and the rules file, which should be set automatically. There are five runtime parameters which should be set prior to executing the chunker.

The chunker requires the following PRs to have been run first: tokeniser, sentence splitter, POS tagger.

9.4 OntoText Gazetteer [#]

The OntoText Gazetteer is a Natural Gazetteer, implemented from the OntoText Lab (http://www.ontotext.com/). Its implementaion is based on simple lookup in several java.util.HashMap, and is inspired by the strange idea of Atanas Kiryakov, that searching in HashMaps will be faster than a search in a Finite State Machine (FSM).

Here follows a description of the algorithm that lies behind this implementation:

Every phrase i.e. every list entry is separated into several parts. The parts are determined by the whitespaces lying among them. e.g. the phrase : ”form is emptiness” has three parts : form, is & emptiness. There is also a list of HashMaps: mapsList which has as many elements as the longest (in terms of ”count of parts”) phrase in the lists. So the first part of a phrase is placed in the first map. The first part + space + second part is placed in the second map, etc. The full phrase is placed in the appropriate map, and a reference to a Lookup object is attached to it.

On first sight it seems that this algorithm is certainly much more memory-consuming than a finite state machine (FSM) with the parts of the phrases as transitions, but this is actually not so important since the average length of the phrases (in parts) in the lists is 1.1. On the other hand, one advantage of the algorithm is that, although unconventional, on average it takes four times less memory and works three times faster than an optimized FSM implementation.

The lookup part is implemented in execute() so a lot of tokenization takes place there. After defining the candidates for phrase-parts, we build a candidate phrase and try to look it up in the maps (in which map again depends on the count of parts in the current candidate phrase).

9.4.1 Prerequisites

The phrases to be recognised should be listed in a set of files, one for each type of occurrence (as for the standard gazetteer).

The gazetteer is built with the information from a file that contains the set of lists (which are files as well) and the associated type for each list. The file defining the set of lists should have the following syntax: each list definition should be written on its own line and should contain:

The elements of each definition are separated by ”:”. The following is an example of a valid definition:

personmale.lst:person:male:english

Each file named in the lists definition file is just a list containing one entry per line.

When this gazetter is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes specified in the definition file.

9.4.2 Setup

In order to use this gazetteer from within GATE the following should reside in the creole setup file (creole.xml):

    <RESOURCE>  
      <NAME>OntoText Gazetteer</NAME>  
      <CLASS>com.ontotext.gate.gazetteer.NaturalGazetteer</CLASS>  
      <COMMENT>A list lookup component. for documentation please refer to  
(www.ontotext.com/gate/gazetteer/documentation/index.html). For licence  
information please refer to  
(www.ontotext.com/gate/gazetteer/documentation/licence.ontotext.html) or to  
licence.ontotext.html in the lib folder of  
GATE</COMMENT>  
      <PARAMETER NAME="document" RUNTIME="true" COMMENT="The document to be  
processed">gate.Document</PARAMETER>  
      <PARAMETER NAME="annotationSetName" RUNTIME="true" COMMENT="The  
annotation set to be used for the generated  
annotations" OPTIONAL="true">java.lang.String</PARAMETER>  
      <PARAMETER NAME="listsURL"  
DEFAULT="gate:/creole/gazeteer/default/lists.def" COMMENT="The URL to the  
file with list of  
lists" SUFFIXES="def">java.net.URL</PARAMETER>  
      <PARAMETER DEFAULT="UTF-8" NAME="encoding" COMMENT="The encoding used  
for reading the  
definitions">java.lang.String</PARAMETER>  
      <PARAMETER DEFAULT="true" NAME="caseSensitive" COMMENT="Should this  
gazetteer diferentiate on case. Currently the  
Gazetteer works only in case sensitive mode.">java.lang.Boolean</PARAMETER>  
      <ICON>shefGazetteer.gif</ICON>  
    </RESOURCE>

9.5 Flexible Gazetteer [#]

The Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer. For example, the user might want to replace words in the text with their base forms (which is an output of the Morphological Analyser) or to segment a Chinese text (using the Chinese Tokeniser) before running the Gazetteer on the Chinese text.

The Flexible Gazetteer performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. It is important to use an external gazetteer as this allows the use of any type of gazetteer (e.g. an Ontological gazetteer).

Input to the Flexible Gazetteer:

Runtime parameters:

Once the external gazetteer has annotated text with Lookup annotations, Lookup annotations on the temporary document are converted to Lookup annotations on the original document. Finally the temporary document is deleted.

9.6 Gazetteer List Collector [#]

The gazetteer list collector collects occurrences of entities directly from a set of annotated training texts, and populates gazetteer lists with the entities. The entity types and structure of the gazetteer lists are defined as necessary by the user. Once the lists have been collected, a semantic grammar can be used to find the same entities in new texts.

An empty list must be created first for each annotation type, if no list exists already. The set of lists must be loaded into GATE before the PR can be run. If a list already exists, the list will simply be augmented with any new entries. The list collector will only collect one occurrence of each entry: it first checks that the entry is not present already before adding a new one.

There are 4 runtime parameters:

Figure 9.1 shows a screenshot of a set of lists collected automatically for the Hindi language. It contains 4 lists: Person, Organisation, Location and a list of stopwords. Each list has a majorType whose value is the type of list, a minorType ”inferred” (since the lists have been inferred from the text), and the language ”Hindi”.


PIC


Figure 9.1: Lists collected automatically for Hindi

The list collector also has a facility to split the Person names that it collects into their individual tokens, so that it adds both the entire name to the list, and adds each of the tokens to the list (i.e. each of the first names, and the surname) as a separate entry. When the grammar annotates Persons, it can require them to be at least 2 tokens or 2 consecutive Person Lookups. In this way, new Person names can be recognised by combining a known first name with a known surname, even if they were not in the training corpus. Where only a single token is found that matches, an Unknown entity is generated, which can later be matched with an existing longer name via the orthomatcher component which performs orthographic coreference between named entities. This same procedure can also be used for other entity types. For example, parts of Organisation names can be combined together in different ways. The facility for splitting Person names is hardcoded in the file gate/src/gate/creole/GazetteerListsCollector.java and is commented.

9.7 Tree Tagger [#]

The TreeTagger is a language-independent part-of-speech tagger, which currently supports English, French, German, Spanish, Italian and Bulgarian (although the latter two are not available in GATE). It is integrated with GATE using a GATE CREOLE wrapper, originally designed by the CLaC lab (Computational Linguistics at Concordia), Concordia University, Montreal (http://www.cs.concordia.ca/research/researchgroups/clac.php).

The GATE wrapper calls TreeTagger as an external program, passing gate Tokens as input, and adding two new features to them, which hold the features as described below:

The TreeTagger plugin works on any platform that supports the tree tagger tool, including Linux, Mac OS X and Windows, but the GATE-specific scripts require a POSIX-style Bourne shell with the gawk, tr and grep commands, plus Perl for the Spanish tagger. For Windows this means that you will need to install the appropriate parts of the Cygwin environment from http://www.cygwin.com and set the system property treetagger.sh.path to contain the path to your sh.exe (typically C:\cygwin\bin\sh.exe). If this property is set, the TreeTagger plugin runs the shell given in the property and passes the tagger script as its first argument; without the property, the plugin will attempt to run the shell script directly, which fails on Windows with a cryptic “error=193”. For the GATE GUI, put the following line in build.properties (see section 3.3, and note the extra backslash before each backslash and colon in the path):

run.treetagger.sh.path: C\:\\cygwin\\bin\\sh.exe

Figure 9.2 shows a screenshot of a French document processed with the TreeTagger.


PIC

Figure 9.2: a TreeTagger processed document


9.7.1 POS tags

For English the POS tagset is a slightly modified version of the Penn Treebank tagset, where the second letter of the tags for verbs distinguishes between ”be” verbs (B), ”have” verbs (H) and other verbs (V).

The tagsets for French, German, Italian, Spanish and Bulgarian can be found in the original TreeTagger documenation at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html..

9.8 Stemmer [#]

The stemmer plugin consists of a set of stemmers PRs for the following 11 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. These take the form of wrappers for the Snowball stemmers freely available from http://snowball.tartarus.org. Each Token is annotated with a new feature ”stem”, with the stem for that word as its value. The stemmers should be run as other PRs, on a document that has been tokenised.

There are three runtime parameters which should be set prior to executing the stemmer on a document.

9.8.1 Algorithms

The stemmers are based on the Porter stemmer for English [Porter 80], with rules implemented in Snowball e.g.

define Step_1a as  
( [substring] among (  
 ’sses’ (<-’ss’)  
’ies’ (<-’i’)  
’ss’ () ’s’  (delete)  
 )  
 

9.9 GATE Morphological Analyzer [#]

The Morphological Analyser PR can be found in the Tools plugin. It takes as input a tokenized GATE document. Considering one token and its part of speech tag, one at a time, it identifies its lemma and an affix. These values are than added as features on the Token annotation. Morpher is based on certain regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE1 in a programming language called Flex. Morpher has a capability to interepret these rules with an extension of allowing users to add new rules or modify the existing ones based on their requirements. In order to allow these operations with as little effort as possible, we changed the way these rules are written. More information on how to write these rules is explained later in Section 9.9.1.

Two types of parameters, Init-time and run-time, are required to instantiate and execute the PR.

9.9.1 Rule File [#]

GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections.

  1. Variables
  2. Rules

Variables

The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:

  1. Range With this type of variable, theuser can specify the range of characters. e.g. A ==> [-a-z0-9]
  2. Set With this type of variable, user can also specify a set of characters, where one character at a time from this set is used as a value for the given variable. When this variable is used in any regular expression, all values are tried one by one to generate the string which is compared with thecontents of the document. e.g. A ==> [abcdqurs09123]
  3. Strings Where in the two types explained above, variables can hold only one character from the given set or range at a time, this allows specifying strings as possibilities for the variable. e.g. A ==> ”bb” OR ”cc” OR ”dd”

Rules

All rules are declared under the section defineRules. Every rule has two parts, LHS and RHS. The LHS specifies the regular expresssion and the RHS the function to be called when the LHS matches with the given word. ”==>” is used as delimeter between the LHS and RHS.

The LHS has the following syntax:

< * ”—”verb”—”noun>< regularexpression >.

User can specify which rule to be considered when the word is identified as ”verb” or ”noun”. ”*” indicates that the rule should be considered for all part-of-speech tags. If the part-of-speech should be used to decide if the rule should be considered or not can be enabled or disabled by setting the value of considerPOSTags option. Combination of any string along with any of the variables declared under the defineVars section and also the Klene operators, ”+” and ”*”, can be used to generate the regular expressions. Below we give few examples of L.H.S. expressions.

On the RHS of the rule, the user has to specify one of the functions from those listed below. These rules are hard-coded in the Morph PR in GATE and are invoked if the regular expression on the LHS matches with any particular word.

9.10 MiniPar Parser [#]

MiniPar is a shallow parser. In its shipped version, it takes one sentence as an input and determines the dependency relationships between the words of a sentence. It parses the sentence and brings out the information such as:

In the version of MiniPar integrated in GATE, it generates annotations of type“DepTreeNode” and the annotations of type “[relation]” that exists between the head and the child node. The document is required to have annotations of type “Sentence”, where each annotation consists of a string of the sentence.

Minipar takes one sentence at a time as an input and generates the tokens of type “DepTreeNode”. Later it assigns relation between these tokens. Each DepTreeNode consists of feature called “word”: this is the actual text of the word.

For each and every annotation of type “[Rel]”, where ‘Rel’ is obj, pred etc. This is the name of the dependency relationship between the child word and the head word (see Section 9.10.5). Every “[Rel]” annotation is assigned four features:

Figure 9.3 shows a MiniPar annotated document in GATE.


PIC

Figure 9.3: a MiniPar annotated document


9.10.1 Platform Supported

MiniPar in GATE is supported for the Linux and Windows operating systems. Trying to instantiate this PR on any other OS will generate the ResourceInstantiationException.

9.10.2 Resources

MiniPar in GATE is shipped with four basic resources:

9.10.3 Parameters

The MiniPar wrapper takes six parameters:

9.10.4 Prerequisites

The MiniPar wrapper requires the MiniPar library to be available on the underlying Linux/Windows machine. It can be downloaded from the MiniPar homepage.

9.10.5 Grammatical Relationships [#]

appo    "ACME president, --appo-> P.W. Buckman"  
aux "should <-aux-- resign"  
be  "is <-be-- sleeping"  
c   "that <-c-- John loves Mary"  
comp1   first complement  
det "the <-det ‘-- hat"  
gen "Jane’s <-gen-- uncle"  
i   the relationship between a C clause and its I clause  
inv-aux     inverted auxiliary: "Will <-inv-aux-- you stop it?"  
inv-be      inverted be: "Is <-inv-be-- she sleeping"  
inv-have    inverted have: "Have <-inv-have-- you slept"  
mod the relationship between a word and its adjunct modifier  
pnmod       post nominal modifier  
p-spec      specifier of prepositional phrases  
pcomp-c     clausal complement of prepositions  
pcomp-n     nominal complement of prepositions  
post        post determiner  
pre         pre determiner  
pred        predicate of a  clause  
rel         relative clause  
vrel        passive verb modifier of nouns  
wha, whn, whp:  wh-elements at C-spec positions  
obj         object of verbs  
obj2    second object of ditransitive verbs  
subj    subject of verbs  
s   surface subjec

9.11 RASP Parser [#]

RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English, developed by the Natural Language and Computational Linguistics group at the University of Sussex.

This plugin, developed by DigitalPebble, provides four wrapper PRs that call the RASP modules as external programs, as well as a JAPE component that translates the output of the ANNIE POS Tagger (section 8.5).

RASP2 Tokenizer
This PR requires Sentence annotations and creates Token annotations with a string feature. Note that sentence-splitting must be carried out before tokenization; the the RegEx Sentence Splitter (see section 8.4) is suitable for this. (Alternatively, you can use the ANNIE Tokenizer (section 8.1) and then the ANNIE Sentence Splitter (section 8.3); their output is compatible with the other PRs in this plugin).
RASP2 POS Tagger
This requires Token annotations and creates WordForm annotations with pos, probability, and string features.
RASP2 Morphological Analyser
This requires WordForm annotations (from the POS Tagger) and adds lemma and suffix features.
RASP2 Parser
This requires the preceding annotation types and creates multiple Dependency annotations to represent a parse of each sentence.
RASP POS Converter
This PR requires Token annotations with a category feature as produced by the ANNIE POS Tagger (see section 8.5 and creates WordForm annotations in the RASP Format. The ANNIE POS Tagger and this Converter can together be used as a substitute for the RASP2 POS Tagger.

Here are some examples of corpus pipelines that can be correctly constructed with these PRs.

  1. RegEx Sentence Splitter
  2. RASP2 Tokenizer
  3. RASP2 POS Tagger
  4. RASP2 Morphological Analyser
  5. RASP2 Parser

  1. RegEx Sentence Splitter
  2. RASP2 Tokenizer
  3. ANNIE POS Tagger
  4. RASP POS Converter
  5. RASP2 Morphological Analyser
  6. RASP2 Parser

  1. ANNIE Tokenizer
  2. ANNIE Sentence Splitter
  3. RASP2 POS Tagger
  4. RASP2 Morphological Analyser
  5. RASP2 Parser

  1. ANNIE Tokenizer
  2. ANNIE Sentence Splitter
  3. ANNIE POS Tagger
  4. RASP POS Converter
  5. RASP2 Morphological Analyser
  6. RASP2 Parser

Futher documentation is included in the directory gate/plugins/rasp/doc/.

The RASP package, which provides the external programs, is available from the RASP web page.

RASP is only supported for Linux operating systems. Trying to run it on any other operating systems will generate an exception with the message: “The RASP cannot be run on any other operating systems except Linux.”

It must be correctly installed on the same machine as GATE, and must be installed in a directory whose path does not contain any spaces (this is a requirement of the RASP scripts as well as the wrapper). Before trying to run scripts for the first time, edit rasp.sh and rasp_parse.sh to set the correct value for the shell variable RASP, which should be the file system pathname where you have installed the RASP tools (for example, RASP=/opt/RASP or RASP=/usr/local/RASP. You will need to enter the same path for the initialization parameter raspHome for the POS Tagger, Morphological Analyser, and Parser PRs.

(On some systems the arch command used in the scripts is not available; a work-around is to comment that line out and add arch=’ix86_linux’, for example.)

(The previous version of the RASP plugin can now be found in plugins/Obsolete/rasp.)

9.12 SUPPLE Parser (formerly BuChart) [#]

The BuChart parser has been removed and replaced by SUPPLE: The Sheffield University Prolog Parser for Language Engineering. If you have an application which uses BuChart and wish to upgrade to a later version of GATE than 3.1 you must upgrade your application to use SUPPLE.

SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the ’best’ parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g. lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE Wrapper stores syntactic infomation produced by the parser in the gate document in the form of parse annotations containing a bracketed representation of the parse; and semantics annotations that contains the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see section 9.12.4).

9.12.1 Requirements

The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licenced SICStus prolog (http://www.sics.se/sicstus, SUPPLE supports versions 3 and 4), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.

9.12.2 Building SUPPLE

The SUPPLE plugin must be compiled before it can be used, so you will require a suitable Java SDK (GATE itself requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the SUPPLE directory under plugins, and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the SUPPLE directory and runing:

../../bin/ant swi

(on Windows, use ..\..\bin\ant). For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.

9.12.3 Running the parser in GATE

In order to parse a document you will need to construct an application that has:

Note that prior to GATE 3.1, the parser file parameter was of type java.io.File. From 3.1 it is of type java.net.URL. If you have a saved application (.gapp file) from before GATE 3.1 which includes SUPPLE it will need to be updated to work with the new version. Instructions on how to do this can be found in the README file in the SUPPLE plugin directory.

9.12.4 Viewing the parse tree [#]

GATE provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded, then open a document that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select “Edit with syntax tree viewer”. This viewer can also be used with the constituency output of the Stanford Parser PR (section 9.13).

9.12.5 System properties [#]

The SICStusProlog (3 and 4) and SWIProlog implementations work by calling the native prolog executable, passing data back and forth in temporary files. The location of the prolog executable is specified by a system property:

If your prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI prolog is typically installed as pl, most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.

For details of how to pass system properties to the GATE GUI, see the end of section 3.3.

9.12.6 Configuration files [#]

Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file.

Mapping file

The mapping file specifies how annotations produced using Gate are to be passed to the parser. The file is composed of a number of pairs of lines, the first line in a pair specifies a Gate annotation we want to pass to the parser. It includes the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType. The second line of the pair specifies how to encode the Gate annotation in a SUPPLE syntactic category, this line also includes a number of features and values. As an example consider the mapping:

Gate;AnnotationType=Token;category=DT;string=&S  
SUPPLE;category=dt;m_root=&S;s_form=&S

It specifies how a determinant (’DT’) will be translated into a category ’dt’ for the parser. The construct ’&S’ is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically a token like ’The’ recognised as a DT by the POS-tagging will be mapped into the following category:

dt(s_form:’The’,m_root:’The’,m_affix:’_’,text:’_’).

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S  
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

It specified that an annotation of type ’Lookup’ in Gate is mapped into a category ’list_np’ with specific features and values. More specifically a token like ’Mary’ identified in Gate as a Lookup will be mapped into the following SUPPLE category:

list_np(s_form:’Mary’,m_root:’_’,m_affix:’_’,  
text:’_’,ne_tag:’person’,ne_type:’person_first’,gender:’female’).

Feature table [#]

The feature table file specifies SUPPLE ’lexical’ categories and its features. As an example an entry in this file is:

n;s_form;m_root;m_affix;text;person;number

which specifies which features and in which order a noun category should be writen. In this case:

n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:....).

9.12.7 Parser and Grammar [#]

The parser builds a semantic representation compositionally, and a ‘best parse’ algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature valued grammar. Each Category entry has the form:

Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type (see Section  6.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.

Syntactic rules are specifed in Prolog with the predicate rule(LHS,RHS) where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD N (“a basic noun phrase head is composed of a noun”) is writen as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N),  
[n(m_root:R,number:N)]).

where the feature ’sem’ is used to construct the semantics while the parser processes input, and E, R, and N are variables tobe instantiated during parsing.

The full grammar of this distribution can be found in the prolog/grammar directory, the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built and the compied version is used for parsing.

9.12.8 Mapping Named Entities

SUPPLE has a prolog grammar which deals with named entities, the only information required is the Lookup annotations produced by Gate, which are specified in the mapping file. However, you may want to pass named entities identified with your own Jape grammars in Gate. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between Gate named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:

Gate;AnnotationType=Date;string=&S  
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity ’Date’ into a syntactic category ’sem_cat’. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule for example:

rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),[  
sem_cat(s_form:F,text:TEXT,type:’Date’,kind:KIND,name:NAME)]).

is used to parse a ’Date’ into a named entity in SUPPLE which in turn will be parsed into a noun phrase.

9.12.9 Upgrading from BuChart to SUPPLE

In theory upgrading from BuChart to SUPPLE should be relatively straightforward. Basically any instance of BuChart needs to be replaced by SUPPLE. Specific changes which must be made are:

Making these changes to existing code should be trivial and allow application to benefit from future improvements to SUPPLE.

9.13 Stanford Parser [#]

The Stanford Parser is a probabilistic parsing system implemented in Java by Stanford University’s Natural Language Processing Group. Data files are available from Stanford for parsing Arabic, Chinese, English, and German.

This plugin, developed by the GATE team, provides a PR (gate.stanford.Parser) that acts as a wrapper around the Stanford Parser (version 1.6.1) and translates GATE annotations to and from the data structures of the parser itself. The plugin is supplied with the unmodified jar file and one English data file obtained from Stanford. Stanford’s software itself is subject to the full GPL.

The parser itself can be trained on other corpora and languages, as documented on the website, but this plugin does not provide a means of doing so. Trained data files are not compatible between different versions of the parser; in particular, note that you need version 1.6.1 data files for GATE builds numbered above 3120 (when we upgraded the plugin to Stanford version 1.6.1 on 22 January 2009) but version 1.6 files for earlier versions, including Release 5.0 beta 1.

Creating multiple instances of this PR in the same JVM with different trained data files does not work—the PRs can be instantiated, but runtime errors will almost certainly occur.

9.13.1 Input requirements

Documents to be processed by the Parser PR must already have Sentence and Token annotations, such as those produced by either ANNIE Sentence Splitter (sections 8.3 and 8.4) and the ANNIE English Tokeniser (section 8.1).

If the reusePosTags parameter is true, then the Token annotations must have category features with compatible POS tags. The tags produced by the ANNIE POS Tagger are compatible with Stanford’s parser data files for English (which also use the Penn treebank tagset).

9.13.2 Initialization parameters

parserFile
the path to the trained data file; the default value points to the English data file2 included with the GATE distribution. You can also use other files downloaded from the Stanford Parser website or produced by training the parser.
mappingFile
the optional path to a mapping file: a flat, two-column file which the wrapper can use to “translate” tags. A sample file is included.3 By default this value is null and mapping is ignored.
tlppClass
an implementation of TreebankLangParserParams, used by the parser itself to extract the dependency relations from the constituency structures. The default value is compatible with the English data file supplied. Please refer to the Stanford NLP Group’s documentation and the parser’s javadoc for a further explanation.

9.13.3 Runtime parameters

annotationSetName
the name of the annotationSet used for input (Token and Sentence annotations) and output (SyntaxTreeNode and Dependency annotations, and category and dependencies features added to Tokens).
debug
a boolean value which controls the verbosity of the wrapper’s output.
reusePosTags
if true, the wrapper will read category features (produced by an earlier POS-tagging PR) from the Token annotations and force the parser to use them.
useMapping
if this is true and a mapping file was loaded when the PR was initialized, the POS and syntactic tags produced by the parser will be translated using that file. If no mapping file was loaded, this parameter is ignored.

The following boolean parameters switch on and off the various types of output that the parser can produce. Any or all of them can be true, but if all are false the PR will simply print a warning to save time (instead of running the parser).

addPosTags
if this is true, the wrapper will add category features to the Token annotations.
addConstituentAnnotations
if true, the wrapper will mark the syntactic constituents with SyntaxTreeNode annotations that are compatible with the Syntax Tree Viewer (see section 9.12.4).
addDependencyAnnotations
if true, the wrapper will add Dependency annotations to indicate the dependency relations in the sentence.
addDependencyFeatures
if true, the wrapper will add dependencies features to the Token annotations to indicate the dependency relations in the sentence.

The parser will derive the dependency structures only if either or both of the dependency output options is enabled, so if you do not need the dependency analysis, you can disable both of them and the PR will run faster.

Two sample GATE applications for English are included in the plugins/Stanford directory: sample_parser_en.gapp runs the Regex Sentence Splitter and ANNIE Tokenizer and then this PR to annotate constituency and dependency structures, whereas sample_pos+parser_en.gapp also runs the ANNIE POS Tagger and makes the parser re-use its POS tags.

9.14 Montreal Transducer [#]

Many of the key features introduced in the Montreal Transducer (MT) have now been ported in some form into the standard JAPE transducer. If you are considering using the MT, you should first check the documentation for the standard transducer in chapter 7 to see if that is suitable for your needs. Being such a core part of GATE, the standard JAPE transducer is likely to be more stable and bugs will be fixed more rapidly than with the MT.

The Montreal Transducer is an improved Jape Transducer, developed by Luc Plamondon, Université de Montréal. It is intended to make grammar authoring easier by providing a more flexible version of the JAPE language and it also fixes a few bugs. Full details of the transducer can be found at http://www.iro.umontreal.ca/ plamondl/mtltransducer/. We summarise the main features below.

9.14.1 Main Improvements

9.14.2 Main Bug fixes

({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})*:names

9.15 Language Plugins [#]

There are plugins available for processing the following languages: French, German, Spanish, Italian, Chinese, Arabic, Romanian, Hindi and Cebuano. Some of the applications are quite basic and just contain some useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are more like toy systems built as part of an exercise in language portability.

Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager, and the relevant resources will be available in the Processing Resources set.

Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.

In most cases you will also find a directory in the relevant plugin directory called data which contains some sample texts (in some cases, these are annotated with NEs).

9.15.1 French Plugin [#]

The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp) , and one which does not (french.gapp). Simply load the application required from the plugins/french directory. You do not need to load the plugin itself from the plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 9.7 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results, they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/french/data directory.

9.15.2 German Plugin [#]

The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp) , and one which does not (german.gapp). Simply load the application required from the plugins/german/resources directory. You do not need to load the plugin itself from the plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 9.7 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/german/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.

9.15.3 Romanian Plugin [#]

The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/romanian/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.

9.15.4 Arabic Plugin [#]

The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/arabic/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that there are two types of gazetteer used in this application: one which was derived automatically from training data (Arabic inferred gazetteer), and one which was created manually. Note that there are some other applications included which perform quite specific tasks (but can generally be ignored). For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 9.6.

9.15.5 Chinese Plugin [#]

The Chinese plugin contains a simple application for Chinese NE recognition (chinese.gapp). Simply load the application from the plugins/chinese/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other application to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).

9.15.6 Hindi Plugin [#]

The Hindi plugin contains a set of resources for basic Hindi NE recognition which mirror the ANNIE resources but are customised to the Hindi language. You need to have the ANNIE plugin loaded first in order to load any of these PRs. With the Hindi, you can create an application similar to ANNIE but replacing the ANNIE PRs with the default PRs from the plugin.

9.16 Chemistry Tagger [#]

This GATE module is designed to tag a number of chemistry items in running text. Currently the tagger tags compound formulas (e.g. SO2, H2O, H2SO4 ...) ions (e.g. Fe3+, Cl-) and element names and symbols (e.g. Sodium and Na). Limited support for compound names is also provided (e.g. sulphur dioxide) but only when followed by a compound formula (in parenthesis or commas).

9.16.1 Using the tagger

The Tagger requires the Creole plugin ”Chemistry_Tagger” to be loaded. It requires the following PRs to have been run first: tokeniser and sentence splitter. There are four init parameters giving the locations of the two gazetteer list definitions, the element mapping file and the JAPE grammar used by the tagger (in previous versions of the tagger these files were fixed and loaded from inside the ChemTagger.jar file). Unless you know what you are doing you should accept the default values.

The annotations added to documents are ”ChemicalCompound”, ”ChemicalIon” and ”ChemicalElement” (currently they are always placed in the default annotation set). By default ”ChemicalElement” annotations are removed if they make up part of a larger compound or ion annotation. This behaviour can be changed by setting the removeElements parameter to false so that all recognised chemical elements are annotated.

9.17 Flexible Exporter [#]

The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be found, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.

At load time, the following parameters can be set for the flexible exporter:

The following runtime parameters can also be set (after the file has been selected for the application):

9.18 Annotation Set Transfer [#]

The Annotation Set Transfer allows copying or moving annotations to a new annotation set if they lie between the beginning and the end of an annotation of a particular type (the covering annotation). For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.

The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:

For example, suppose we wish to perform named entity recognition on only the text covered by the BODY annotation from the Original Markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document, because since these resources do not depend on any other annotations, we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokenise and gazetteer are initially placed in the Default annotation set).

9.19 Information Retrieval in GATE [#]

GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for ”Bush” will return documents with higher relevance, compared to a search in the content for the string ”bush”. The current implementation is based on the most popular open source full-text search engine - Lucene (available at http://jakarta.apache.org/lucene/) but other implementations may be added in the future.

An Information Retrieval system is most often considered a system that accepts as input a set of documents (corpus) and a query (combination of search terms) and returns as input only those documents from the corpus which are considered as relevant according to the query. Usually, in addition to the documents, a proper relevance measure (score) is returned for each document. There exist many relevance metrics, but usually documents which are considered more relevant, according to the query, are scored higher.

Figure 9.4 shows the results from running a query against an indexed corpus in GATE.


PIC


Figure 9.4: Documents with scores, returned from a search over a corpus


Information Retrieval systems usually perform some preprocessing one the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as:








Term1Term2......termk






Doc1 w1,1 w1,2 ...... w1,k






Doc2 w2,1 w2,1 ...... w2,k






... ... ... ...... ...






... ... ... ...... ...






docn wn, 1 wn,2 ...... wn,k







Table 9.1:

where doci is a document from the corpus, termj is a word that is considered as important and representative for the document and wi,j is the weight assigned to the term in the document. There are many ways to define the term weight functions, but most often it depends on the term frequency in the document and in the whole corpus (i.e. the local and the global frequency). Note that the machine learning plugin described in Chapter 11 can produce such document-term matrix (for detailed description of the matrix produced see Section 11.5.4).

Note that not all of the words appearing in the document are considered terms. There are many words (called ”stop-words”) which are ignored, since they are observed too often and are not representative enough. Such words are articles, conjunctions, etc. During the preprocessing phase which identifies such words, usually a form of stemming is performed in order to minimize the number of terms and to improve the retrieval recall. Various forms of the same word (e.g. ”play”, ”playing” and ”played”) are considered identical and multiple occurrences of the same term (probably ”play”) will be observed.

It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.

IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 13.4 for more details).

9.19.1 Using the IR functionality in GATE

In order to run queries against a corpus, the latter should be ”indexed”. The indexing process first processes the documents in order to identify the terms and their weights (stemming is performed too) and then creates the proper structures on the local filesystem. These file structures contain indexes that will be used by Lucene (the underlying IR engine) for the retrieval.

Once the corpus is indexed, queries may be run against it. Subsequently the index may be removed and then the structures on the local filesytem are removed too. Once the index is removed, queries cannot be run against the corpus.

Indexing the corpus

In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in a RDBMS datastores (though support for the lattr may be added in the future).

To index the corpus, follow these steps:


PIC


Figure 9.5: Indexing a corpus by specifying the index location and indexed features (and content)


Querying the corpus

To query the corpus, follow these steps:

Removing the index

An index for a corpus may be removed at any time from the ”Remove Index” option of the context menu for the indexed corpus (right button click).

9.19.2 Using the IR API

The IR API within GATE makes it possible for corpora to be indexed, queried and results returned from any Java application, without using the GATE GUI. The following sample indexes a corpus, runs a query against it and then removes the index.

 
// open a serial data store  
SerialDataStore sds =  
Factory.openDataStore("gate.persist.SerialDataStore",  
"/tmp/datastore1");  
sds.open();  
 
//set an AUTHOR feature for the test document  
Document doc0 = Factory.newDocument(new URL("/tmp/documents/doc0.html"));  
doc0.getFeatures().put("author","John Smit");  
 
Corpus corp0 = Factory.newCorpus("TestCorpus");  
corp0.add(doc0);  
 
//store the corpus in the serial datastore  
Corpus serialCorpus = (Corpus) sds.adopt(corp0,null);  
sds.sync(serialCorpus);  
 
//index the corpus -  the content and the AUTHOR feature  
 
IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;  
 
DefaultIndexDefinition did = new DefaultIndexDefinition();  
did.setIrEngineClassName(gate.creole.ir.lucene. LuceneIREngine.class.getName());  
did.setIndexLocation("/tmp/index1");  
did.addIndexField(new IndexField("content", new DocumentContentReader(), false));  
did.addIndexField(new IndexField("author", null, false));  
indexedCorpus.setIndexDefinition(did);  
 
indexedCorpus.getIndexManager().createIndex();  
//the corpus is now indexed  
 
//search the corpus  
Search search = new LuceneSearch();  
search.setCorpus(ic);  
 
QueryResultList res = search.search("+content:government +author:John");  
 
//get the results  
Iterator it = res.getQueryResults();  
while (it.hasNext()) {  
QueryResult qr = (QueryResult) it.next();  
System.out.println("DOCUMENT_ID="+ qr.getDocumentID() +",   scrore="+qr.getScore());  
}  

9.20 Crawler [#]

The crawler plugin enables GATE to be used for a corpus that is built using a web crawl. The crawler itself is Websphinx.This is a JAVA based multi-threaded web crawler that can be customized for any application. In order to use this plugin it may be required that the websphinx.jar file be added in the required libraries in JBuilder.

The basic idea is to be able to specify a source URL and a depth to build the initial corpus upon which further processing could be done. The PR itself provides a number of helpful features to set various parameters of the crawl.

9.20.1 Using the Crawler PR

In order to use the processing resource you first need to load the plugin using the plugin manager. Then load the crawler from the list of processing resources. User needs to create a corpus in which he or she wants to store crawled documents. In order to use the crawler, create a simple pipeline (note: do not create a corpus pipeline) and add the crawl PR to the pipeline.

Once the crawl PR is created there will be a number of parameters that can be set based on the PR required (see also Figure 9.6).


PIC


Figure 9.6: Crawler parameters


Once the parameters are set, the crawl can be run and the documents fetched are added to the specified corpus. Figure 9.7 shows the crawled pages added to the corpus.


PIC


Figure 9.7: Crawled pages added to the corpus


9.21 Google Plugin [#]

The Google API is now integrated with GATE, and can be used as a PR-based plugin. This plugin allows the user to query Google and build the document corpus that contains the search results returned by Google for the query. There is a limit of 1000 queries per day as set by Google. For more information about the Google API please refer to http://www.google.com/apis/. In order to use the Google PR, you need to register with Google to obtain a license key.

The Google PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find out what are the different named entities that can be associated with a particular individual. In this example, the user could build the collection of documents by querying Google and then running ANNIE over the collection. This would annotate the results and show what are the different Organization, Location and other entities that can be associated with the query.

9.21.1 Using the GooglePR

In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Google API License key. Please use the license key assigned to you by registering with Google.

Once the Google PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Google PR just initialized as above. There are a number of parameters to be set at runtime:

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus as shown in Figure 9.8.


PIC


Figure 9.8: URLs added to the corpus


9.22 Yahoo Plugin [#]

The Yahoo API is now integrated with GATE, and can be used as a PR-based plugin. This plugin allows the user to query Yahoo and build the document corpus that contains the search results returned by Yahoo for the query. For more information about the Yahoo API please refer to http://developer.yahoo.com/search/. In order to use the Yahoo PR, you need to obtain an application ID.

The Yahoo PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find out what are the different named entities that can be associated with a particular individual. In this example, the user could build the collection of documents by querying Yahoo and then running ANNIE over the collection. This would annotate the results and show what are the different Organization, Location and other entities that can be associated with the query.

9.22.1 Using the YahooPR

In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Yahoo Application ID. Please use the license key assigned to you by registering with Yahoo.

Once the Yahoo PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Yahoo PR just initialized as above. There are a number of parameters to be set at runtime:

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus.

9.23 WordNet in GATE [#]


PIC


Figure 9.9: WordNet in GATE – results for “bank”



PIC


Figure 9.10: WordNet in GATE


At present GATE supports only WordNet 1.6, so in order to use WordNet in GATE, you must first install WordNet 1.6 on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing one special xml file that is used internally by JWNL. This file describes the location of your local copy of the WordNet 1.6 index files. An example of this wn-config.xml file is shown below:

 
<?xml version="1.0" encoding="UTF-8"?>  
 
<jwnl_properties language="en">  
        <version publisher="Princeton" number="1.6" language="en"/>  
        <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">  
                <param name="morphological_processor" value="net.didion.jwnl.dictionary.DefaultMorphologicalProcessor"/>  
                <param name="file_manager" value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">  
                        <param name="file_type" value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>  
                        <param name="dictionary_path" value="e:\wn16\dict"/>  
                </param>  
        </dictionary>  
        <dictionary_element_factory class="net.didion.jwnl.princeton.data.PrincetonWN16DictionaryElementFactory"/>  
        <resource class="PrincetonResource"/>  
</jwnl_properties>  

All you have to do is to replace the value of the dictionary_path parameter to point to your local installation of WordNet 1.6.

After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE, load the WordNet plugin via the plugins menu. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).

Once Word Net is loaded in GATE, the well-known interface of WordNet will appear. You can search Word Net by typing a word in the box next to to the label “SearchWord”’ and then pressing “Search”. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word “play”, the buttons “Noun”, “Verb” and “Adjective” are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms, meronyms for nouns or verb groups, and cause for verbs, etc. Selecting an item from the menu will display the results in the window below.

More information about WordNet can be found at http://www.cogsci.princeton.edu/wn/index.shtml

More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet

An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/GateExamples/doc/index.html

9.23.1 The WordNet API

GATE offers a set of classes that can be used to access the WordNet 1.6 Lexical Base. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 9.11. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:


PIC

Figure 9.11: The Wordnet API


9.24 Machine Learning in GATE [#]

Note: A brand new machine learning layer specifically targetted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning has been added to GATE. See chapter 11 for more details.

9.24.1 ML Generalities

This section describes the use of Machine Learning (ML) algorithms in GATE.

An ML algorithm ”learns” about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (and unforeseen) examples of the phenomenon.

Classification is a particular example of machine learning in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute the new examples into the existing classes.

This is the type of ML that is used in GATE and all further references to ML actually refer to classification.

Some definitions

GATE-specific interpretation of the above definitions

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. statistical model, a decision tree, a rule set, etc.) from a dataset of already classified instances. During application, the model built while training is used to classify new instances.

There are ML algorithms which permit the incremental building of the model (e.g. the Updateable Classifiers in the WEKA library). These classifiers do not require the entire training dataset to build a model; the model improves with each new training instance that the algorithm is provided with.

9.24.2 The Machine Learning PR in GATE

Access to ML implementations is provided in GATE by the ”Machine Learning PR” that handles both the training and application of ML model on GATE documents. This PR is a Language Analyser so it can be used in all default types of GATE controllers.

In order to allow for more flexibility, all the configuration parameters for the ML PR are set through an external XML file and not through the normal PR parameterisation. The root element of the file needs to be called ”ML-CONFIG” and it contains two elements: ”DATASET” and ”ENGINE”. An example XML configuration file is given in Appendix F.

The DATASET element

The DATASET element defines the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An ”INSTANCE-TYPE” element is used to select the annotation type to be used for instances, and the attributes are defined by a sequence of ”ATTRIBUTE” elements.

For example, if an ”INSTANCE-TYPE” has a ”Token” for value, there will one instance in the dataset per ”Token”. This also means that the positions (see below) are defined in relation to Tokens. The ”INSTANCE-TYPE” can be seen as the smallest unit to be taken into account for the Machine Learning.

An ATTRIBUTE element has the following sub-elements:

The VALUES being defined as XML entities, the characters <, > and & must be replaced by &lt;, &rt; and &amp;. It is recommended to write the XML configuration file in UTF-8 in order to have some uncommon character correctly parsed.

Semantically, there are three types of attributes:

Figure 9.12 gives some examples of what the values of specified attributes would be in a situation when ”Token” annotations are used as instances.


PIC


Figure 9.12: Sample attributes and their values


An ATTRIBUTELIST element is similar to ATTRIBUTE except that it has no POSITION sub-element but a RANGE element. This will be converted into several ATTRIBUTELIST with position ranging from the value of the attribute ”from” to the value of the attribute ”to”. This can be used in order to avoid the duplication of ATTRIBUTE elements.

The ENGINE element

The ENGINE element defines which particular ML implementation will be used, and allows the setting of options for that particular implementation.

The ENGINE element has three sub-elements:

9.24.3 The WEKA Wrapper

GATE provides a wrapper for the WEKA ML Library (http://www.cs.waikato.ac.nz/ml/weka/) in the form of the gate.creole.ml.weka.Wrapper class.

Options for the WEKA wrapper

The WEKA wrapper accepts the following options:

9.24.4 Training an ML model with the ML PR and WEKA wrapper

The ML PR has a Boolean runtime parameter named ”training”. When the value of this parameter is set to true, the PR will collect a dataset of instances from the documents on which it is run. If the classifier used is an updatable classifier then the ML model will be built while collecting the dataset. If the selected classifier is not updatable, then the model will be built the first time a classification is attempted.

Training a model consists of designing a definition file for the ML PR, and creating an application containing an ML PR. When the application is run over a corpus, the dataset (and the model if possible) is built.

9.24.5 Applying a learnt model

Using the same ML PR, set the ”training” parameter to false and run your application.

Depending on the type of the attribute that is marked as class, different actions will be performed when a classification occurs:

Once a model is learnt, it can be saved and reloaded at a later time. The WEKA wrapper also provides an operation for saving only the dataset in the ARFF format, which can be used for experiments in the WEKA interface. This could be useful for determining the best algorithm to be used and the optimal options for the selected algorithm.

9.24.6 The MAXENT Wrapper [#]

GATE also provides a wrapper for the Open NLP MAXENT library (http://maxent.sourceforge.net/about.html). The MAXENT library provides an implementation of the maximum entropy learning algorithm, and can be accessed using the gate.creole.ml.maxent.MaxentWrapper class.

The MAXENT library requires all attributes except for the class attribute to be boolean, and that the class attribute be boolean or nominal. (It should be noted that, within maximum entropy terminology, the class attribute is called the ’outcome’.) Because the MAXENT library does not provide a specific format for data sets, there is no facility to save or load data sets separately from the model, but if there should be a need to do this, the WEKA wrapper can be used to collect the data.

Training a MAXENT model follows the same general procedure as for WEKA models, but the following difference should be noted. MAXENT models are not updateable, so the model will always be created and trained the first time a classification is attempted. The training of the model might take a considerable amount of time, depending on the amount of training data and the parameters of the model.

Options for the MAXENT Wrapper

9.24.7 The SVM Light Wrapper [#]

From version 3.0, GATE provides a wrapper for the SVM Light ML system (http://svmlight.joachims.org). SVM Light is a support vector machine implementation, written in C, which is provided as a set of command line programs. The GATE wrapper takes care of the mundane work of converting the data structures between GATE and SVM Light formats, and calls the command line programs in the right sequence, passing the data back and forth in temporary files. The <WRAPPER> value for this engine is gate.creole.ml.svmlight.SVMLightWrapper.

The SVM Light binaries themselves are not distributed with GATE – you should download the version for your platform from http://svmlight.joachims.org and place svm_learn and svm_classify on your path.

Classifying documents using the SVMLightWrapper is a two phase procedure. In its first phase, SVMWrapper collects data from the pre-annotated documents and builds the SVM model using the collected data to classify the unseen documents in its second phase. Below we describe briefly an example of classifying the start time of the seminar in a corpus of email announcing seminars and provide more details later in the section.

Figure 9.13 explains step by step the process of collecting training data for the SVM classifier. GATE documents, which are pre-annotated with the annotations of type Class and feature type=’stime’, are used as the training data. In order to build the SVM model, we require start and end annotations for each stime annotation. We use pre-processor JAPE transduction script to mark the sTimeStart and sTimeEnd annotations on stime annotations. Following this step, the Machine Learning PR (SVMLightWrapper) with training mode set to true collects the training data from all training documents. GATE corpus pipeline, given a set of documents and PRs to execute on them, executes all PRs one by one only on one document at a time. Unless provided in a separate pipleline, it makes it impossible to send all training data (i.e. collected from all documents) altogether to the SVMWrapper using the same pipeline to build the SVM model. This results into the model not being built at the time of collecting training data. The state of the SVMWrapper can be saved to an external file once the training data is collected.


PIC

Figure 9.13: Flow diagram explaining the SVM training data collection


Before classifying any unseen document, SVM requires the SVM model to be available. In the absence of an up-to-date SVM model, SVMWrapper builds a new one using a command line SVM_learn utility and the training data collected from the training corpus. In other words, the first SVM model is built when user tries to classify the first document. At this point the user has an option to save the model somewhere on the external storage. This is in order to reload the model prior to classifying other documents and to avoid rebuilding of the SVM model everytime the user classifies a new set of documents. Once the model becomes available, SVMWrapper classifies the unseen documents which creates new sTimeStart and sTimeEnd annotations over the text. Finally, a post-processor JAPE transduction script is used to combine them into the sTime annotation. Figure 9.14 explains this process.


PIC

Figure 9.14: Flow diagram explaining document classifying process


The wrapper allows support vector machines to be created which either do boolean classification or regression (estimation of numeric parameters), and so the class attribute can be boolean or numeric. Additionally, when learning a classifier, SVM Light supports transduction, whereby additional examples can be presented during training which do not have the value of the class attribute marked. Presenting such examples can, in some circumstances, greatly improve the performance of the classifier. To make use of this within GATE, the class attribute can be a three value nominal, in which case the first value specified for that nominal in the configuration file will be interpreted as true, the second as false and the third as unknown. Transduction will be used with any instances for which this attribute is set to the unknown value. It is also possible to use a two value nominal as the class attribute, in which case it will simply be interpreted as true or false.

The other attributes can be boolean, numeric or nominal, or any combination of these. If an attribute is nominal, each value of that attribute maps to a separate SVM Light feature. Each of these SVM Light features will be given the value 1 when the nominal attribute has the corresponding value, and will be omitted otherwise. If the value of the nominal is not specified in the configuration file or there is no value for an instance, then no feature will be added.

An extension to the basic functionality of SVM Light is that each attribute can receive a weighting. These weighting can be specified in the configuration file by adding <WEIGHTING> tags to the parts of the XML file specifying each attribute. The weighting for the attribute must be specified as a numeric value, and be placed between an opening <WEIGHTING> tag and a closing </WEIGHTING> one. Giving an attribute a greater weighting, will cause it to play a greater role in learning the model and classifying data. This is achieved by multiplying the value of the attribute by the weighting before creating the training or test data that is passed to SVM Light. Any attribute left without an explicitly specified weighting is given a default weighting of one. Support for these weightings is contained in the Machine Learning PR itself, and so is available to other wrappers, though at time of writing only the SVM Light wrapper makes use of weightings.

As with the MAXENT wrapper, SVM Light models are not updateable, so the model will be trained at the first classification attempt. The SVM Light wrapper supports <BATCH-MODE-CLASSIFICATION />, which should be used unless you have a very good reason not to.

The SVM Light wrapper allows both data sets and models to be loaded and saved to files in the same formats as those used by SVM Light when it is run from the command line. When a model is saved, a file will be created which contains information about the state of the SVM Light Wrapper, and which is needed to restore it when the model is loaded again. This file does not, however, contain any information about the SVM Light model itself. If an SVM Light model exists at the time of saving, and that model is up to date with respect to the current state of the training data, then it will be saved as a separate file, with the same name as the file containing information about the state of the wrapper, but with .NativePart appended to the filename. These files are in the standard SVM Light model format, and can be used with SVM Light when it is run from the command line. When a model is reloaded by GATE, both of these files must be available, and in the same directory, otherwise an error will result. However, if an up to date trained model does not exist at the time the model is saved, then only one file will be created upon saving, and only that file is required when the model is reloaded. So long as at least one training instance exists, it is possible to bring the model up to date at any point simply by classifying one or more instances (i.e. running the model with the training parameter set to false).

Options for the SVM Light engine

Only one <OPTIONS> subelement is currently supported:

9.25 MinorThird [#]

MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It was written primarily by William W. Cohen, a professor at Carnegie Mellon University in the Center for Automated Learning and Discovery, though contributions have been made by many other colleagues and students.

Minorthird’s toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. In Minorthird, a collection of documents are stored in a database called a ”TextBase”. Logical assertions about documents in a TextBase can be made, and stored in a special ”TextLabels” object. ”TextLabels” are a type of ”stand off annotation” — unlike XML markup (for instance), the annotations are completely independent of the text. This means that the text can be stored in its original form, and that many different types of (perhaps incompatible) annotations can be associated with the same TextBase.

Each TextLabels annotation asserts a category or property for a word, a document, or a subsequence of words. (In Minorthird, a sequence of adjacent words is called a ”span”.) As an example, these annotations might be produced by human labelers; they might be produced by a hand-writted program, or annotations by a learned program. TextLabels might encode syntactic properties (like shallow parses or part of speech tags) or semantic properties (like the functional role that entities play in a sentence). TextLabels can be nested, much like variable-binding environments can be nested in a programming language, which enables sets of hypothetical or temporary labels to be added in a local scope and then discarded.

Annotated TextBases are accessed in a single uniform way. However, they are stored in one of several schemes. A Minorthird ”repository” can be configured to hold a bunch of TextLabels and their associated TextBases.

Moderately complex hand-coded annotation programs can be implemented with a special-purpose annotation language called Mixup, which is part of Minorthird. Mixup is based on a the widely used notion of cascaded finite state transducers, but includes some powerful features, including a GUI debugging environment, escape to Java, and a kind of subroutine call mechanism. Mixup can also be used to generate features for learning algorithms, and all the text-based learning tools in Minorthird are closely integrated with Mixup. For instance, feature extractors used in a learned named-entity recognition package might call a Mixup program to perform initial preprocessing of text.

Minorthird contains a number of methods for learning to extract and label spans from a document, or learning to classify spans (based on their content or context within a document). A special case of classifying spans is classifying entire documents. Minorthird includes a number of state-of-the-art sequential learning methods (like conditional random fields, and discriminative training methods for training hidden Markov models).

One practical difficulty in using learning techniques to solve NLP problems is that the input to learners is the result of a complex chain of transformations, which begin with text and end with very low-level representations. Verifying the correctness of this chain of derivations can be difficult. To address this problem, Minorthird also includes a number of tools for visualizing transformed data and relating it to the text from which it was derived.

More information about MinorThird can be found at http://minorthird.sourceforge.net/.

9.26 MIAKT NLG Lexicon [#]


PIC

Figure 9.15: Example NLG Lexicon

In order to lower the overhead of NLG lexicon development, we have created graphical tools for editing, storage, and maintenance of NLG lexicons, combined with a model which connects lexical entries to concepts and instances in the ontology. GATE also provides access to existing general-purpose lexicons such as WordNet and thus enables their use in NLG applications.

The structure of the NLG lexicons is similar to that of WordNet. Each lexical entry has a lemma, sense number, and syntactic information associated with it (e.g., part of speech, plural form). Each lexical entry also belongs to a synonym set or synset, which groups together all word senses which are synonymous. For example, as shown in Figure 9.15, the lemma “Magnetic Resonance Imaging scan” has one sense, its part of speech is noun, and it belongs to the synset containing also the first sense of the “MRI scan” lemma. Each synset also has a definition, which is shown in order to help the user when choosing the relevant synset for new word senses.

When the user adds a new lemma to the lexicon, it needs to be assigned to an existing synset. The editor also provides functionality for creating a new synset with part of speech and definition. (see Figure 9.16).


PIC

Figure 9.16: Editing synset information

The advantage of a synset-based lexicon is that while there can be a one-to-one mapping between concepts and instances in the ontology and synsets, the generator can still use different lexicalisations by choosing them among those listed in the synset (e.g., MRI or Magnetic Resonance Imaging). In other words, synsets effectively correspond to concepts or instances in the ontology and their entries specify possible lexicalisations of these concepts/instances in natural language.

At present, the NLG lexicon encodes only synonymy, while other non-lexical relations present in WordNet like hypernymy and hyponymy (i.e., superclass and subclass relations) are instead derived from the ontology, using the mapping between the synsets and concepts/instances. The reason behind this architectural choice comes from the fact that ontology-based generators ultimately need to use the ontology as the knowledge source. In this framework, the role of the lexicon is to provide lexicalisations for the ontology classes and instances.

9.26.1 Complexity and Generality [#]

The lexicon model was kept as generic as possible by making it incorporate only minimal lexical information. Additional, generator-specific information can be stored in a hash table, where values can be retrieved by their name. Since these are generator specific, the current lexicon user interface does not support editing of this information, although it can be accessed and modified programmatically.

On the other hand, the NLG lexicon is based on synonym sets, so generators which subscribe to a different model of synonymy might be able to access GATE-based NLG lexicons only via a wrapper mapping between the two models.

Given that the lexicon structure follows the WordNet synset model, such a lexicon can potentially be used for language analysis, if the application only requires synonymy. Our NLG lexicon model does not support yet the richer set of relations in WordNet such as hypernymy, although it is possible to extend the current model with richer relations. Since we used the lexicon in conjunction with the ontology, such non-linguistic relations were instead taken from the ontology.

The NLG lexicon itself is also independent from the generator’s input knowledge and its format.

9.27 Kea - Automatic Keyphrase Detection [#]

Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.

This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.

In order to use Kea in GATE, the “Kea” plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the “KEA Keyphrase Extractor” (a processing resource) and the “KEA Corpus Importer” (a visual resource associated with the PR).

9.27.1 Using the “KEA Keyphrase Extractor” PR

Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the “KEA Corpus Importer” tool. The usage of this tool is presented in a sub-section below.

Once an annotated corpus is obtained, the “KEA Keyphrase Extractor” PR can be used to build a model:

  1. load a “KEA Keyphrase Extractor”
  2. create a new “Corpus Pipeline” controller.
  3. set the corpus for the controller
  4. set the ‘trainingMode’ parameter for the PR to ‘true’
  5. run the application.

After these steps, the Kea PR contains a trained model. This can be used immediately by switching the ‘trainingMode’ parameter to ‘false’ and running the PR over the documents that need to be annotated with keyphrases. Another possiblity is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the ”Save model” option.

When a previously built model is availalbe, the training procedure does not need to be repeated, the exisiting model can be loaded in memory by selecting the “Load model” option in the PR’s pop-up menu.


PIC

Figure 9.17: Parameters used by the Kea PR


The Kea PR uses several parameters as seen in Figure 9.17:

document
The document to be processed.
inputAS
The input annotation set. This parameter is only relevant when the PR is running in training mode and it specifies the annotation set containing the keyphrase annotations.
outputAS
The output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the ‘trainingMode’ parameter is set to false) and it specifies the annotation set where the generated keyphrase annotations will be saved.
minPhraseLength
the minimum length (in number of words) for a keyphrase.
minNumOccur
the minimum number of occurences of a phrase for it to be a keyphrase.
maxPhraseLength
the maximum length of a keyphrase.
phrasesToExtract
how many different keyphrases should be generated.
keyphraseAnnotationType
the type of annotations used for keyphrases.
dissallowInternalPeriods
should internal periods be dissallowed.
trainingMode
if ‘true’ the PR is running in training mode; otherwise it is running in application mode.
useKFrequency
should the K-frequency be used.

9.27.2 Using Kea corpora

The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the “KEA Corpus Importer” tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR’s name in the resources tree and then selecting the “KEA Corpus Importer” tab, see Figure 9.18.


PIC

Figure 9.18: Options for the “KEA Corpus Importer”


The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.

The user needs to specify a few values:

Source Directory
the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.
Extension for text files
the extension used for text fiels (by default .txt).
Extension for keyphrase files
the extension for the files listing keyphrases.
Encoding for input files
the encoding to be used when reading the files.
Corpus name
the name for the GATE corpus that will be created.
Output annotaion set
the name for the anntoation set that will contain the keyphrases read from the input files.
Keyphrase annotation type
the type for the generated annotations.

9.28 Ontotext JapeC Compiler [#]

Note: the JapeC compiler does not currently support the new JAPE language features introduced in July–September 2008. If you need to use negation, the @length and @string accessors, the contextual operators within and contains, or any comparison operators other than ==, then you will need to use the standard JAPE transducer instead of JapeC.

Japec is an alternative implementation of the JAPE language which works by compiling JAPE grammars into Java code. Compared to the standard implementation, these compiled grammars can be several times faster to run. At Ontotext, a modified version of the ANNIE sentence splitter using compiled grammars has been found to run up to five times as fast as the standard version. The compiler can be invoked manually from the command line, or used through the “Ontotext Japec Compiler” PR in the Jape_Compiler plugin.

The “Ontotext Japec Transducer” (com.ontotext.gate.japec.JapecTransducer) is a processing resource that is designed to be an alternative to the original Jape Transducer. You can simply replace gate.creole.Transducer with com.ontotext.gate.japec.JapecTransducer in your gate application and it should work as expected.

The Japec transducer takes the same parameters as the standard JAPE transducer:

grammarURL
the URL from which the grammar is to be loaded. Note that the Japec Transducer will only work on file: URLs. Also, the alternative binaryGrammarURL parameter of the standard transducer is not supported.
encoding
the character encoding used to load the grammars.
ontology
the ontology used for ontolog-aware transduction.

Its runtime parameters are likewise the same as those of the standard transducer:

document
the document to process.
inputASName
name of the AnnotationSet from which input annotations to the transducer are read.
outputASName
name of the AnnotationSet to which output annotations from the transducer are written.

The Japec compiler itself is written in Haskell. Compiled binaries are provided for Windows, Linux (x86) and Mac OS X (PowerPC), so no Haskell interpreter is required to run Japec on these platforms. For other platforms, or if you make changes to the compiler source code, you can build the compiler yourself using the Ant build file in the Jape_Compiler plugin directory. You will need to install the latest version of the Glasgow Haskell Compiler4 and associated libraries. The japec compiler can then be built by running:

../../bin/ant japec.clean japec

from the Jape_Compiler plugin directory.

9.29 ANNIC [#]

ANNIC (ANNotations-In-Context) is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Data-stores, called Searchable Serial Data-store (SSD).

ANNIC can index documents in any format supported by the GATE system (i.e., XML, HTML, RTF, e-mail, text, etc). Compared with other such query systems, it has additional features addressing issues such as extensive indexing of linguistic information associated with document content, independent of document format. It also allows indexing and extraction of information from overlapping annotations and features. Its advanced graphical user interface provides a graphical view of annotation markups over the text, along with an ability to build new queries interactively. In addition, ANNIC can be used as a first step in rule development for NLP systems as it enables the discovery and testing of patterns in corpora.

ANNIC is built on top of the Apache Lucene5 – a high performance full-featured search engine implemented in Java, which supports indexing and search of large document collections. Our choice of IR engine is due to the customisability of Lucene. For more details on how Lucene was modified to meet the requirements of indexing and querying annotations, please refer to [Aswani et al. 05].

As explained earlier, SSD is an extension of the serial data-store. In addition to the persist location, SSD asks user to provide some more information (explained later) that it uses to index the documents. Once the SSD has been initiated, user can add/remove documents/corpora to the SSD in a similar way it is done with other data-stores. When documents are added to the SSD, it automatically tries to index them. It updates the index whenever there is a change in any of the documents stored in the SSD and removes the document from the index if it is deleted from the SSD. Be warned that only the annotation sets, types and features initially indexed will be updated when adding/removing documents to the datastore. This mean, for example, that if you add a new annotation type in one of the indexed document, it will not appear in the results when searching for it.

SSD has an advanced graphical interface that allows users to issue queries over the SSD. Below we explain the parameters required by SSD and how to instantiate it, how to use its graphical interface and how to use SSD from programmatically.

9.29.1 Instantiating SSD

Steps:

  1. Right click on “Data Stores” and select “Create datastore”.
  2. From a drop-down list select “Lucene Based Searchable DataStore”.
  3. Here, you will see an input window. Please provide these parameters:
    1. DataStore URL: Select an empty folder where the DS is created.
    2. Index Location: Select an empty folder. This is where the index will be created.
    3. Annotation Sets: Here, you can provide one or more annotation sets that you wish to index or exclude from being indexed. In order to be able to index the default annotation set, you must click on the edit list icon and add an empty field to the list. If there are no annotation sets provided, all the annotation sets in all documents are indexed. In addition to all annotation sets a new combined annotation set is created in memory which is a union of all annotations from all the annotation sets of the document being indexed. This set is also indexed in order to allow users to issue queries across various annotation sets.
    4. Base-Token Type: (e.g. Token or Key.Token) These are the basic tokens of any document. Your documents must have the annotations of Base-Token-Type in order to get indexed. These basic tokens are used for displaying contextual information while searching patterns in the corpus. In case of indexing more than one annotation set, user can specify the annotation set from which the tokens should be taken (e.g. Key.Token- annotations of type Token from the annotation set called Key). In case user does not provide any annotation set name (e.g. Token), the system searches in all the annotation sets to be indexed and the base-tokens from the first annotation set with the base token annotations are taken. Please note that the documents with no base-tokens are not indexed. However, if the ”create tokens automatically” option is selected, the SSD creates base-tokens automatically. Here, each string delimited with white space is considered as a token.
    5. Index Unit Type: (e.g. Sentence, Key.Sentence) This specifies the unit of Index. In other words, annotations lying within the boundaries of these annotations are indexed (e.g. in the case of “Sentences”, no annotations that are spanned across the boundaries of two sentences are considered for indexing). User can specify from which annotation set the index unit annotations should be considered. If user does not provide any annotation set, the SSD searches among all annotation sets for index units. If this field is left empty or SSD fails to locate index units, the entire document is considered as a single unit.
    6. Features: Finally, users can specify the annotation types and features that should be indexed or excluded from being indexed. (e.g. SpaceToken and Split). If user wants to exclude only a specific feature of a specific annotation type, he/she can specify it using a ’.’ separator between the annotation type and its feature (e.g. Person.matches).
  4. Click OK. If all parameters are OK, a new empty DS will be created.
  5. Create an empty corpus and save it to the SSD.
  6. Populate it with some documents. Each document added to the corpus and eventually to the SSD is indexed automatically. If the document does not have the required annotations, that document is skipped and not indexed.

9.29.2 Search GUI


PIC

Figure 9.19: Searchable Serial Datastore Viewer.


Overview

Figure 9.19 gives a snapshot of the GUI. The top section contains a text area to write a query, options to select the input data and the output format and two icons to execute and delete a query. The central section shows a graphical visualisation of annotations and values of the result selected in the bottom results table. You can also see the annotation rows manager window where you define which annotation type and feature to display in the central section. The bottom section contains the results table of the query, i.e. the text that matches the query with their left and right contexts, the annotation set and the document. The bottom section contains also a tabbed panes of statistics.

Syntax of queries

SSD enables you to formulate versatile queries using JAPE patterns. JAPE patterns support various query formats. Below we give a few examples of JAPE pattern clauses which can be used as SSD queries. Actual queries can also be a combination of one or more of the following pattern clauses:

  1. String
  2. {AnnotationType}
  3. {AnnotationType == String}
  4. {AnnotationType.feature == feature value}
  5. {AnnotationType1, AnnotationType2.feature == featureValue}
  6. {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}

JAPE patterns also support the | (OR) operator. For instance, {A} ({B}|{C}) is a pattern of two annotations where the first is an annotation of type A followed by the annotation of type either B or C. ANNIC supports two operators, + and *, to specify the number of times a particular annotation or a sub pattern should appear in the main query pattern. Here, ({A})+n means one and up to n occurrences of annotation {A} and ({A})*n means zero or up to n occurrences of annotation {A}.

Below we explain the steps to search in SDS.

  1. Double click on SSD. You will see an extra tab “Lucene DataStore Searcher”. Click on it to activate the searcher GUI.
  2. Here you can specify a query to search in your SSD. The query here is a L.H.S. part of the JAPE grammar. Please refer to the following example queries:
    1. {Person} – This will return annotations of type Person from the SSD
    2. {Token.string == “Microsoft”} – This will return all occurrences of “Microsoft” from the SSD.
    3. {Person}({Token})*2{Organization} – Person followed by zero or upto two tokens followed by Organization.
    4. {Token.orth==“upperInitial”, Organization} – Token with feature orth with value set to “upperInitial” and which is also annotated as Organization.


PIC

Figure 9.20: Searchable Serial Datastore Viewer - Auto-completion.


Top section

A text-area located in the top left part of the GUI is used to input a query. You can copy/cut/paste with Control+C/X/V, undo/redo your changes with Control+Z/Y as usual. To add a new line, use Control+Enter combination keys.

Auto-completion shown on the figure 9.20 for annotation type is triggered when typing ’{’ and for feature when typing ’.’ after a valid annotation type. It shows only the annotation types and features related to the selected corpus and annotation set. If you right-click on an expression it will automatically select the shortest valid enclosing brace and if you click on a selection it will propose you to add quantifiers for allowing the expression to appear zero, one or more times.

To execute the query, click on the magnifying glass icon, use Enter key or Alt+Enter combination keys. To delete the query, click on the trash icon or use Alt+Backspace combination keys.

It is possible to have more than one corpus, each containing a different set of documents, stored in a single data-store. ANNIC, by providing a drop down box with a list of stored corpora, also allows searching within a specific (selected) corpus. Similarly a document can have more than one annotation set indexed and therefore ANNIC also provides a drop down box with a list of indexed annotation sets for the selected corpus.

A large corpus can have many hits for a given query. This may take a long time to refresh the GUI and may create inconvenience while browsing through patterns. ANNIC therefore allows you to specify a number of patterns that you wish to retrieve at once and provides a way to iterate through next pages with the Next Page of Results button. Due to technical complexities, it is not possible to visit a previous page. It is however possible to tick a check-box for retrieving all the results at the same time.

Central section

Annotation types and features to show can be selected from the annotation rows manager in clicking on the Modify Rows button in the central section. When you choose to show a feature of an annotation (e.g. feature category for annotation type Token), the central section shows colored rectangles where the annotation type are existing containing values of those features. When you choose to show only one annotation type in letting the feature column empty then all its features are displayed with empty rectangles that show their features values in a pop-up window when the mouse is over the rectangles.

Shortcuts are expression that stand for an ”AnnotationType.Feature” expression. For example, on the figure 9.19, the shortcut ”POS” stands for the expression ”Token.category”. The purpose is to make the query more readable.

When you left-clicks on any of the rectangles of the annotations rows, the respective query expression is placed at the caret position in the query text area or replace the selected expression, if any. You can also click on a word on the first line to add it to the query.

Bottom section

In the table of results, ANNIC shows each pattern retrieve from the last query executed on a row and provides a tool tip that shows the query that the selected pattern refers to.

Along with its left and right context texts, it also lists the names of the document and the annotation set that the patterns come from. When the focus changes from one row to another, the central section is updated accordingly. You can sort a table column in clicking on its header.

You can remove a result from the results table or open the document containing it in right-clicking on a result in the results table.

ANNIC provides an Export button to export in an HTML file all the results or only the selected results.

A statistics tabbed pane can be displayed on the bottom-right when clicking on the Statistics button. There is always a global statistics pane that list the count the occurrences of all annotation types for the selected corpus and annotation set.

Statistics can be obtain in 16 different ways for the datastore, matched spans of the query in the results, with or without contexts and for an annotation type, a annotation type + feature or an annotation type + feature + value. A second pane contains the one item statistics that you can add in right-clicking on a non empty rectangle or on the header of a row in the central section. You can sort a table column in clicking on its header.

9.29.3 Using SSD from your code

 
//how to instantiate a searchabledatastore  
===============================  
 
// create an instance of datastore  
LuceneDataStoreImpl ds = (LuceneDataStoreImpl)  
Factory.createDataStore(‘‘gate.persist.LuceneDataStoreImpl’’, dsLocation);  
 
// we need to set Indexer  
Indexer indexer = new LuceneIndexer(new URL(indexLocation));  
 
// set the parameters  
Map parameters = new HashMap();  
 
// specify the index url  
parameters.put(Constants.INDEX_LOCATION_URL, new URL(indexLocation));  
 
// specify the base token type  
// and specify that the tokens should be created automatically  
// if not found in the document  
parameters.put(Constants.BASE_TOKEN_ANNOTATION_TYPE, ‘‘Token’’);  
parameters.put(Constants.CREATE_TOKENS_AUTOMATICALLY, new Boolean(true));  
 
// specify the index unit type  
parameters.put(Constants.INDEX_UNIT_ANNOTATION_TYPE, ‘‘Sentence’’);  
 
// specifying the annotation sets "Key" and "Default Annotation Set"  
// to be indexed  
List<String> setsToInclude = new ArrayList<String>();  
setsToInclude.add("Key");  
setsToInclude.add("<null>");  
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_INCLUDE, setsToInclude);  
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_EXCLUDE, new ArrayList<String>());  
 
// all features should be indexed  
parameters.put(Constants.FEATURES_TO_INCLUDE, new ArrayList<String>());  
parameters.put(Constants.FEATURES_TO_EXCLUDE, new ArrayList<String>());  
 
// set the indexer  
ds.setIndexer(indexer, parameters);  
 
// set the searcher  
ds.setSearcher(new LuceneSearcher());  
 
//how to search in this datastore  
//======================  
// obtain the searcher instance  
Searcher searcher = ds.getSearcher();  
Map parameters  = new HashMap();  
 
// obtain the url of index  
String indexLocation =  
new File(((URL) ds.getIndexer().getParameters().get(Constants.INDEX_LOCATION_URL))  
.getFile()).getAbsolutePath();  
ArrayList indexLocations = new ArrayList();  
indexLocations.add(indexLocation);  
 
 
// corpus2SearchIn = mention corpus name that was indexed here.  
 
// the annotation set to search in  
String annotationSet2SearchIn = "Key";  
 
// set the parameter  
parameters.put(Constants.INDEX_LOCATIONS,indexLocations);  
parameters.put(Constants.CORPUS_ID, corpus2SearchIn);  
parameters.put(Constants.ANNOTATION_SET_ID, annotationSet);  
parameters.put(Constants.CONTEXT_WINDOW, contextWindow);  
parameters.put(Constants.NO_OF_PATTERNS, noOfPatterns);  
 
// search  
String query = ‘‘{Person}’’;  
Hit[] hits = searcher.search(query, parameters);  

9.30 Annotation Merging [#]

If we got annotations about the same subject on the same document from different annotators, we need to merge those annotations to form a unified annotations. The merging is applied to the annotations with the same annotation type but in different annotation sets of the same document. We implemented two approaches for annotation merging. The first method takes a parameter numMinK and selects the annotation on which at least numMinK annotators agree. If two or more merged annotations have the same span, then the annotation with the most supporters is kept and other annotations with the same span are discarded. The second method selects one annotation from those annotations with the same span, which the majority of the annotators support. Note that if one annotator did not create the annotation with the particular span, we count it as one non-support of the annotation with the span. If it turns out that the majority of the annotators did not support the annotation with that span, then no annotation with the span would be put into the merged annotations.

9.30.1 Two implemented methods

The following two static methods in the class gate.util.AnnotationMerging are for the merging methods. The two methods have very similar input and output parameters. Each of the methods takes an array of annotation sets, which should be the same annotation type on the same document from different annotators, as input. If there is one annotation feature indicating the annotation label, the name of the annotation feature is another input. Otherwise, set the input parameter for the annotation feature as null. The output is a map the key of which is one merged annotation and the value of which represents the annotators (in terms of the indices of the array of annotation sets) who support the annotation. The method also has a boolean input parameter to indicate if or not the annotations from different annotators are based on the same set of instances, which can be determined by the static method public boolean isSameInstancesForAnnotators(AnnotationSet[] annsA) in the class gate.util.IaaCalculation. One instance corresponds to all the annotations with the same span. If the annotation sets are based on the same set of instances, the merging methods will ensure that the merged annotations are on the same set of instances.

9.30.2 Annotation Merging Plugin

The annotation merging methods are wrapped in the plugin such that they can be used in a PR in the GATE GUI. The plugin can be used as a PR in an application of pipeline or corpus pipeline. To use the PR, each document in the pipeline or the corpus pipeline should have the annotation sets for merging. The annotation merging PR has no loading parameter but has several run-time parameters specifying the annotation merging task, explained in the following.

9.31 OntoRoot Gazetteer [#]

OntoRoot Gazetteer is a type of a dynamically created gazetteer that is, in combination with few other generic GATE resources, capable of producing ontology-aware annotations over the given content with regards to given ontology. This gazetteer is a part of Ontology_Based_Gazetteer plugin that has been developed as a part of TAO project.

9.31.1 How does it work? [#]

To produce ontology-aware annotations i.e. annotations that link to the specific concepts or relations from the ontology, it is essential to pre-process the Ontology Resources (e.g., Classes, Instances, Properties) and extract their human-understandable lexicalisations.

As a precondition for extracting human-understandable content from the ontology first a list of the following is being created:

Each item from the list is further processed so that:

Each item from this list is analysed separately by the Onto Root Application (ORA) on execution (see figure 9.21). The Onto Root Application first tokenises each linguistic term, then assigns part-of-speech and lemma information to each token.


PIC   

Figure 9.21: Building Ontology Resource Root (OntoRoot) Gazetteer from the Ontology


As a result of that pre-processing, each token in the terms will have additional feature named ’root’, which contains the lemma as created by the morphological analyser. It is this lemma or a set of lemmas which are then added to the dynamic gazetteer list, created from the ontology.

For instance, if there is a resource with a short name (i.e., fragment identifier) ProjectName, without any assigned properties the created list before executing the OntoRoot gazetteer collection will contain following the strings:

Each of the item from the list is then analysed separately and the results would be the same as the input strings, as all of entries are nouns given in singular form.

9.31.2 Initialisation of OntoRoot Gazetteer [#]

To initialise the gazetteer there are few mandatory parameters:

and few optional ones:

9.32 Chinese Word Segmentation [#]

Unlike English, Chinese text does not have a symbol (or delimiter) such as blank space to explicitly separate a word from the surrounding words. Therefore, for automatic Chinese text processing, we may need a system to recognise the words in Chinese text, a problem known as Chinese word segmentation. The plugin described in this section performs the task of Chinese word segmentation. It is based on our work using the Perceptron learning algorithm for Chinese word segmentation task of the Sighan 20057. [Li et al. 05c]. Our Perceptron based system has achieved very good performance in the Sighan-05 task.

The plugin has the name as ChineseSegmenter and is available in the GATE distribution. The corresponding processing resource’s name is Chinese Segmenter PR. Once you load the PR into GATE, you may put it into a Pipeline application8. The plugin can be used to learn a model from the segmented Chinese text as training data. It can also use the learned model to segment Chinese text. The plugin can use different learning algorithms to learn different models. It can deal with different codes for Chinese text, such as UTF-8, GB2312 or BIG5. All those difference options can be selected by setting the run-time parameters of the plugin.

The plugin has five run-time parameters, which are described in the following.

The following PAUM models are available for the plugin and can be downloaded from the website http://www.dcs.shef.ac.uk/~yaoyong/. In detail, those models were learned using the PAUM learning algorithm from the corpora provided by Sighan-05 bakeoff task.

As you can see, those models were learned using different training data and different Chinese text codes of the same training data. The PKU training data are the news articles published in the mainland China and use the simplified Chinese, while the AS training data are the news articles published in Taiwan and use the traditional Chinese. Hence, if your text are in the simplified Chinese, you can use the models trained by the PKU data. For example, if your text are in the traditional Chinese, you need use the models trained by the AS data. If your data are in GB2312 code or any compatible code, you need use the model trained by the corpus in GB2312 code.

Note that the segmented Chinese text (either used as training data or produced by this plugin) use the blank space to separate a word from its surrounding words. Hence, if your data are in the Unicde such as UTF-8, you can use the GATE Unicode Tokeniser to process the segmented text to add the Token annotations into your text to represent the Chinese words. Once you get the annotations for all the Chinese words, you can perform further processing such as POS tagging and named entity recogntion.

9.33 Copying Annotations Between Documents [#]

Sometimes a document has two copies, each of which was annotated by two annotators for the same task. Then we want to copy the annotations in one copy to another copy of document, in order to save them using less memory or to use the annotation merging plugin or IAA plugin to process them. This plugin does exactly this task – it copies the specified annotations from one document to another document.

The plugin is named as copyAS2AnoDoc and is available with the GATE distribution. When loading the plugin into GATE, it represented as the processing resource Copy Anns to Another Doc PR. You need to put the PR into a Corpus Pipeline to use it. The plugin does not have any initialisation parameter. It has several run-time parameters, which specify the annotations to be copied, the source documents and target documents. In detail, the run-time parameters are:

The Corpus parameter of the Corpus Pipeline application containing the plugin specifies a corpus which contains the target documents. Given one (target) document in the corpus, the plugin tries to find a source document in the source directory specified by the parameter sourceFilesURL, according to the similarity of the names of the source and target documents. The similarity of two file names is calculated by comparing the two strings of names from the start to the end of the strings. The two names have greater similarity if they share more characters from the beginning of the strings. For example, suppose two target documents have the names aabcc.xml and abcab.xml and the three source files have the names abacc.xml, abcbb.xml and aacc.xml, respectively. Then the target document aabcc.xml has the corresponding source document aacc.xml, and abcab.xml has the corresponding source document abcbb.xml. The plugin should copy the annotations within the document if the source and target directories are the same.

1shef.nlp.supple.prolog.SICStusProlog exists for backwards compatibility and behaves the same as SICStusProlog3.

2resources/englishPCFG.ser.gz

3resources/english-tag-map.txt

4GHC version 6.4.1 was used to build the supplied binaries for Windows, Linux and Mac

5http://lucene.apache.org

6An ontology resource is usually identified by an URI concatenated with a set of characters starting with ’#’. This set of characters is called fragment identifier. For example, if the URI of a class representing GATE POS Tagger is: ’http://gate.ac.uk/ns/gate-ontology#POSTagger’, the fragment identifier will be ’POSTagger’.

7See http://www.sighan.org/bakeoff2005/ for the Sighan-05 task

8You may put the plugin into a Corpus Pipeline. Note that the plugin does not process the documents contained in the corpus which is assigned to the corpus pipeline. In stead, it will process the documents in a directory specified by a run-time parameter. However, you still need to make sure that the assigned corpus contains at least one document in order to allow the plugin process the specified documents.