Chapter 21
More (CREOLE) Plugins [#]
For the previous reader was none other than myself. I had already read this book long ago.
The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf.
…
But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, because as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.
Three Stories and a Reflection, Patrick Suskind, 1995 (pp. 82, 86).
This chapter describes additional CREOLE resources which do not form part of ANNIE, and have not been covered in previous chapters.
21.1 Verb Group Chunker [#]
The rule-based verb chunker is based on a number of grammars of English [Cobuild 99, Azar 89]. We have developed 68 rules for the identification of non recursive verb groups. The rules cover finite (’is investigating’), non-finite (’to investigate’), participles (’investigated’), and special verb constructs (’is going to investigate’). All the forms may include adverbials and negatives. The rules have been implemented in JAPE. The finite state analyser produces an annotation of type ‘VG’ with features and values that encode syntactic information (‘type’, ‘tense’, ‘voice’, ‘neg’, etc.). The rules use the output of the POS tagger as well as information about the identity of the tokens (e.g. the token ‘might’ is used to identify modals).
The grammar for verb group identification can be loaded as a Jape grammar into the GATE architecture and can be used in any application: the module is domain independent. The grammar file is located within the ANNIE plugin, in the directory plugins/ANNIE/resources/VP.
21.2 Noun Phrase Chunker [#]
The NP Chunker application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution) which attempts to insert brackets marking noun phrases in text which have been marked with POS tags in the same format as the output of Eric Brill’s transformational tagger. The output from this version should be identical to the output of the original C++/Perl version released by Ramshaw and Marcus.
For more information about baseNP structures and the use of transformation-based learning to derive them, see [Ramshaw & Marcus 95].
21.2.1 Differences from the Original
The major difference is the assumption is made that if a POS tag is not in the mapping file then it is tagged as ‘I’. The original version simply failed if an unknown POS tag was encountered. When using the GATE wrapper the chunk tag can be changed from ‘I’ to any other legal tag (B or O) by setting the unknownTag parameter.
21.2.2 Using the Chunker
The Chunker requires the Creole plugin ‘Parser_NP_Chunking’ to be loaded. The two loadtime parameters are simply urls pointing at the POS tag dictionary and the rules file, which should be set automatically. There are five runtime parameters which should be set prior to executing the chunker.
- annotationName: name of the annotation the chunker should create to identify noun phrases in the text.
- inputASName: The chunker requires certain types of annotations (e.g. Tokens with part of speech tags) for identifying noun chunks. This parameter tells the chunker which annotation set to use to obtain such annotations from.
- outputASName: This is where the results (i.e. new noun chunk annotations will be stored).
- posFeature: Name of the feature that holds POS tag information. ’
- unknownTag: it works as specified in the previous section.
The chunker requires the following PRs to have been run first: tokeniser, sentence splitter, POS tagger.
21.3 TaggerFramework [#]
The Tagger Framework is an extension of work originally developed in order to provide support for the TreeTagger plugin within GATE. Rather than focusing on providing support for a single external tagger this plugin provides a generic wrapper that can easily be customised (no Java code is required) to incorporate many different taggers within GATE.
The plugin currently provides example applications (see plugins/Tagger_Framework/resources) for the following taggers: GENIA (a biomedical tagger), Hunpos (providing support for English and Hungarian), TreeTagger (supporting German, French, Spanish and Italian as well as English), and the Stanford Tagger (supporting English, German and Arabic).
The basic idea behind this plugin is to allow the use of many external taggers. Providing such a generic wrapper requires a few assumptions. Firstly we assume that the external tagger will read from a file and that the contents of this file will be one annotation per line (i.e. one token or sentence per line). Secondly we assume that the tagger will write it’s response to stdout and that it will also be based on one annotation per line – although there is no assumption that the input and output annotation types are the same.
An important issue with most external taggers is tokenisation: Generally, when using a native GATE tagger in a pipeline, “Token” annotations are first generated by a tokeniser, and then processed by a POS tagger. Most external taggers, on the other hand, have built-in code to perform their own tokenisation. In this case, there are generally two options: (1) use the tokens generated by the external tagger and import them back into GATE (typically into a “Token” annotation type). Or (2), if the tagger accepts pre-tokenised text, the Tagger Framework can be configured to pass the annotations as generated by a GATE tokeniser to the external tagger. For details on this, please refer to the ‘updateAnnotations’ runtime parameter described below. However, if the tokenisation strategies are significantly different, this may lead to a degradation of the tagger’s performance.
- Initialization Parameters
- preProcessURL: The URL of a JAPE grammar that should be run over each document before running the tagger.
- postProcessURL: The URL of a JAPE grammar that should be run over each document after running the tagger. This can be used, for example, to add chunk annotations using IOB tags output by the tagger and stored as features on Token annotations.
- Runtime Parameters
- debug: if set to true then a whole heap of useful information will be printed to the messages tab as the tagger runs. Defaults to false.
- encoding: this must be set to the encoding that the tagger expects the input/output files to use. If this is incorrectly set is highly likely that either the tagger will fail or the results will be meaningless. Defaults to ISO-8859-1 as this seems to be the most commonly required encoding.
- failOnUnmappableCharacter: What to do if a character is encountered in the document which cannot be represented in the selected encoding. If the parameter is true (the default), unmappable characters cause the wrapper to throw an exception and fail. If set to false, unmappable characters are replaced by question marks when the document is passed to the tagger. This is useful if your documents are largely OK but contain the odd character from outside the Latin-1 range.
- inputTemplate: template string describing how to build the line of input for the tagger corresponding to a single annotation. The template contains placeholders of the form ${feature} which will be replaced by the value of the corresponding feature from the annotation. The default template is ${string}, which simply passes the string feature of each annotation to the tagger. Typical variants would be ${string}\t${category} for an entity tagger that requires the string and the part of speech tag for each token, separated by a tab1. If a particular annotation does not have one of the specified features, the corresponding slot in the template will be left blank (i.e. replaced by an empty string). It is only an error if a particular annotation contains none of the features specified by the template.
- regex: this should be a Java regular expression that matches a single line in the output from the tagger. Capturing groups should be used to define the sections of the expression which match the useful output.
- featureMapping: this is a mapping from feature name to capturing group in the regular expression. Each feature will be added to the output annotations with a value equal to the specified capturing group. For example, the TreeTagger uses a regular expression (.+)\t(.+)\t(.+) to capture the three column output. This is then combined with the feature mapping {string=1, category=2, lemma=3} to add the appropriate feature/values to the output annotations.
- inputASName: the name of the annotation set which should be used for input. If not specified the default (i.e. un-named) annotation set will be used.
- inputAnnotationType: the name of the annotation used as input to the tagger. This will usually be Token. Note that the input annotations must contain a string feature which will be used as input to the tagger. Tokens usually have this feature but if, for example, you wish to use Sentence as the input annotation then you will need to add the string feature. JAPE grammars for doing this are provided in plugins/Tagger_Framework/resources.
- outputASName: the name of the annotation set which should be used for output. If not specified the default (i.e. un-named) annotation set will be used.
- outputAnnotationType: the name of the annotation to be provided as output. This is usually Token.
- taggerBinary: a URL indicating the location of the external tagger. This is usually a shell script which may perform extra processing before executing the tagger. The plugins/Tagger_Framework/resources directory contains example scripts (where needed) for the supported taggers. These scripts may need editing (for example, to set the installation directory of the tagger) before they can be used.
- taggerDir: the directory from which the tagger must be executed. This can be left unspecified.
- taggerFlags: an ordered set of flags that should be passed to the tagger as command line options
- updateAnnotations: If set to true then the plugin will attempt to update existing output annotations. This can fail if the output from the tagger and the existing annotations are created differently (i.e. the tagger does its own tokenization). Setting this option to false will make the plugin create new output annotations, removing any existing ones, to prevent the two sets getting out of sync. This is also useful when the tagger is domain specific and may do a better job than GATE. For example, the GENIA tagger is better at tokenising biomedical text than the ANNIE tokeniser. Defaults to true.
By default the GenericTagger PR simply tries to execute the taggerBinary using the normal Java Runtime.exec() mechanism. This works fine on Unix-style platforms such as Linux or Mac OS X, but on Windows it will only work if the taggerBinary is a .exe file. Attempting to invoke other types of program fails on Windows with a rather cryptic “error=193”.
To support other types of tagger programs such as shell scripts or Perl scripts, the GenericTagger PR supports a Java system property shell.path. If this property is set then instead of invoking the taggerBinary directly the PR will invoke the program specified by shell.path and pass the tagger binary as the first command-line parameter.
If the tagger program is a shell script then you will need to install the appropriate interpreter, such as sh.exe from the cygwin tools, and set the shell.path system property to point to sh.exe. For GATE Developer you can do this by adding the following line to build.properties (see Section 2.3, and note the extra backslash before each backslash and colon in the path):
Similarly, for Perl or Python scripts you should install a suitable interpreter and set shell.path to point to that.
You can also run taggers that are invoked using a Windows batch file (.bat). To use a batch file you do not need to use the shell.path system property, but instead set the taggerBinary runtime parameter to point to C:\WINDOWS\system32\cmd.exe and set the first two taggerFlags entries to “/c” and the Windows-style path to the tagger batch file (e.g. C:\MyTagger\runTagger.bat). This will cause the PR to run cmd.exe /c runTagger.bat which is the way to run batch files from Java.
In general most of the complexities of configuring a number of external taggers has already been determined and example pipelines are provided in the plugin’s resources directory. To use one of the supported taggers simply load one of the exampl applications and then check the runtime parameters of the Tagger_Framework PR in order to set paths correctly to your copy of the tagger you wish to use.
Some taggers require more complex configuration, details of which are covered in the remainder of this section.
21.3.1 TreeTagger - Multilingual POS Tagger
The TreeTagger is a language-independent part-of-speech tagger, which supports a number of different languages through parameter files, including English, French, German, Spanish, Italian and Bulgarian. Originally made available in GATE through a dedicated wrapper, it is now fully supported through the Tagger Framework. You must install the TreeTagger separately from
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
Avoid installing it in a directory that contains spaces in its path.
Tokenisation and Command Scripts. When running the TreeTagger through the Tagger Framework, you can choose between passing Tokens generated within GATE to the TreeTagger for POS tagging or let the TreeTagger perform tokenisation as well, importing the generated Tokens into GATE annotations. If you need to pass the Tokens generated by GATE to the TreeTagger, it is important that you create your own command scripts to skip the tokenisation step done by default in the TreeTagger command scripts (the ones in the TreeTagger’s cmd directory). A few example scripts for passing GATE Tokens to the TreeTagger are available under plugins/Tagger_Framework/resources/TreeTagger, for example, tree-tagger-german-gate runs the German parameter file with existing “Token” annotations.
Note that you must set the paths in these command files to point to the location where you installed the TreeTagger:
CMD=/usr/local/durmtools/TreeTagger/cmd
LIB=/usr/local/durmtools/TreeTagger/lib
The Tagger Framework will run the TreeTagger on any platform that supports the TreeTagger tool, including Linux, Mac OS X and Windows, but the GATE-specific scripts require a POSIX-style Bourne shell with the gawk, tr and grep commands, plus Perl for the Spanish tagger. For Windows this means that you will need to install the appropriate parts of the Cygwin environment from http://www.cygwin.com and set the system property treetagger.sh.path to contain the path to your sh.exe (typically C:\cygwin\bin\sh.exe).
POS Tags. For English the POS tagset is a slightly modified version of the Penn Treebank tagset, where the second letter of the tags for verbs distinguishes between ‘be’ verbs (B), ‘have’ verbs (H) and other verbs (V).
The tagsets for other languages can be found on the TreeTagger web site. Figure 21.1 shows a screenshot of a French document processed with the TreeTagger.
Potential Lemma Problems Sometimes the TreeTagger is either completely unable to determine the correct lemma, or may return multiple lemma for a token (separated by a |). In these cases any further processing that relies on the lemma feature (for example, the flexible gazetteer) may not function correctly. Both problems can be alleviated somewhat by using the resources/TreeTagger/fix-treetagger-lemma.jape JAPE grammar. This can be used either as a standalone grammar or as the post-process initialization feature of the Tagger_Framework PR.
21.3.2 GENIA and Double Quotes [#]
Documents that contain double quote characters can cause problems for the GENIA tagger. The issue arises because the in-built GENIA tokenizer converts double quotes to single quotes in the output which then do not match the document content, causing the tagger to fail. There are two possible solutions to this problem.
Firstly you can perform tokenization in GATE and disable the in-built GENIA tokenizer. Such a pipeline is provided as an example in the GENIA resources direcotry; geniatagger-en-no_tokenization.gapp. However, this may result in other problems for your subsequent code. If so, you may want to try the second solution.
The second solution is to use the GENIA tokenization via the other provided example pipeline: geniatagger-en-tokenization.gapp. If your documents do not contain double quotes then this gapp example should work as is. Otherwise, you must modify the GENIA tagger in order not to convert double quotes to single quotes. Fortunately this is fairly straightforward. In the resources directory you will find a modified copy of tokenize.cpp from v3.0.1 of the GENNIA tagger. Simply use this file to replace the copy in the normal GENIA distribution and recompile. For Windows users, a pre-compiled binary is also provided – simply replace your existing binary with this modified copy.
21.4 Chemistry Tagger [#]
This GATE module is designed to tag a number of chemistry items in running text. Currently the tagger tags compound formulas (e.g. SO2, H2O, H2SO4 ...) ions (e.g. Fe3+, Cl-) and element names and symbols (e.g. Sodium and Na). Limited support for compound names is also provided (e.g. sulphur dioxide) but only when followed by a compound formula (in parenthesis or commas).
21.4.1 Using the Tagger
The Tagger requires the Creole plugin ‘Tagger_Chemistry’ to be loaded. It requires the following PRs to have been run first: tokeniser and sentence splitter (the annotation set containing the Tokens and Sentences can be set using the annotationSetName runtime parameter). There are four init parameters giving the locations of the two gazetteer list definitions, the element mapping file and the JAPE grammar used by the tagger (in previous versions of the tagger these files were fixed and loaded from inside the ChemTagger.jar file). Unless you know what you are doing you should accept the default values.
The annotations added to documents are ‘ChemicalCompound’, ‘ChemicalIon’ and ‘ChemicalElement’ (currently they are always placed in the default annotation set). By default ‘ChemicalElement’ annotations are removed if they make up part of a larger compound or ion annotation. This behaviour can be changed by setting the removeElements parameter to false so that all recognised chemical elements are annotated.
21.5 Annotating Numbers [#]
The Tagger_Numbers creole repository contains a number of processing resources which are designed to annotate numbers appearing within documents. As well as annotating a given span as being a number the PRs also determine the exact numeric value of the number and add this as a feature of the annotation. This makes the annotations created by these PRs ideal for building more complex annotations such as measurements or monetary units.
All the PRs in this plugin produce Number annotations with the following standard features
- type: this describes the types of tokens that make up the number, e.g. roman, words, numbers
- value: this is the actual value (stored as a Double) of the number that has been annotated
Each PR might also create other features which are described, along with the PR, in the following sections.
21.5.1 Numbers in Words and Numbers [#]
String | Value |
3^2 | 9 |
101 | 101 |
3,000 | 3000 |
3.3e3 | 3300 |
1/4 | 0.25 |
9^1/2 | 3 |
4x10^3 | 4000 |
5.5*4^5 | 5632 |
thirty one | 31 |
three hundred | 300 |
four thousand one hundred and two | 4102 |
3 million | 3000000 |
fünfundzwanzig | 25 |
The “Numbers Tagger” annotates numbers made up from numbers or numeric words. If that wasn’t really clear enough then Table 21.1 shows numerous ways of representing numbers that can all be annotated by this tagger (depending upon the configuration files used).
To create an instance of the PR you will need to configure the following initialization time parameters (sensible defaults are provided):
- configURL: the URL of the configuration file you wish to use (see below for details), defaults to resources/languages/all.xml which currently provides support for English, French, German, Spanish and a variety of number related Unicode symbols. If you want a single language the you can specify the appropriately named file, i.e. resources/languages/english.xml.
- encoding: the encoding of the configuration file, defaults to UTF-8
- postProcessURL: the URL of the JAPE grammar used for post-processing – don’t change this unless you know what you are doing!
<description>Basic Example</description>
<imports>
<url encoding="UTF-8">symbols.xml</url>
</imports>
<words>
<word value="0">zero</word>
<word value="1">one</word>
<word value="2">two</word>
<word value="3">three</word>
<word value="4">four</word>
<word value="5">five</word>
<word value="6">six</word>
<word value="7">seven</word>
<word value="8">eight</word>
<word value="9">nine</word>
<word value="10">ten</word>
</words>
<multipliers>
<word value="2">hundred</word>
<word value="2">hundreds</word>
<word value="3">thousand</word>
<word value="3">thousands</word>
</multipliers>
<conjunctions>
<word whole="true">and</word>
</conjunctions>
<decimalSymbol>.</decimalSymbol>
<digitGroupingSymbol>,</digitGroupingSymbol>
</config>
The configuration file is an XML document that specifies the words that can be used as numbers or multipliers (such as hundred, thousand, ...) and conjunctions that can then be used to combine sequences of numbers together. An example configuration file can be seen in Figure 21.2. This configuration file specifies a handful of words and multipliers and a single conjunction. It also imports another configuration file (in the same format) defining Unicode symbols. Most of the configuration file is self-explanatory, however, the conjunctions needs further clarification. In English conjunctions are whole words, that is they require white space on either side of them, e.g. three hundred and one. In other languages, however, numbers can be joined into a single word using a conjunction. For example, in German the conjunction ‘und’ can appear in a number without white space, e.g. twenty one is written as einundzwanzig. If the conjunction is a whole word, as in English, then the whole attribute should be set to true, but for conjunctions like ‘und’ the attribute should be set to false.
In order to support different number formats the symbols used to group numbers and to represent the decimal point can also be configured. These are optional elements in the XML configuration file which if not supplied default to a comma for the digit group symbol and a full stop for the decimal point. Whilst these are appropriate for many languages if you wanted, for example, to parse documents written in Bulgarian you would want to specify that the decimal symbol was a command and the grouping symbol was a space in order to recognise numbers such as 1 000 000,303.
Once created an instance of the PR can then be configured using the following runtime parameters:
- allowWithinWords: digits can often occur within words (for example part numbers, chemical equations etc.) where they should not be interpreted as numbers. If this parameter is set to true then these instances will also be annotated as numbers (useful for annotating money and measurements where spaces are often omitted), however, the parameter defaults to false.
- annotationSetName: the annotation set to use as both input and output for this PR (due to the way this PR works the two sets have to be the same)
- failOnMissingInputAnnotations: if the input annotations (Tokens and Sentences) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
- useHintsFromOriginalMarkups: often the original markups will provide hints that may be useful for correctly interpreting numbers within documents (i.e. numeric powers may be in <sup></sup> tags), if this parameter is set to true then these hints will be used to help parse the numbers, defaults to true.
There are no extra annotation features which are specific to this numbers PR. The type feature can take one of three values based upon the text that is annotated; words, numbers, wordsAndNumbers.
21.5.2 Roman Numerals [#]
The “Roman Numerals Tagger” annotates Roman numerals appearing in the document. The tagger is configured using the following runtime parameters:
- allowLowerCase: traditionally Roman numerals must be all in uppercase. Setting this parameter to false, however, allows Roman numerals written in lowercase to also be annotated. This parameter defaults to false.
- maxTailLength: Roman numerals are often used in labelling sections, figures, tables etc. and in such cases can be followed by additional information. For example, Table IVa, Appendix IIIb. These characters are referred to as the tail of the number and this parameter constrains the number of characters that can appear. The default value is 0 in which case strings such as ’IVa’ would not be annotated in any way.
- outputASName: the name of the annotation set in which the Number annotations should be created.
As well as the normal Number annotation features (the type feature will always take the value ‘roman’ Roman numeral annotations also include the following features:
- tail: contains the tail, if any, that appears after the Roman numeral.
21.6 Annotating Measurements [#]
Measurements mentioned in text documents can be difficult to accurately deal with. As well as the numerous ways in which numeric values can be written each type of measurement (distance, area, time etc.) can be written using a variety of different units. For example, lengths can be measured in metres, centimetres, inches, yards, miles, furlongs and chains, to mention just a few. Whilst measurements may all have different units and values they can, in theory be compared to one another. Extracting, normalizing and comparing measurements can be a useful IE process in many different domains. The Measurement Tagger (which can be found in the Tagger_Measurements plugin) attempts to provide such annotations for use within IE applications.
The Measurements Tagger uses a parser based upon a modified version of the Java port of the GNU Units package. This allows us to not only recognise and annotation spans of text as being a measurement but also to normalize the units to allow for easy comparison of different measurement values.
This PR actually produces two different annotations; Measurement and Ratio.
Measurement annotations represent measurements that involve a unit, e.g. 3mph, three pints, 4 m3. Single measurements (i.e. those not referring to a range or interval) are referred to as scalar measurements and have the following features:
- type: for scalar measurements is always scalar
- unit: the unit as recognised from the text. Note that this won’t necessarily be the annotated text. For example, an annotation spanning the text “three miles” would have a unit feature of “mile”.
- value: a Double holding the value of the measurement (this usually comes directly from the value feature of a Number annotation).
- dimension: the measurements dimension, e.g. speed, volume, area, length, time etc.
- normalizedUnit: to enable measurements of the same dimension but specified in different units to be compared the PR reduces all units to their base form. A base form usually consists of a combination of SI units. For example, centimetre, mm, and kilometre are all normalized to m (for metre).
- normalizedValue: a Double instance holding the normalized value, such that the combination of the normalized value and normalized unit represent the same measurement as the original value and unit.
- normalized: a String representing the normalized measurement (usually a simple space separated concatenation of the normalized value and unit).
Annotations which represent an interval or range have a slightly different set of features. The type feature is set to interval, there is no normalized or unit feature and the value features (included the normalized version) are replaced by the following features, the values of which are simply copied from the Measurement annotations which mark the boundaries of the interval.
- normalizedMinValue: a Double representing the minimum normalized number that forms part of the interval.
- normalizedMaxValue: a Double representing the minimum normalized number that forms part of the interval.
Interval annotations do not replace scalar measurements and so multiple Measurement annotations may well overlap. They can of course be distinguished by the type feature.
As well as Measurement annotations the tagger also adds Ratio annotations to documents. Ratio annotations cover measurements that do not have a unit. Percentages are the most common ratios to be found in documents, but also amounts such as “300 parts per million” are annotated.
A Ratio annotation has the following features:
- value: a Double holding the actual value of the ratio. For example, 20% will have a value of 0.2.
- numerator: the numerator of the ratio. For example, 20% will have a numerator of 20.
- denominator: the denominator of the ratio. For example, 20% will have a denominator of 100.
An instance of the measurements tagger is created using the following initialization parameters:
- commonURL: this file defines units that are also common words and so should not be annotated as a measurement unless they form a compound unit involving two or more unit symbols. For example, C is the accepted abbreviation for coulomb but often appears in documents as part of a reference to a table or figure, i.e. Figure 3C, which should not be annotated as a measurement. The default file was hand tuned over a large patent corpus but may need to be edited when used with different domains.
- encoding: the encoding to use when reading both of the configuration files, defaults to UTF-8.
- japeURL: the URL of the JAPE grammar that drives the measurement parser. Unless you really know what you are doing, the value of this parameter should not be changed.
- locale: the locale to use when parsing the units definition file, defaults to en_GB.
- unitsURL: the URL of the main unit definition file to use. This should be in the same format as accepted by the GNU Units package.
The PR does not attempt to recognise or annotate numbers, instead it relies on Number annotations being present in the document. Whilst these annotations could be generated by any resource executed prior to the measurements tagger, we recommend using the Numbers Tagger described in Section 21.5. If you choose to produce Number annotations in some other way note that they must have a value feature containing a Double representing the value of the number. An example GATE application, showing how to configure and use the two PRs together, is provided with the measurements plugin.
Once created an instance of the tagger can be configured using the following runtime parameters:
- consumeNumberAnnotations: if true then Number annotations used to find measurements will be consumed and removed from the document, defaults to true.
- failOnMissingInputAnnotations: if the input annotations (Tokens) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
- ignoredAnnotations: a list of annotation types in which a measurement can never occur, defaults to a set containing Date and Money.
- inputASName: the annotation set used as input to this PR.
- outputASName: the annotation set to which new annotations will be added.
The ability to prevent the tagger from annotating measurements which occur within other annotations is a very useful feature. The runtime parameters, however, only allow you to specify the names of annotations and not to restrict on feature values or any other information you may know about the documents being processed. Internally ignoring sections of a document is controlled by adding CannotBeAMeasurement annotations that span the text to be ignored. If you need greater control over the process than the ignoredAnnotations parameter allows then you can create CannotBeAMeasurement annotations prior to running the measurement tagger, for example a JAPE grammar placed before the tagger in the pipeline. Note that these annotations will be deleted by the measurements tagger once processing has completed.
21.7 Annotating and Normalizing Dates [#]
Many information extraction tasks benefit from or require the extraction of accurate date information. While ANNIE (Chapter 6) does produce Date annotations no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. The PR in the Tagger_DateNormalizer plugin attempts to fill this gap by normalizing dates against the date of the document (see below for details on how this is determined) in order to tie each Date annotation to a specific date. This includes normalizing dates such as April 1st, today, yesterday, and next Tuesday, as well as converting fully specified dates (ones in which the day, month and year are specified) into a common format.
Different cultures/countries have different conventions for writing dates, as well as different languages using different words for the days of the week and the months of the year. The parser underlying this PR makes use of the locale-specific information when parsing documents. When initializing an instance of the Date Normalizer you can specify the locale to use using ISO language and country codes along with Java specific variants (for details of these codes see the Java Locale documentation). So for example, to specify British English (which means the day usually comes before the month in a date) use en_GB, or for American English (where the month usually appears before the day in a date) specify en_US. If you need to override the locale on a document basis then you can do this by setting a document feature called locale to a string encoded as above. If neither the initialization parameter or document feature are present or do not represent a valid locale then the default locale of the JVM running GATE will be used.
Once initialized and added to a pipeline the Date Normalizer has the following runtime parameters that can be used to control it’s behaviour.
- annotationName: the annotation type created by this PR, defaults to Date.
- dateFormat: the format that dates should be normalized to. The format of this parameter is the same as that use by the Java SimpleDateFormat whose documentation describes the full range of possible formats (note you must use MM for month and not mm). This defaults to dd/MM/yyyy. Note that this parameter is only required if the numericOuput parameter is set to false.
- failOnMissingInputAnnotations: if the input annotations (Tokens) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
- inputASName: the annotation set used as input to this PR.
- normalizedDocumentFeature: if set then the normalized version of the document date will be stored in a document feature with this name. This parameter defaults to normalized-date although it can be left blank to suppress storage of the document date.
- numericOutput: if true then instead of formatting the normalized dates as String features of the Date annotations they are instead converted into a numeric representation. Specifically the first converted to the form yyyyMMdd and then cast to a Double. This is useful as dates can then be sorted numerical (which is fast) into order. If false then the formatting string in the dateFormat parameter is used instead to create a string representation. This defaults to false.
- outputASName: the annotation set to which new annotations will be added.
- sourceOfDocumentDate: this parameter is a list of the names of annotations, annotation features (encoded as Annotation.feature), and document features to inspect when trying to determine the date of the document. The PR works through the list getting the text of feature or under the annotation (if no feature is specified) and then parsing this to find a fully specified date, i.e. one where the day, month and year are all present. Once a date is found processing of the list stops and the date is used as the date of the document. If you specify an annotation that can occur multiple times in a document then they are sorted based on a numeric priority feature (which defaults to 0) or their order within the document. The idea here is that there are multiple ways in which to determine the date of a document but most are domain specific and this allows previous PRs in an application to determine the document date. This defaults to an empty list which is taken to assume that the document was written on the day it is being processed. The same assumption applies if no fully-specified date can be found once the whole list has been processed.
It is important to note that rather this plugin creates new Date annotations and so if you run it in the same pipeline as the ANNIE NE Transducer you will likely end up with overlapping Date annotations. Depending on your needs it may be that you need a JAPE grammar to delete ANNIE Date annotations before running this PR. In practice we have found that the Date annotations added by ANNIE can be a good source of document dates and so a JAPE grammar that uses ANNIE Dates to add new DocumentDate annotations and to delete other Date annotations can be a useful step before running this PR.
21.8 Snowball Based Stemmers [#]
The stemmer plugin, ‘Stemmer_Snowball’, consists of a set of stemmers PRs for the following 11 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. These take the form of wrappers for the Snowball stemmers freely available from http://snowball.tartarus.org. Each Token is annotated with a new feature ‘stem’, with the stem for that word as its value. The stemmers should be run as other PRs, on a document that has been tokenised.
There are three runtime parameters which should be set prior to executing the stemmer on a document.
- annotationType: This is the type of annotations that represent tokens in the document. Default value is set to ‘Token’.
- annotationFeature: This is the name of a feature that contains tokens’ strings. The stemmer uses value of this feature as a string to be stemmed. Default value is set to ‘string’.
- annotationSetName: This is where the stemmer expects the annotations of type as specified in the annotationType parameter to be.
21.8.1 Algorithms
The stemmers are based on the Porter stemmer for English [Porter 80], with rules implemented in Snowball e.g.
( [substring] among (
’sses’ (<-’ss’)
’ies’ (<-’i’)
’ss’ () ’s’ (delete)
)
21.9 GATE Morphological Analyzer [#]
The Morphological Analyser PR can be found in the Tools plugin. It takes as input a tokenized GATE document. Considering one token and its part of speech tag, one at a time, it identifies its lemma and an affix. These values are than added as features on the Token annotation. Morpher is based on certain regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE1 in a programming language called Flex. Morpher has a capability to interpret these rules with an extension of allowing users to add new rules or modify the existing ones based on their requirements. In order to allow these operations with as little effort as possible, we changed the way these rules are written. More information on how to write these rules is explained later in Section 21.9.1.
Two types of parameters, Init-time and run-time, are required to instantiate and execute the PR.
- rulesFile (Init-time) The rule file has several regular expression patterns. Each pattern has two parts, L.H.S. and R.H.S. L.H.S. defines the regular expression and R.H.S. the function name to be called when the pattern matches with the word under consideration. Please see 21.9.1 for more information on rule file.
- caseSensitive (init-time) By default, all tokens under consideration are converted into lowercase to identify their lemma and affix. If the user selects caseSensitive to be true, words are no longer converted into lowercase.
- document (run-time) Here the document must be an instance of a GATE document.
- affixFeatureName (run-time) Name of the feature that should hold the affix value.
- rootFeatureName (run-time) Name of the feature that should hold the root value.
- annotationSetName (run-time) Name of the annotationSet that contains Tokens.
- considerPOSTag (run-time) Each rule in the rule file has a separate tag, which specifies which rule to consider with what part-of-speech tag. If this option is set to false, all rules are considered and matched with all words. This option is very useful. For example if the word under consideration is "singing". "singing" can be used as a noun as well as a verb. In the case where it is identified as a verb, the lemma of the same would be "sing" and the affix "ing", but otherwise there would not be any affix.
- failOnMissingInputAnnotations (run-time) If set to true (the default) the PR will terminate with an Exception if none of the required input Annotations are found in a document. If set to false the PR will not terminate and instead log a single warning message per session and a debug message per document that has no input annotations.
21.9.1 Rule File [#]
GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections.
- Variables
- Rules
Variables
The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:
- Range With this type of variable, the user can specify the range of characters. e.g. A ==> [-a-z0-9]
- Set With this type of variable, user can also specify a set of characters, where one character at a time from this set is used as a value for the given variable. When this variable is used in any regular expression, all values are tried one by one to generate the string which is compared with the contents of the document. e.g. A ==> [abcdqurs09123]
- Strings Where in the two types explained above, variables can hold only one character from the given set or range at a time, this allows specifying strings as possibilities for the variable. e.g. A ==> ‘bb’ OR ‘cc’ OR ‘dd’
Rules
All rules are declared under the section defineRules. Every rule has two parts, LHS and RHS. The LHS specifies the regular expression and the RHS the function to be called when the LHS matches with the given word. ‘==>’ is used as delimiter between the LHS and RHS.
The LHS has the following syntax:
< ” ∗ ”|”verb”|”noun” >< regularexpression >.
User can specify which rule to be considered when the word is identified as ‘verb’ or ‘noun’. ‘*’ indicates that the rule should be considered for all part-of-speech tags. If the part-of-speech should be used to decide if the rule should be considered or not can be enabled or disabled by setting the value of considerPOSTags option. Combination of any string along with any of the variables declared under the defineVars section and also the Kleene operators, ‘+’ and ‘*’, can be used to generate the regular expressions. Below we give few examples of L.H.S. expressions.
- <verb>"bias"
- <verb>"canvas"{ESEDING} "ESEDING" is a variable defined under the defineVars section. Note: variables are enclosed with "{" and "}".
- <noun>({A}*"metre") "A" is a variable followed by the Kleene operator "*", which means "A" can occur zero or more times.
- <noun>({A}+"itis") "A" is a variable followed by the Kleene operator "+", which means "A" can occur one or more times.
- < ∗ >"aches" "< ∗ >" indicates that the rule should be considered for all part-of-speech tags.
On the RHS of the rule, the user has to specify one of the functions from those listed below. These rules are hard-coded in the Morph PR in GATE and are invoked if the regular expression on the LHS matches with any particular word.
- stem(n, string, affix) Here,
- n = number of characters to be truncated from the end of the string.
- string = the string that should be concatenated after the word to produce the root.
- affix = affix of the word
- irreg_stem(root, affix) Here,
- root = root of the word
- affix = affix of the word
- null_stem() This means words are themselves the base forms and should not be analyzed.
- semi_reg_stem(n,string) semir_reg_stem function is used with the regular expressions that end with any of the {EDING} or {ESEDING} variables defined under the variable section. If the regular expression matches with the given word, this function is invoked, which returns the value of variable (i.e. {EDING} or {ESEDING}) as an affix. To find a lemma of the word, it removes the n characters from the back of the word and adds the string at the end of the word.
21.10 Flexible Exporter [#]
The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be found, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.
At load time, the following parameters can be set for the flexible exporter:
- includeFeatures - if set to true, features are included with the annotations exported; if false (the default status), they are not.
- useSuffixForDumpFiles - if set to true (the default status), the output files have the suffix defined in suffixForDumpFiles; if false, no suffix is defined, and the output file simply overwrites the existing file (but see the outputFileUrl runtime parameter for an alternative).
- suffixForDumpFiles - this defines the suffix if useSuffixForDumpFiles is set to true. By default the suffix is .gate.
- useStandOffXML - if true then the format will be the GATE XML format that separates nodes and annotations inside the file which allows overlapping annotations to be saved.
The following runtime parameters can also be set (after the file has been selected for the application):
- annotationSetName - this enables the user to specify the name of the annotation set which contains the annotations to be exported. If no annotation set is defined, it will use the Default annotation set.
- annotationTypes - this contains a list of the annotations to be exported. By default it is set to Person, Location and Date.
- dumpTypes - this contains a list of names for the exported annotations. If the annotation name is to remain the same, this list should be identical to the list in annotationTypes. The list of annotation names must be in the same order as the corresponding annotation types in annotationTypes.
- outputDirectoryUrl - this enables the user to specify the export directory where the file is exported with its original name and an extension (provided as a parameter) appended at the end of filename. Note that you can also save a whole corpus in one go. If not provided, use the temporary directory.
21.11 Annotation Set Transfer [#]
The Annotation Set Transfer allows copying or moving annotations to a new annotation set if they lie between the beginning and the end of an annotation of a particular type (the covering annotation). For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.
The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:
- inputASName - this defines the annotation set from which annotations will be transferred (copied or moved). If nothing is specified, the Default annotation set will be used.
- outputASName - this defines the annotation set to which the annotations will be transferred. This default value for this parameter is ‘Filtered’. If it is left blank the Default annotation set will be used.
- tagASName - this defines the annotation set which contains the annotation covering the relevant part of the document to be transferred. This default value for this parameter is ‘Original markups’. If it is left blank the Default annotation set will be used.
- textTagName - this defines the type of the annotation covering the annotations to be transferred. The default value for this parameter is ‘BODY’. If this is left blank, then all annotations from the inputASName annotation set will be transferred. If more than one covering annotation is found, the annotation covered by each of them will be transferred. If no covering annotation is found, the processing depends on the copyAllUnlessFound parameter (see below).
- copyAnnotations - this specifies whether the annotations should be moved or copied. The default value false will move annotations, removing them from the inputASName annotation set. If set to true the annotations will be copied.
- transferAllUnlessFound - this specifies what should happen if no covering annotation is found. The default value is true. In this case, all annotations will be copied or moved (depending on the setting of parameter copyAnnotations) if no covering annotation is found. If set to false, no annotation will be copied or moved.
- annotationTypes - if annotation type names are specified for this list, only candidate annotations of those types will be transferred or copied. If an entry in this list is specified in the form OldTypeName=NewTypeName, then annotations of type OldTypeName will be selected for copying or transfer and renamed to NewTypeName in the output annotation set.
For example, suppose we wish to perform named entity recognition on only the text covered by the BODY annotation from the Original Markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document, because since these resources do not depend on any other annotations, we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokenise and gazetteer are initially placed in the Default annotation set).
- inputASName: Default
- outputASName: Filtered
- tagASName: Original markups
- textTagName: BODY
- copyAnnotations: true or false (depending on whether we want to keep the Token and Lookup annotations in the Default annotation set)
- copyAllUnlessFound: true
The AST PR makes a shallow copy of the feature map for each transferred annotation, i.e. it creates a new feature map containing the same keys and values as the original. It does not clone the feature values themselves, so if your annotations have a feature whose value is a collection and you need to make a deep copy of the collection value then you will not be able to use the AST PR to do this. Similarly if you are copying annotations and do in fact want to share the same feature map between the source and target annotations then the AST PR is not appropriate. In these sorts of cases a JAPE grammar or Groovy script would be a better choice.
21.12 Schema Enforcer [#]
One common use of the Annotation Set Transfer (AST) PR (see Section 21.11) is to create a ‘clean’ or final annotation set for a GATE application, i.e. an annotation set containing only those annotations which are required by the application without any temporary or intermediate annotations which may also have been created. Whilst really useful the AST suffers from two problems 1) it can be complex to configure and 2) it offers no support for modifying or removing features of the annotations it copies.
Many GATE applications are developed through a process which starts with experts manually annotating documents in order for the application developer to understand what is required and which can later be used for testing and evaluation. This is usually done using either GATE Teamware or within GATE Developer using the Schema Annotation Editor (Section 3.4.6). Either approach requires that each of the annotation types being created is described by an XML based Annotation Schema. The Schema Enforcer (part of the Schema_Tools plugin) uses these same schemas to create an annotation set, the contents of which, strictly matches the provided schemas.
The Schema Enforcer will copy an annotation if and only if....
- the type of the annotation matches one of the supplied schemas
- all required features are present and valid (i.e. meet the requirements for being copied to the ’clean’ annotation)
Each feature of an annotation is copied to the new annotation if and only if....
- the feature name matches a feature in the schema describing the annotation
- the value of the feature is of the same type as specified in the schema
- if the feature is defined, in the schema, as an enumerated type then the value must match one of the permitted values
The Schema Enforcer has no initialization parameters and is configured via the following runtime parameters:
- inputASName - - this defines the annotation set from which annotations will be copied. If nothing is specified, the default annotation set will be used.
- outputASName - this defines the annotation set to which the annotations will be transferred. This must be an empty or non-existent annotation set.
- schemas - a list of schemas that will be enforced when duplicating the input annotation set.
- useDefaults - if true then the default value for required features (specified using the value attribute in the XML schema) will be used to help complete an otherwise invalid annotation, defaults to false.
Whilst this PR makes the creation of a clean output set easy (given the schemas) it is worth noting that schemas can only define features which have basic types; string, integer, boolean, float, double, short, and byte. This means that you cannot define a feature which has an object as it’s value. For example, this prevents you defining a feature as a list of numbers. If this is an issue then it is trivial to write JAPE to copy extra features not specified in the schemas as the annotations have the same ID in both the input and output annotation sets. An example JAPE file for copying the matches feature created by the Orthomatcher PR (see Section 6.8) is provided.
21.13 Information Retrieval in GATE [#]
GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for ‘Bush’ will return documents with higher relevance, compared to a search in the content for the string ‘bush’. The current implementation is based on the most popular open source full-text search engine - Lucene (available at http://jakarta.apache.org/lucene/) but other implementations may be added in the future.
An Information Retrieval system is most often considered a system that accepts as input a set of documents (corpus) and a query (combination of search terms) and returns as input only those documents from the corpus which are considered as relevant according to the query. Usually, in addition to the documents, a proper relevance measure (score) is returned for each document. There exist many relevance metrics, but usually documents which are considered more relevant, according to the query, are scored higher.
Figure 21.3 shows the results from running a query against an indexed corpus in GATE.
term1 | term2 | ... | ... | termk | |
doc1 | w1,1 | w1,2 | ... | ... | w1,k |
doc2 | w2,1 | w2,1 | ... | ... | w2,k |
... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... |
docn | wn, 1 | wn,2 | ... | ... | wn,k |
Information Retrieval systems usually perform some preprocessing one the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as in Table 21.2, where doci is a document from the corpus, termj is a word that is considered as important and representative for the document and wi,j is the weight assigned to the term in the document. There are many ways to define the term weight functions, but most often it depends on the term frequency in the document and in the whole corpus (i.e. the local and the global frequency). Note that the machine learning plugin described in Chapter 18 can produce such document-term matrix (for detailed description of the matrix produced, see Section 18.2.4).
Note that not all of the words appearing in the document are considered terms. There are many words (called ‘stop-words’) which are ignored, since they are observed too often and are not representative enough. Such words are articles, conjunctions, etc. During the preprocessing phase which identifies such words, usually a form of stemming is performed in order to minimize the number of terms and to improve the retrieval recall. Various forms of the same word (e.g. ‘play’, ‘playing’ and ‘played’) are considered identical and multiple occurrences of the same term (probably ‘play’) will be observed.
It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.
IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 10.1 for more details).
21.13.1 Using the IR Functionality in GATE
In order to run queries against a corpus, the latter should be ‘indexed’. The indexing process first processes the documents in order to identify the terms and their weights (stemming is performed too) and then creates the proper structures on the local file system. These file structures contain indexes that will be used by Lucene (the underlying IR engine) for the retrieval.
Once the corpus is indexed, queries may be run against it. Subsequently the index may be removed and then the structures on the local file system are removed too. Once the index is removed, queries cannot be run against the corpus.
Indexing the Corpus
In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in a RDBMS datastores (though support for the latter may be added in the future).
To index the corpus, follow these steps:
- Select the corpus from the resource tree (top-left pane) and from the context menu (right button click) choose ‘Index Corpus’. A dialogue appears that allows you to specify the index properties.
- In the index properties dialogue, specify the underlying IR system to be used (only Lucene is supported at present), the directory that will contain the index structures, and the set of properties that will be indexed such as document features, content, etc (the same properties will be indexed for each document in the corpus).
- Once the corpus in indexed, you may start running queries against it. Note that the directory specified for the index data should exist and be empty. Otherwise an error will occur during the index creation.
Querying the Corpus
To query the corpus, follow these steps:
- Create a SearchPR processing resource. All the parameters of SearchPR are runtime so they are set later.
- Create a “pipeline” application (not a “corpus pipeline”) containing the SearchPR.
- Set the following SearchPR parameters:
- The corpus that will be queried.
- The query that will be executed.
- The maximum number of documents returned.
A query looks like the following:
{+/-}field1:term1 {+/-}field2:term2 ? {+/-}fieldN:termNwhere field is the name of a index field, such as the one specified at index creation (the document content field is body) and term is a term that should appear in the field.
For example the query:
+body:government +author:CNNwill inspect the document content for the term ‘government’ (together with variations such as ‘governments’ etc.) and the index field named ‘author’ for the term ‘CNN’. The ‘author’ field is specified at index creation time, and is either a document feature or another document property.
- After the SearchPR is initialized, running the application executes the specified query over the specified corpus.
- Finally, the results are displayed (see fig.1) after a double-click on the SearchPR processing resource.
Removing the Index
An index for a corpus may be removed at any time from the ‘Remove Index’ option of the context menu for the indexed corpus (right button click).
21.13.2 Using the IR API
The IR API within GATE Embedded makes it possible for corpora to be indexed, queried and results returned from any Java application, without using GATE Developer. The following sample indexes a corpus, runs a query against it and then removes the index.
2// open a serial datastore
3SerialDataStore sds =
4Factory.openDataStore("gate.persist.SerialDataStore",
5"/tmp/datastore1");
6sds.open();
7
8//set an AUTHOR feature for the test document
9Document doc0 = Factory.newDocument(new URL("/tmp/documents/doc0.html"));
10doc0.getFeatures().put("author","John Smith");
11
12Corpus corp0 = Factory.newCorpus("TestCorpus");
13corp0.add(doc0);
14
15//store the corpus in the serial datastore
16Corpus serialCorpus = (Corpus) sds.adopt(corp0,null);
17sds.sync(serialCorpus);
18
19//index the corpus − the content and the AUTHOR feature
20
21IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;
22
23DefaultIndexDefinition did = new DefaultIndexDefinition();
24did.setIrEngineClassName(
25 gate.creole.ir.lucene.LuceneIREngine.class.getName());
26did.setIndexLocation("/tmp/index1");
27did.addIndexField(new IndexField("content",
28 new DocumentContentReader(), false));
29did.addIndexField(new IndexField("author", null, false));
30indexedCorpus.setIndexDefinition(did);
31
32indexedCorpus.getIndexManager().createIndex();
33//the corpus is now indexed
34
35//search the corpus
36Search search = new LuceneSearch();
37search.setCorpus(ic);
38
39QueryResultList res = search.search("+content:government +author:John");
40
41//get the results
42Iterator it = res.getQueryResults();
43while (it.hasNext()) {
44QueryResult qr = (QueryResult) it.next();
45System.out.println("DOCUMENT_ID=" + qr.getDocumentID()
46 + ", score=" + qr.getScore());
47}
21.14 Websphinx Web Crawler [#]
The ‘Web_Crawler_Websphinx’ plugin enables GATE to build a corpus from a web crawl. It is based on Websphinx, a JAVA-based, customizable, multi-threaded web crawler.
Note: if you are using this plugin via an IDE, you may need to make sure that the websphinx.jar file is on the IDE’s classpath, or add it to the IDE’s lib directory.
The basic idea is to specify a source URL (or set of documents created from web URLs) and a depth and maximum number of documents to build the initial corpus upon which further processing could be done. The PR itself provides a number of other parameters to regulate the crawl.
This PR now uses the HTTP Content-Type headers to determine each web page’s encoding and MIME type before creating a GATE Document from it. It also adds to each document a Date feature (with a java.util.Date value) based on the HTTP Last-Modified header (if available) or the current timestamp, an originalMimeType feature taken from the Content-Type header, and an originalLength feature indicating the size in bytes of the downloaded document.
21.14.1 Using the Crawler PR
In order to use the processing resource you need to load the plugin using the plugin manager, create an instance of the crawl PR from the list of processing resources, and create a corpus in which to store crawled documents. In order to use the crawler, create a simple pipeline (not a corpus pipeline) and add the crawl PR to the pipeline.
Once the crawl PR is created there will be a number of parameters that can be set based on the PR required (see also Figure 21.5).
- depth
- The depth (integer) to which the crawl should proceed.
- dfs
- A boolean:
- true
- the crawler visits links with a depth-first strategy;
- false
- the crawler visits links with a breadth-first strategy;
- domain
- An enum value, presented as a pull-down list in the GUI:
- SUBTREE
- The crawler visits only the descendents of the pages specified as the roots for the crawl.
- WEB
- The crawler can visit any pages on the web.
- SERVER
- The crawler can visit only pages that are present on the server where the root pages are located.
- max
- The maximum number (integer) of pages to be kept: the crawler will stop when it has stored this number of documents in the output corpus. Use −1 to ignore this limit.
- maxPageSize
- The maximum page size in kB; pages over this limit will be ignored—even as roots of the crawl—and their links will not be crawled. If your crawl does not add any documents (even the seeds) to the output corpus, try increasing this value. (A 0 or negative value here means “no limit”.)
- stopAfter
- The maximum number (integer) of pages to be fetched: the crawler will stop when it has visited this number of pages. Use −1 to ignore this limit. If max > stopAfter > 0 then the crawl will store at most stopAfter (not max) documents.
- root
- A string containing one URL to start the crawl.
- source
- A corpus that contains the documents whose gate.sourceURL features will be used to start the crawl. If you use both root and source parameters, both the root value and the URLs collected from the source documents will seed the crawl.
- outputCorpus
- The corpus in which the fetched documents will be stored.
- keywords
- A List<String> for matching against crawled documents. If this list is empty or null, all documents fetched will be kept. Otherwise, only documents that contain one of these strings will be stored in the output corpus. (Documents that are fetched but not kept are still scanned for further links.)
- keywordsCaseSensitive
- This boolean determines whether keyword matching is case-sensitive or not.
- convertXmlTypes
- GATE’s XmlDocumentFormat only accepts certain MIME types. If this parameter is true, the crawl PR converts other XML types (such as application/atom+xml.xml) to text/xml before trying to instantiate the GATE document (this allows GATE to handle RSS feeds, for example).
- userAgent
- If this parameter is blank, the crawler will use the default Websphinx user-agent header. Set this parameter to spoof the header.
Once the parameters are set, the crawl can be run and the documents fetched (and matched to the keywords, if that list is in use) are added to the specified corpus. Documents that are fetched but not matched are discarded after scanning them for further links.
Note that you must use a simple Pipeline, and not a Corpus Pipeline. In order to process the corpus of crawled documents, you need to build a separate Corpus Pipeline and run it after crawling. You could combine the two functions by carefully developing a Scriptable Controller (see section 7.16.3 for details).
21.14.2 Proxy configuration
The underlying WebSPHINX crawler uses Java’s URLConnection class, which respects the JVM’s proxy configuration (if it is set). To configure a proxy for GATE Developer, edit or create the file build.properties with the following lines (the first line is required, and the rest should be changed as necessary for your configuration):
http.proxyHost=proxy.example.com
http.proxyPort=8080
http.nonProxyHosts=*.example.com
Save the file and restart GATE Developer and it should start using your configured proxy settings. The proxy server, port, and exceptions can also be set using the Java control panel, but GATE will use them only if run.java.net.useSystemProxies=true is set in the build.properties file. Consult the Oracle Java Networking and Proxies documentation2 for further details.
21.15 WordNet in GATE [#]
GATE currently supports versions 1.6 and newer of WordNet, so in order to use WordNet in GATE, you must first install a compatible version of WordNet on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing one special xml file that is used internally by JWNL. This file describes the location of your local copy of the WordNet index files. An example of this wn-config.xml file is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
<version publisher="Princeton" number="3.0" language="en"/>
<dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">
<param name="morphological_processor"
value="net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor">
<param name="operations">
<param value=
"net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
<param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
<param name="noun"
value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
<param name="verb"
value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
<param name="adjective"
value="|er=|est=|er=e|est=e|"/>
<param name="operations">
<param
value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
<param
value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
</param>
</param>
<param value="net.didion.jwnl.dictionary.morph.TokenizerOperation">
<param name="delimiters">
<param value=" "/>
<param value="-"/>
</param>
<param name="token_operations">
<param
value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
<param
value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
<param
value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
<param name="noun"
value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
<param name="verb"
value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
<param name="adjective" value="|er=|est=|er=e|est=e|"/>
<param name="operations">
<param value=
"net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
<param value=
"net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
</param>
</param>
</param>
</param>
</param>
</param>
<param name="dictionary_element_factory" value=
"net.didion.jwnl.princeton.data.PrincetonWN17FileDictionaryElementFactory"/>
<param name="file_manager" value=
"net.didion.jwnl.dictionary.file_manager.FileManagerImpl">
<param name="file_type" value=
"net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>
<param name="dictionary_path" value="/home/mark/WordNet-3.0/dict/"/>
</param>
</dictionary>
<resource class="PrincetonResource"/>
</jwnl_properties>
There are three things in this file which you need to configure based upon the version of WordNet you wish to use. Firstly change the number attribute of the version element to match the version of WordNet you are using. Then edit the value of the dictionary_path parameter to point to your local installation of WordNet (this is /usr/share/wordnet/ if you have installed the Ubuntu or Debian wordnet-base package.)
Finally, if you want to use version 1.6 of WordNet then you also need to alter the dictionary_element_factory to use net.didion.jwnl.princeton.data.PrincetonWN16FileDictionaryElementFactory. For full details of the format of the configuration file see the JWNL documentation at http://sourceforge.net/projects/jwordnet.
After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE Developer, load the WordNet plugin via the Plugin Management Console. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).
Once WordNet is loaded in GATE Developer, the well-known interface of WordNet will appear. You can search Word Net by typing a word in the box next to to the label ‘SearchWord” and then pressing ‘Search’. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word ‘play’, the buttons ‘Noun’, ‘Verb’ and ‘Adjective’ are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms, meronyms for nouns or verb groups, and cause for verbs, etc. Selecting an item from the menu will display the results in the window below.
To upgrade any existing GATE applications to use this improved WordNet plugin simply replace your existing configuration file with the example above and configure for WordNet 1.6. This will then give results identical to the previous version – unfortunately it was not possible to provide a transparent upgrade procedure.
More information about WordNet can be found at http://wordnet.princeton.edu/
More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet
An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/wiki/code-repository/index.html.
21.15.1 The WordNet API
GATE Embedded offers a set of classes that can be used to access the WordNet Lexical Database. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 21.8. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:
- WordNet: the main WordNet class. Provides methods for getting the synsets of a lemma, for accessing the unique beginners, etc.
- Word: offers access to the word’s lemma and senses
- WordSense: gives access to the synset, the word, POS and lexical relations.
- Synset: gives access to the word senses (synonyms) in the synset, the semantic relations, POS etc.
- Verb: gives access to the verb frames (not working properly at present)
- Adjective: gives access to the adj. position (attributive, predicative, etc.).
- Relation: abstract relation such as type, symbol, inverse relation, set of POS tags, etc. to which it is applicable.
- LexicalRelation
- SemanticRelation
- VerbFrame
21.16 Kea - Automatic Keyphrase Detection [#]
Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.
This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.
In order to use Kea in GATE Developer, the ‘Keyphrase_Extraction_Algorithm’ plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the ‘KEA Keyphrase Extractor’ (a processing resource) and the ‘KEA Corpus Importer’ (a visual resource associated with the PR).
21.16.1 Using the ‘KEA Keyphrase Extractor’ PR
Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the ‘KEA Corpus Importer’ tool. The usage of this tool is presented in a subsection below.
Once an annotated corpus is obtained, the ‘KEA Keyphrase Extractor’ PR can be used to build a model:
- load a ‘KEA Keyphrase Extractor’
- create a new ‘Corpus Pipeline’ controller.
- set the corpus for the controller
- set the ‘trainingMode’ parameter for the PR to ‘true’
- run the application.
After these steps, the Kea PR contains a trained model. This can be used immediately by switching the ‘trainingMode’ parameter to ‘false’ and running the PR over the documents that need to be annotated with keyphrases. Another possibility is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the ‘Save model’ option.
When a previously built model is available, the training procedure does not need to be repeated, the existing model can be loaded in memory by selecting the ‘Load model’ option in the PR’s context menu.
The Kea PR uses several parameters as seen in Figure 21.9:
- document
- The document to be processed.
- inputAS
- The input annotation set. This parameter is only relevant when the PR is running in training mode and it specifies the annotation set containing the keyphrase annotations.
- outputAS
- The output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the ‘trainingMode’ parameter is set to false) and it specifies the annotation set where the generated keyphrase annotations will be saved.
- minPhraseLength
- the minimum length (in number of words) for a keyphrase.
- minNumOccur
- the minimum number of occurrences of a phrase for it to be a keyphrase.
- maxPhraseLength
- the maximum length of a keyphrase.
- phrasesToExtract
- how many different keyphrases should be generated.
- keyphraseAnnotationType
- the type of annotations used for keyphrases.
- dissallowInternalPeriods
- should internal periods be disallowed.
- trainingMode
- if ‘true’ the PR is running in training mode; otherwise it is running in application mode.
- useKFrequency
- should the K-frequency be used.
21.16.2 Using Kea Corpora
The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the ‘KEA Corpus Importer’ tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR’s name in the resources tree and then selecting the ‘KEA Corpus Importer’ tab, see Figure 21.10.
The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.
The user needs to specify a few values:
- Source Directory
- the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.
- Extension for text files
- the extension used for text fields (by default .txt).
- Extension for keyphrase files
- the extension for the files listing keyphrases.
- Encoding for input files
- the encoding to be used when reading the files.
- Corpus name
- the name for the GATE corpus that will be created.
- Output annotation set
- the name for the annotation set that will contain the keyphrases read from the input files.
- Keyphrase annotation type
- the type for the generated annotations.
21.17 Annotation Merging Plugin [#]
If we have annotations about the same subject on the same document from different annotators, we may need to merge the annotations.
This plugin implements two approaches for annotation merging.
MajorityVoting takes a parameter numMinK and selects the annotation on which at least numMinK annotators agree. If two or more merged annotations have the same span, then the annotation with the most supporters is kept and other annotations with the same span are discarded.
MergingByAnnotatorNum selects one annotation from those annotations with the same span, which the majority of the annotators support. Note that if one annotator did not create the annotation with the particular span, we count it as one non-support of the annotation with the span. If it turns out that the majority of the annotators did not support the annotation with that span, then no annotation with the span would be put into the merged annotations.
The annotation merging methods are available via the Annotation Merging plugin. The plugin can be used as a PR in a pipeline or corpus pipeline. To use the PR, each document in the pipeline or the corpus pipeline should have the annotation sets for merging. The annotation merging PR has no loading parameters but has several run-time parameters, explained further below.
The annotation merging methods are implemented in the GATE API, and are available in GATE Embedded as described in Section 7.18.
- annSetOutput: the annotation set in the current document for storing the merged annotations. You should not use an existing annotation set, as the contents may be deleted or overwritten.
- annSetsForMerging: the annotation sets in the document for merging. It is an optional parameter. If it is not assigned with any value, the annotation sets for merging would be all the annotation sets in the document except the default annotation set. If specified, it is a sequence of the names of the annotation sets for merging, separated by ‘;’. For example, the value ‘a-1;a-2;a-3’ represents three annotation set, ‘a-1’, ‘a-2’ and ‘a-3’.
- annTypeAndFeats: the annotation types in the annotation set for merging. It is an optional parameter. It specifies the annotation types in the annotation sets for merging. For each type specified, it may also specify an annotation feature of the type. The parameter is a sequence of names of annotation types, separated by ‘;’. A single annotation feature can be specified immediately following the annotation type’s name, separated by ‘->’ in the sequence. For example, the value ‘SENT->senRel;OPINION_OPR;OPINION_SRC->type’ specifies three annotation types, ‘SENT’, ‘OPINION_OPR’ and ‘OPINION_SRC’ and specifies the annotation feature ‘senRel’ and ‘type’ for the two types SENT and OPINION_SRC, respectively but does not specify any feature for the type OPINION_OPR. If the annTypeAndFeats parameter is not set, the annotation types for merging are all the types in the annotation sets for merging, and no annotation feature for each type is specified.
- keepSourceForMergedAnnotations: should source annotations be kept in the annSetsForMerging annotation sets when merged? True by default.
- mergingMethod: specifies the method used for merging. Possible values are MajorityVoting and MergingByAnnotatorNum, referring to the two merging methods described above, respectively.
- minimalAnnNum: specifies the minimal number of annotators who agree on one annotation in order to put the annotation into merged set, which is needed by the merging method MergingByAnnotatorNum. If the value of the parameter is smaller than 1, the parameter is taken as 1. If the value is bigger than total number of annotation sets for merging, it is taken to be total number of annotation sets. If no value is assigned, a default value of 1 is used. Note that the parameter does not have any effect on the other merging method MajorityVoting.
21.18 Copying Annotations between Documents [#]
Sometimes a document has two copies, each of which was annotated by different annotators for the same task. We may want to copy the annotations in one copy to the other copy of the document. This could be in order to use less resources, or so that we can process them with some other plugin, such as annotation merging or IAA. The Copy_Annots_Between_Docs plugin does exactly this.
The plugin is available with the GATE distribution. When loading the plugin into GATE, it is represented as a processing resource, Copy Anns to Another Doc PR. You need to put the PR into a Corpus Pipeline to use it. The plugin does not have any initialisation parameters. It has several run-time parameters, which specify the annotations to be copied, the source documents and target documents. In detail, the run-time parameters are:
- sourceFilesURL specifies a directory in which the source documents are in. The source documents must be GATE xml documents. The plugin copies the annotations from these source documents to target documents.
- inputASName specifies the name of the annotation set in the source documents. Whole annotations or parts of annotations in the annotation set will be copied.
- annotationTypes specifies one or more annotation types in the annotation set inputASName which will be copied into target documents. If no value is given, the plugin will copy all annotations in the annotation set.
- outputASName specifies the name of the annotation set in the target documents, into which the annotations will be copied. If there is no such annotation set in the target documents, the annotation set will be created automatically.
The Corpus parameter of the Corpus Pipeline application containing the plugin specifies a corpus which contains the target documents. Given one (target) document in the corpus, the plugin tries to find a source document in the source directory specified by the parameter sourceFilesURL, according to the similarity of the names of the source and target documents. The similarity of two file names is calculated by comparing the two strings of names from the start to the end of the strings. Two names have greater similarity if they share more characters from the beginning of the strings. For example, suppose two target documents have the names aabcc.xml and abcab.xml and three source files have names abacc.xml, abcbb.xml and aacc.xml, respectively. Then the target document aabcc.xml has the corresponding source document aacc.xml, and abcab.xml has the corresponding source document abcbb.xml.
21.19 OpenCalais Plugin [#]
OpenCalais provides a web service for semantic annotation of text. The user submits a document to the web service, which returns entity and relations annotations in RDF, JSON or some other format. Typically, users integrate OpenCalais annotation of their web pages to provide additional links and ‘semantic functionality’. OpenCalais can be found at http://www.opencalais.com
The GATE OpenCalais PR submits a GATE document to the OpenCalais web service, and adds the annotations from the OpenCalais response as GATE annotations in the GATE document. It therefore provides OpenCalais semantic annotation functionality within GATE, for use by other PRs.
The PR only supports OpenCalais entities, not relations - although this should be straightforward for a competent Java programmer to add. Each OpenCalais entity is represented in GATE as an OpenCalais annotation, with features as given in the OpenCalais documentation.
The PR can be loaded with the CREOLE plugin manager dialog, from the creole directory in the gate distribution, gate/plugins/Tagger_OpenCalais. In order to use the PR, you will need to have an OpenCalais account, and request an OpenCalais service key. You can do this from the OpenCalais web site at http://www.opencalais.com. Provide your service key as an initialisation parameter when you create a new OpenCalais PR in GATE. OpenCalais make restrictions on the the number of requests you can make to their web service. See the OpenCalais web page for details.
Initialisation parameters are:
- openCalaisURL This is the URL of the OpenCalais REST service, and should not need to be changed - unless OpenCalais moves it!
- licenseID Your OpenCalais service key. This has to be requested from OpenCalais and is specific to you.
Various runtime parameters are available from the OpenCalais API, and are named the same as in that API. See the OpenCalais documentation for further details.
21.20 LingPipe Plugin [#]
LingPipe is a suite of Java libraries for the linguistic analysis of human language3. We have provided a plugin called ‘LingPipe’ with wrappers for some of the resources available in the LingPipe library. In order to use these resources, please load the ‘LingPipe’ plugin. Currently, we have integrated the following five processing resources.
- LingPipe Tokenizer PR
- LingPipe Sentence Splitter PR
- LingPipe POS Tagger PR
- LingPipe NER PR
- LingPipe Language Identifier PR
Please note that most of the resources in the LingPipe library allow learning of new models. However, in this version of the GATE plugin for LingPipe, we have only integrated the application functionality. You will need to learn new models with Lingpipe outside of GATE. We have provided some example models under the ‘resources’ folder which were downloaded from LingPipe’s website. For more information on licensing issues related to the use of these models, please refer to the licensing terms under the LingPipe plugin directory.
The LingPipe system can be loaded from the GATE GUI by simply selecting the ‘Load LingPipe System’ menu item under the ‘File’ menu. This is similar to loading the ANNIE application with default values.
21.20.1 LingPipe Tokenizer PR [#]
As the name suggests this PR tokenizes document text and identifies the boundaries of tokens. Each token is annotated with an annotation of type ‘Token’. Every annotation has a feature called ‘length’ that gives a length of the word in number of characters. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Token annotations.
21.20.2 LingPipe Sentence Splitter PR [#]
As the name suggests, this PR splits document text in sentences. It identifies sentence boundaries and annotates each sentence with an annotation of type ‘Sentence’. There are no initialization parameters for this PR. The user needs to provide name of the annotation set where the PR should output Sentence annotations.
21.20.3 LingPipe POS Tagger PR [#]
The LingPipe POS Tagger PR is useful for tagging individual tokens with their respective part of speech tags. Each document must already have been processed with a tokenizer and a sentence splitter (any kinds in GATE, not necessarily the LingPipe ones) since this PR has Token and Sentence annotations as prerequisites. This PR adds a category feature to each token.
This PR requires a model (dataset from training the tagger on a tagged corpus), which must be provided as an initialization parameter. Several models are included in this plugin’s resources directory. Additional models can be downloaded from the LingPipe website4 or trained according to LingPipe’s instructions5.
Two models for Bulgarian are now available in GATE: bulgarian-full.model and bulgarian-simplified.model, trained on a transformed version of the BulTreeBank-DP [Osenova & Simov 04, Simov & Osenova 03, Simov et al. 02, Simov et al. 04a]. The full model uses the complete tagset [Simov et al. 04b] whereas the simplified model uses tags truncated before any hyphens (for example, Pca–p, Pca–s-f, Pca–s-m, Pca–s-n, and Pce-as-m are all merged to Pca) to improve performance. This reduces the set from 573 to 249 tags and saves memory.
This PR has the following run-time parameters.
- inputASName
- The name of the annotation set with Token and Sentence annotations.
- applicationMode
- The POS tagger can be applied on the text in three different modes.
- FIRSTBEST
- The tagger produces one tag for each token (the one that it calculates is best) and stores it as a simple String in the category feature.
- CONFIDENCE
- The tagger produces the best five tags for each token, with confidence scores, and stores them as a Map<String, Double> in the category feature. This application mode requires more memory than the others.
- NBEST
- The tagger produces the five best taggings for the whole document and then stores one to five tags for each token (with document-based scores) as a Map<String, List<Double» in the category feature. This application mode is noticeably slower than the others.
21.20.4 LingPipe NER PR [#]
The LingPipe NER PR is used for named entity recognition. The PR recognizes entities such as Persons, Organizations and Locations in the text. This PR requires a model which it then uses to classify text as different entity types. An example model is provided under the ‘resources’ folder of this plugin. It must be provided at initialization time. Similar to other PRs, this PR expects users to provide name of the annotation set where the PR should output annotations.
21.20.5 LingPipe Language Identifier PR [#]
As the name suggests, this PR is useful for identifying the language of a document or span of text. This PR uses a model file to identify the language of a text. A model is provided in this plugin’s resources/models subdirectory and as the default value of this required initialization parameter. The PR has the following runtime parameters.
- annotationType
- If this is supplied, the PR classifies the text underlying each annotation of the specified type and stores the result as a feature on that annotation. If this is left blank (null or empty), the PR classifies the text of each document and stores the result as a document feature.
- annotationSetName
- The annotation set used for input and output; ignored if annotationType is blank.
- languageIdFeatureName
- The name of the document or annotation feature used to store the results.
Unlike most other PRs (which produce annotations), this one adds either document features or annotation features. (To classify both whole documents and spans within them, use two instances of this PR.) Note that classification accuracy is better over long spans of text (paragraphs rather than sentences, for example). More information on the languages supported can be found in the LingPipe documentation.
21.21 OpenNLP Plugin [#]
OpenNLP provides java-based tools for sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference. See the OpenNLP website for details: http://opennlp.sourceforge.net/. The tools use the Maxent machine learning package. See http://maxent.sourceforge.net/ for details.
In order to use these tools as GATE processing resources, load the ‘OpenNLP’ plugin via the Plugin Management Console. Alternatively, the OpenNLP system can be loaded from the GATE GUI by simply selecting the ‘Load OpenNLP System’ menu item under the ‘File’ menu. The OpenNLP PRs will be loaded, together with a pre-configured corpus pipeline application containing these PRs. This is similar to loading the ANNIE application with default values.
We have integrated the following five processing resources:
- OpenNlpTokenizer Tokenizer PR
- OpenNlpSentenceSplit Sentence Splitter PR
- OpenNlpPOS POS Tagger PR
- OpenNlpNameFinder NER PR
- OpenNlpChunker Chunker PR
Note that in general, these PRs can be mixed with other PRs of similar types. For example, you could create a pipeline that uses the OpenNLP Tokenizer, and the ANNIE POS Tagger. You may occasionally have problems with some combinations. Notes on compatibility and PR prerequisites are given for each PR in Section 21.21.2.
Note also that some of the OpenNLP tools use quite large machine learning models, which the PRs need to load into memory. You may find that you have to give additional memory to GATE in order to use the OpenNLP PRs comfortably. See the FAQ on the GATE Wiki for an example of how to do this.
Below, we describe the parameters common to all of the OpenNLP PRs. This is followed by a section which gives brief details of each PR. For more details on each, see the OpenNLP website, http://opennlp.sourceforge.net/.
21.21.1 Parameters common to all PRs [#]
Load-time Parameters [#]
All OpenNLP PRs have a ‘model’ parameter, which takes a URL. The URL should reference a valid Maxent model, or in the case of the Name Finder a directory containing a set of models for the different types of name sought. Default models can be found in the ‘models/english’ directory. In addition, the OpenNlpPOS POS Tagger PR has a ‘dictionary’ parameter, which also takes a URL, and a ‘dictionaryEncoding’ parameter giving the character encoding of the dictionary file. The default can be found in the ‘models/english’ directory.
For details of training new models (outside of the GATE framework), see Section 21.21.3
Run-time Parameters [#]
The OpenNLP PRs have runtime parameters to specify the annotation sets they should use for input and/or output. These are detailed below in the description of each PR, but all PRs will use the default unnamed annotation set unless told otherwise.
21.21.2 OpenNLP PRs [#]
OpenNlpTokenizer - Tokenizer PR [#]
This PR adds Token annotations to the annotation set specified by the annotationSetName parameter.
This PR does not require any other PR to be run beforehand. It creates annotations of type Token, with a feature and value ‘source=openNLP’ and a string feature that takes the underlying string as its value.
OpenNlpSentenceSplit - Sentence Splitter PR [#]
This PR adds Sentence annotations to the annotation set specified by the annotationSetName parameter.
This PR does not require any other PR to be run beforehand. It creates annotations of type Sentence, with a feature and value ‘source=openNLP’. If the OpenNLP sentence splitter returns no output (i.e. fails to find any sentences), then GATE will add a single sentence annotation from offset 0 to the end of the document. This is to prevent downstream PRs that require sentence annotations from failing.
OpenNlpPOS - POS Tagger PR [#]
This PR adds a feature for Part Of Speech to Token annotations.
This PR requires Sentence and Token annotations to be present in the annotation set specified by its annotationSetName parameter before it will work. These Sentence and Token annotations do not have to be from another OpenNLP PR. They could, for example, be from the ANNIE PRs. This PR adds a ‘category’ feature to each Token, with the predicted Part Of Speech as value.
OpenNlpNameFinder - NER PR [#]
This PR finds standard Named Entities, adding them as annotations named for each named Entity type.
This PR requires Sentence and Token annotations to be present in the annotation set specified by its inputASName parameter before it will work. These Sentence and Token annotations do not have to be from another OpenNLP PR. They could, for example, be from the ANNIE PRs. You may find, however, that not all pairings of Tokenizer and Sentence Splitter will work successfully.
The Token annotations do not need to have a ‘category’ feature. In other words, you do not need to run a POS Tagger before using this PR.
This PR creates annotations for each named entity in the annotation set specified by the outputASName parameter, with a feature and value ‘source=openNLP’. It also adds a feature of ‘type’ to each annotation, with the same values as the annotation type. Types include:
- person
- organization
- location
- date
- time
- percentage
For full details of all types, see the OpenNLP website, http://opennlp.sourceforge.net/.
OpenNlpChunker - Chunker PR [#]
This PR finds noun, verb, and other chunks, adding their position as features Token annotations.
This PR requires Sentence and Token annotations to be present in the annotation set specified by the annotationSetName parameter before it will work. The Token annotations need to have a ‘category’ feature. In other words, you also need to run a POS Tagger before using this PR. The Sentence and Token annotations (and ‘category’ POS tag features) do not have to be from another OpenNLP PR. They could, for example, be from the ANNIE PRs.
This PR creates features of type ‘chunk’ for each token. The value of this feature define whether the token is at the beginning of a chunk, inside a chunk, or outside a chunk, using the standard BIO model. Example values and their interpretations are:
- B-NP Token is the first (beginning) token of a Noun Phrase
- I-NP Token is inside a Noun Phrase
- B-VP Token is the first (beginning) token of a Verb Phrase
- I-VP Token is inside a Verb Phrase
- O Token is outside any phrase
- B-PP Token is the first (beginning) token of a Prepositional Phrase
- B-ADVP Token is the first (beginning) token of an Adverbial Phrase
For full details of all chunk values, see the OpenNLP website, http://opennlp.sourceforge.net/.
21.21.3 Training new models [#]
Within the OpenNLP framework, new models can be trained for each of the tools. By default, the GATE PRs use the standard Maxent models for English which can be found in the plugin’s ‘models/english’ directory. The models are copies of those in the "models" module in the OpenNLP CVS repository at http://opennlp.cvs.sourceforge.net. If you need to train a different model for an OpenNLP PR, you will have to do this outside of GATE, and then use the file URL of your new model as a value for the ‘model’ parameter of the PR. For details on how to train models, see the Maxent website http://maxent.sourceforge.net/.
21.22 Content Detection Using Boilerpipe [#]
When working in a closed domain it is often possible to craft a few JAPE rules to separate real document content from the boilerplate headers, footers, menus, etc. that often appear, especially when dealing with web documents. As the number of document sources increases, however, it becomes difficult to separate content from boilerplate using hand crafted rules and a more general approach is required.
The ‘Tagger_Boilerpipe’ plugin contains a PR that can be used to apply the boilerpipe library (see http://code.google.com/p/boilerpipe/) to GATE documents in order to annotate the content sections. The boilerpipe library is based upon work reported in [Kohlschütter et al. 10], although it has seen a number of improvements since then. Due to the way in which the library works not all features are currently available through the GATE PR.
The PR is configured using the following runtime parameters:
- allContent: this parameter defines how the mime type parameter should be interpreted and if documents should, instead of being processed, by assumed to contain nothing but actual content. defaults to ‘If Mime Type is NOT Listed’ which means that any document with a mime type not listed is assumed to be all content.
- annotateBoilerplate: should we annotate the boilerplate sections of the document, defaults to false.
- annotateContent: should we annotate the main content of the document, defaults to true.
- boilerplateAnnotationName: the name of the annotation type to annotate sections determined to be boilerplate, defaults to ‘Boilerplate’. Whilst this parameter is optional it must be specified if annotateBoilerplate is set to true.
- contentAnnotationName: the name of the annotation type to annotate sections determined to be content, defaults to ‘Content’. Whilst this parameter is optional it must be specified if annotateContent is set to true.
- debug: if true then annotations created by the PR will contain debugging info, defaults to false.
- extractor: specifies the boilerpipe extractor to use, defaults to the default extractor.
- failOnMissingInputAnnotations: if the input annotations (Tokens) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
- inputASName: the name of the input annotation set
- mimeTypes: a set of mime types that control document processing, defaults to text/html. The exact behaviour of the PR is dependent upon both this parameter and the value of the allContent parameter.
- ouputASName: the name of the output annotation set
- useHintsFromOriginalMarkups: often the original markups will provide hints that may be useful for correctly identifying the main content of the document. If true, useful markup (currently the title, body, and anchor tags) will be used by the PR to help detect content, defaults to true.
21.23 Inter Annotator Agreement
The IAA plugin, “Inter_Annotator_Agreement”, computes interannotator agreement measures for various tasks. For named entity annotations, it computes the F-measures, namely Precision, Recall and F1, for two or more annotation sets. For text classification tasks, it computes Cohen’s kappa and some other IAA measures which are more suitable than the F-measures for the task. This plugin is fully documented in Section 10.5. Chapter 10 introduces various measures of interannotator agreement and describes a range of tools provided in GATE for calculating them.
21.24 Schema Annotation Editor
The plugin ‘Schema_Annotation_Editor’ constrains the annotation editor to permitted types. See Section 3.4.6 for more information.
1Java string escape sequences such as \t will be decoded before the template is expanded.
2see http://docs.oracle.com/javase/6/docs/technotes/guides/net/proxies.html
3see http://alias-i.com/lingpipe/
4http://alias-i.com/lingpipe/web/models.html
5http://alias-i.com/lingpipe/demos/tutorial/posTags/read-me.html