Appendix A
Change Log
This chapter lists major changes to GATE (currently Developer and Embedded only) in roughly chronological order by release. Changes in the documentation are also referenced here.
It was brought to our attention that in versions 9.0.1 and below there was a very small chance that the GUI action “Export for GATE Cloud” could be compromised. This would have required malicious code to be running locally on the machine, either started by another user on a multi-user machine or present because the computer had already been compromised. This issue only occurred within the GUI action and did not affect API use of the gate-core Maven artifact. Note that no known exploits exist for this issue, and we do not know for certain that the code could be exploited. If, however, you are at all concerned then we suggest you regenerate any packaged applications using a recent version of GATE Developer: at minimum, 9.2-SNAPSHOT built on or after the 10th of August 2022.
A.1 Version 9.0.1 (March 2021)
GATE Developer 9.0.1 is a bugfix release – the only change is to the way URL redirects are handled when loading a document. Support for following redirects from http to https was added in 9.0 which, while correct, broke the way URLs were used within GCP. This release fixes that bug and adds some additional security checking to the redirect handling.
A.2 Version 9.0 (February 2021)
Whilst the majority of changes in GATE Developer 9.0 are small, a number of them change default behaviour (in the UI or API), hence the change in version number. These changes include:
-
We now recommend users install a 64-bit version of Java whenever possible. This seems to be especially important on Windows.
-
We now default to assuming documents are UTF-8 encoded unless you specify otherwise. In previous versions if no encoding was specified GATE would use the default platform encoding, but this seemed to cause more problems than it solved (especially for Windows users). If you want the old behaviour then ensure the encoding parameter is set to the empty string when creating a document.
-
GATE uses a library called XStream for saving and loading GATE XML documents and applications. This allows us to store features of any Java type, but that can be abused by maliciously crafted files. In general use this is unlikely to be a problem, but in situations where GATE may be used as part of a service with no way of vetting input files it could present a serious security threat. XStream now offers a security framework to restrict the types of objects that can be loaded/saved. This can work either by allowing only specific types or by preventing specific types from being used. As we often do not know in advance what features might be used we have opted to use a minimal blacklist as the default security setting. This blocks the Java classes known to be exploitable. This can be further configured via calls to Gate.setXStreamSecurity() and we strongly encourage developers who depend on gate-core within larger applications to configure this based on their specific use cases.
-
Developers wishing to build GATE from source need to use Maven v3.6.0 or above.
-
Previous versions of GATE used Log4J for some of the logging. This was problematic when using gate-core as a dependency in larger projects and was awkward to configure properly. In this release we’ve switched to using SLF4J allowing the actual logging back-end to be configured independently. Plugins and code compiled against previous versions of GATE should work with the new release without change (we include the log4j-over-slf4j bridge as a dependency), although Log4J specific methods within gate-core have been deprecated and may be removed in a future release.
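The change to a UTF-8 default described above matters because the same bytes decode to different text under different charsets. The following is a minimal sketch using only the Java standard library. It does not call any GATE API, and the class and method names are purely illustrative. It decodes one UTF-8 byte sequence first as UTF-8 and then as ISO-8859-1, standing in for a non-UTF-8 platform default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    /** Decodes raw bytes using an explicitly chosen charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" written with a Unicode escape so the source file stays ASCII.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The two-byte UTF-8 sequence for U+00E9 becomes two separate
        // characters when the bytes are misread as ISO-8859-1.
        System.out.println(asUtf8.length());    // 4
        System.out.println(asLatin1.length());  // 5

        // The platform default that older GATE versions would have fallen back to.
        System.out.println(Charset.defaultCharset());
    }
}
```

The length difference (four characters against five) is the kind of silent corruption that an unspecified encoding can cause when the platform default does not match the file.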
Many bugs have been fixed and documentation improved, in particular:
-
the Twitter plugin has been improved to make better use of the information provided by Twitter within a JSON Tweet object. The Hashtag tokenizer has been updated to provide a tokenized feature to make grouping semantically similar hashtags easier. Lots of other minor improvements and efficiency changes have been made throughout the rest of the TwitIE pipelines.
-
the ANNIE gazetteers have been updated to better support different ways of referring to countries and to provide a blacklist option to prevent things being wrongly annotated.
-
a new addition to the JAPE syntax allows you to copy all features from a matched annotation to the new annotation being created.
-
the Format_CSV plugin now allows the document cell to be interpreted as being a URL pointing to the document to load rather than the contents of the document. See Section 23.33 for more details.
A.3 Version 8.6.1 (January 2020)
GATE Developer 8.6.1 is a bugfix release – the only change is to adjust for the fact that the Central Maven repository has been switched from http to https.
A.4 Version 8.6 (June 2019)
GATE Developer 8.6 is mainly a maintenance and stability release, but there are some important new features, in particular around the processing of Twitter data:
-
The Format_Twitter plugin can now correctly handle extended 280 character tweets and the latest Twitter JSON format. See Section 17.2 for full details.
-
The new Format_JSON plugin provides import/export support for GATE JSON. This is essentially the old style Twitter format, but it no longer needs to track changes to the Twitter JSON format so should be more suitable for long term storage of GATE documents as JSON files. See Section 23.30 for more details. This plugin makes use of a new mechanism whereby document format parsers can take parameters via the document MIME type, which may be useful to third party formats too.
Many bugs have been fixed and documentation improved, in particular:
-
The plugin loading mechanism now properly respects the user’s Maven settings.xml:
-
HTTP proxy and “mirror” repository settings now work properly, including authentication. Also plugin resolution will now use the system proxy (if there is one) by default if there is no proxy specified in the Maven settings.
-
The “offline” setting is respected, and will prevent GATE from trying to fetch plugins from remote repositories altogether – for this to work, all the plugins you want to use must already be cached locally, or you can use “Export for GATE Cloud” to make a self-contained copy of an application including all its plugins.
-
Upgraded many dependencies including Tika and Jackson to avoid known security bugs in the previous versions.
-
Documentation improvements for the Kea plugin, the Corpus QA and annotation diff tools, and the default GATE XML and inline XML formats (section 3.9.1)
-
For plugin developers, the standard plugin testing framework generates a report detailing all the plugin-to-plugin dependencies, including those that are only expressed in the plugin’s example saved applications (section 7.12.1).
Some obsolete plugins have been removed (Websphinx web crawler, which depends on an unmaintained library, and the RASP parser, whose external binary is no longer available for modern operating systems), and there are many smaller bug fixes and improvements.
Note: following changes to Oracle’s JDK licensing scheme, we now recommend running GATE using the freely-available OpenJDK. The AdoptOpenJDK project offers simple installers for all major platforms, and major Linux distributions such as Ubuntu and CentOS offer OpenJDK packages as standard. See section 2.2 for full installation instructions.
A.5 Version 8.5.1 (June 2018)
Version 8.5.1 is a minor release to fix a few critical bugs in 8.5:
-
Fixed an exception that prevented the ANNIC search GUI from opening.
-
Fixed a problem with “Export for GATE Cloud” that meant some resources were not getting included in the output ZIP file.
-
Fixed the XML schema in the gate-spring library.
A.6 Version 8.5 (May 2018)
GATE Developer and Embedded 8.5 introduces a number of significant internal changes to the way plugins are managed, but with the exception of the plugin manager most users will not see significant changes in the way they use GATE.
-
The GATE plugins are no longer bundled with the GATE Developer distribution, instead each plugin is downloaded from a repository at runtime, the first time it is used. This means the distribution is much smaller than previous versions.
-
Most plugins are now distributed as a single JAR file through the Java-standard “Central Repository”, and resource files such as gazetteers and JAPE grammars are bundled inside the plugin JAR rather than being separate files on disk. If you want to modify the resources of a plugin then GATE provides a tool to extract an editable copy of the files from a plugin onto your disk – it is no longer possible to edit plugin grammars in place.
-
This makes dependencies between plugins much easier to manage – a plugin can specify its dependencies declaratively by name and version number rather than by fragile relative paths between plugin directories.
GATE 8.5 remains backwards compatible with existing third-party plugins, though we encourage you to convert your plugins to the new style where possible.
Further details on these changes can be found in sections 3.5 (the plugin manager in GATE Developer), 7.3 (loading plugins via the GATE Embedded API), 7.12 (creating a new plugin from scratch), and 7.20 (converting an existing plugin to the new style).
If you have an existing saved application from GATE version 8.4.1 or earlier it will be necessary to “upgrade” it to use the new core plugins. An upgrade tool is provided on the “Tools” menu of GATE Developer, and is described in Section 3.9.5.
A.6.1 For developers
As part of this release, GATE development has moved from SourceForge to GitHub – bug reports, patches and feature requests should now use the GitHub issue tracker as described in section 12.1.
A.7 Version 8.4.1 (June 2017)
This is a minor release that fixes one rarely encountered but serious bug with the handling of CDATA sections within the text content of GATE XML format documents. CDATA has always been handled correctly in annotation and document feature values; this bug would only affect a small number of documents where the text contains many less-than signs (<<<) and few annotations. In particular, annotated documents that have been processed using the GATE tokeniser are extremely unlikely to be affected as each less-than sign is treated as a separate Token annotation.
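For readers unfamiliar with CDATA, the sketch below shows the two standard ways XML can carry literal less-than signs in text content: entity escaping and a CDATA section. It uses only the JDK's built-in DOM parser, and the `<text>` element and class name are hypothetical; this is not the GATE XML schema or GATE's own serialisation code. A correct reader should recover identical text from both forms.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataDemo {

    /** Parses a small XML string and returns the text of its root element. */
    static String textContent(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates plain text nodes and CDATA sections.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<text>a &lt;&lt;&lt; b</text>";
        String cdata = "<text><![CDATA[a <<< b]]></text>";

        // Both encodings represent the same character sequence.
        System.out.println(textContent(escaped)); // a <<< b
        System.out.println(textContent(cdata));   // a <<< b
    }
}
```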
This release also includes one small improvement to the Twitter hashtag tokeniser so it recognises the names of some political parties when they occur within hashtags such as #VoteLabour.
A.8 Version 8.4 (February 2017)
GATE Developer and Embedded 8.4 is mainly a bug fix release, with a small number of critical fixes compared to version 8.3. This will be the final major release of GATE before major re-structuring of the codebase and the plugin system for version 8.5.
-
Fixed an issue which had prevented the use of Java 8 lambda expressions in the RHS of JAPE rules, even when running on Java 8.
-
Removed OpenCalais and Zemanta plugins as the web services they depend on have changed and the plugins no longer work.
-
Fixed a bug that could cause the searchable datastore GUI to freeze.
-
Fixes to the TermRaider and Hindi sample applications
A.8.1 Java compatibility
For GATE 8.4 we recommend the use of the latest Java 8 from Oracle. If you are still restricted to Java 7, most components will still work with the exception of the Stanford CoreNLP tools and the TwitIE application (which uses the Stanford POS tagger). Future versions of GATE will require Java 8 as a minimum.
A.9 Version 8.3 (January 2017)
GATE Developer and Embedded 8.3 is mainly a bug fix release, with several critical fixes and functionality improvements.
-
JAPE grammars can now match and create annotation types and features with spaces or punctuation in their names, by using double quotes around the type or feature name (e.g. {"w:p"}).
-
Fixed a regression in 8.2 that meant saved application states and “export for GATE Cloud” packages created on Windows would not load on other platforms.
-
Fixed a bug in the Stanford CoreNLP plugin which would sometimes fail when GATE is installed in a directory whose path contains spaces.
-
Various improvements to the Twitter normaliser and emoticon finder.
-
Improvements to the Lang_French and Lang_German components. Further improvements will follow in the next release.
-
Improvement to the Crowd_Sourcing plugin to allow a default option to be specified for classification jobs.
-
Fixed Java version detection in the Windows EXE launcher.
-
Detection of document format using clues from the content is now much more efficient.
-
Fixed some GUI deadlocks in the searchable data store GUI and the plugin manager.
-
Fixed a long-standing bug in the regex sentence splitter for documents with long sequences of blank lines.
-
Removed Minipar parser plugin as the data files on which it depends are no longer available for download, and the Tagger_NormaGene plugin as the service it relies on is no longer online.
Plus the usual suite of miscellaneous smaller bug fixes.
A.9.1 Java compatibility
For GATE 8.3 we recommend the use of the latest Java 8 from Oracle. If you are still restricted to Java 7, most components will still work with the exception of the Stanford CoreNLP tools and the TwitIE application (which uses the Stanford POS tagger). Future versions of GATE will require Java 8 as a minimum.
A.10 Version 8.2 (May 2016)
GATE Developer and Embedded 8.2 is mainly a bug fix release – there are a few new plugins but the emphasis is on bug fixing and library updates.
-
New tools for temporal expression and event detection, including a wrapper for the HeidelTime tagger (section 23.38).
-
New language plugins for Danish and Welsh named entity recognition.
-
Performance improvements in the ANNIE NER system, in particular to deal better with hyphenated names and titles.
-
Improvements to TermRaider to support GATE documents that contain many independent sections (e.g. web forums, lists of tweets).
-
Bug fixes in the handling of Twitter JSON data – GATE now has full round-trip support for Twitter JSON, tweets can be loaded, annotated, and saved back to the same format accurately. The JSON format parser has been separated from the rest of the Twitter plugin, making it easier to add JSON support to non-TwitIE applications.
-
Updated dependencies – the Stanford_CoreNLP plugin now uses version 3.6.0 of Stanford CoreNLP, and the Groovy plugin uses Groovy version 2.4.4
-
GCP input and output handlers added to the Format_CSV plugin.
Plus the usual suite of miscellaneous smaller bug fixes.
A.10.1 Java compatibility
For GATE 8.2 we recommend the use of the latest Java 8 from Oracle. If you are still restricted to Java 7, most components will still work with the exception of the Stanford CoreNLP tools. Future versions of GATE will require Java 8 as a minimum.
A.11 Version 8.1 (June 2015)
A.11.1 New plugins and significant new features
-
Integration of the Stanford NER tools – all the Stanford tools in GATE have been brought together under a single Stanford_CoreNLP plugin.
-
Improved crowdsourcing tools (chapter 21), including tools to perform automatic adjudication of multiply-annotated documents.
-
Parsers for new document formats, including the DataSift format (section 23.32) for social media data, and an improved Twitter JSON parser (section 17.2) which can import the standoff annotations Twitter themselves provide (hashtags, etc.).
-
Improved support for data export, making it easy for plugins to add their own export formats accessible through the GUI and the API. New exporters are provided for the Twitter JSON format (section 17.3) and a more configurable inline XML format.
-
A new plugin for simplifying sentences using linguistic rules and other information (section 23.37), contributed by the ForgetIT project.
A.11.2 Library updates and bugfixes
-
Apache Tika (for parsing PDF, MS Word, etc.) updated to version 1.7.
-
ASM (for processing CREOLE metadata) updated to version 5.0.3. This allows the use of Java 8 language features such as lambdas in third-party plugins, though GATE itself remains compatible with Java 7.
-
Stanford CoreNLP tools updated to version 3.4 (the latest version that is compatible with Java 7).
-
MetaMap libraries (for UMLS) updated to version 2014.
-
Better support in the GATE Unicode Tokeniser for scripts that use supplementary characters beyond the basic 16 bit range.
-
Bugfixes in the segment processing PR.
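The note above about supplementary characters refers to a general property of Java strings: they are sequences of 16-bit UTF-16 code units, so a character outside the basic range occupies two `char` values (a surrogate pair). The sketch below uses only the standard library and is not taken from the tokeniser's source; it shows why code that walks a string one `char` at a time miscounts such characters, and how the code point API avoids that.

```java
final class SupplementaryDemo {

    /** Counts Unicode code points rather than UTF-16 code units. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (written as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());          // 4 UTF-16 code units
        System.out.println(countCodePoints(s));  // 3 code points

        // Iterating by code point keeps the surrogate pair together.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```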
A.11.3 Tools for developers
Three new tools have been added to the Developer_Tools plugin:
-
Menu options to produce Java heap dumps to aid debugging.
-
Menu option to dynamically increase the Log4J logging verbosity at runtime.
-
A tool that attempts to unload all plugins that are loaded but not currently in use.
Other changes that benefit developers include:
-
New helper methods in the gate.Utils class.
-
The gate.util.ProcessManager API has been extended to allow an external process to be kept running. This is especially useful for running some external tools which have very long startup times yet can be reused across documents.
-
Some changes in the management of classloaders that should reduce the potential for deadlocks.
…and as always, a range of smaller improvements and bug fixes.
A.12 Version 8.0 (May 2014)
GATE 8.0 is a major release which brings some major new features, many new and updated plugins, and significant under-the-bonnet changes to GATE Embedded.
A.12.1 Major changes
Java 7 required
GATE 8.0 requires Java 7 or later to run.
Tools for Twitter
A new “Twitter” plugin provides tools dedicated to Twitter data:
-
format parsers to handle Tweets in the JSON formats produced by the Twitter APIs
-
Twitter-specific components such as a tokeniser and POS tagger
-
the TwitIE named entity annotation pipeline.
See section 17.1 for full details.
ANNIE Refreshed
The ANNIE named entity annotation pipeline which has been the mainstay of many GATE applications for many years has been brought up to date, with new gazetteers and improved JAPE grammars giving improved precision and recall on common test corpora.
Tools for Crowd Sourcing
A new Crowd_Sourcing plugin provides facilities to support generation of manually annotated corpora via the CrowdFlower crowdsourcing platform. The plugin provides support for two different kinds of tasks, general entity annotation (e.g. determining which words in a given sentence are person names) and entity linking (e.g. for ontology-based annotation, where the spans of the entities are known but not which particular ontology instance each annotation corresponds to). Using crowdsourcing you can generate multiply-annotated gold standard corpora rapidly and at relatively low cost. For full details see chapter 21.
A.12.2 Other new and improved plugins
-
New language plugins to support Russian and Bulgarian
-
Integration of the Stanford POS Tagger (section 23.22), which is used by TwitIE
-
A document normalizer plugin, predominantly to normalize punctuation such as Microsoft Word “smart quotes” (see section 23.35)
-
Wrappers for the AlchemyAPI keyword and entity extraction services (in the AlchemyAPI plugin)
-
Wrapper for the TextRazor annotation service (see section 23.5).
-
New document format parser to populate a GATE corpus from one or more CSV files (see section 23.33).
-
Support for loading and saving GATE XML files in the binary FastInfoset format (see section 23.29).
-
Various improvements to the Learning plugin, in particular to support numeric and boolean features
-
Improvements to the TermRaider term extraction plugin (see section 23.34)
-
The OntoRoot Gazetteer (in the Gazetteer_Ontology_Based plugin) now supports tokenisers and POS taggers other than the default ANNIE PR types, making it possible to use other preprocessing tools for non-English data
-
Further improvements to the classloading model to better isolate plugins from one another.
-
A new enableDebugging runtime parameter for JAPE grammars will add additional features to every generated annotation detailing which rule was responsible for creating the annotation.
A.12.3 Bug fixes and other improvements
-
The annotation schema LR type is now available by default without the need to load any plugins. The schemas that were previously loaded by default by the ANNIE plugin must now be loaded explicitly if you require them (section 3.4.6). Annotation schemas now support the include element, so multiple schemas can be loaded by loading a single master file.
-
The segment processing PR (section 20.2.10) now preserves annotation IDs, allowing ID-sensitive tools such as coreference to work properly.
A.12.4 For developers
Changes of note for users of the GATE Embedded APIs include:
-
A new data model to represent relations between annotations, see section 7.7 for details. The Coref_Tools plugin has been retrofitted to use this new model to represent coreference chains. Relation information is preserved when saving documents as GATE XML, but note that such documents will not be compatible with older versions of GATE.
-
A new “resource helper” mechanism allows plugins to contribute additional actions to existing resource types, both in the Developer GUI (section 4.8.2) and in the Embedded API (section 7.19)
-
A new class gate.corpora.DocumentJsonUtils provides methods to export a GATE document in a JSON format compatible with that used by Twitter. See the JavaDoc documentation for details.
-
Many deprecated classes, fields and methods have been removed. If you were previously calling any of these deprecated APIs you will need to update your code accordingly. Also some classes in the GATE core that were only used by one plugin have been moved into the respective plugin’s source tree. In particular, Java RHS actions in JAPE rules no longer provide the long-deprecated annotations variable – use inputAS or outputAS as appropriate.
-
Many library dependencies have been updated to more recent versions.
-
The GATE APIs make much wider use of generics than previously – many places in the code that previously used raw types are now properly generic
-
A new Developer_Tools plugin (section 23.36) provides utilities to assist in debugging applications in GATE Developer.
If you are working on the core GATE source code, note that:
-
the source tree has been split into “main” and “test”, isolating the test classes from the rest of the source
-
each plugin is now a separate Eclipse “project”, and the main project is just the core sources, which makes it easier to control dependencies among the different parts
-
dependencies are no longer checked in to subversion, instead they are fetched at build time from the Maven central repository by Apache Ivy.
A.13 Version 7.1 (November 2012)
A.13.1 New plugins
The TermRaider plugin (see Section 23.34) provides a toolkit and sample application for term extraction.
Two new plugins, Tagger_Zemanta (since removed) and Tagger_Lupedia, provide PRs that wrap online annotation services provided by Zemanta and Ontotext.
A new plugin named Coref_Tools includes a framework for fast co-reference processing, and one PR that performs orthographical co-reference in the style of the ANNIE Orthomatcher. See Section 23.26 for full details.
A new Configurable Exporter PR in the Tools plugin allows annotations and features to be exported in formats specified by the user (e.g. for use with external machine learning tools). See Section 23.12 for details.
Support for reading a number of new document formats has also been added:
-
PubMed and the Cochrane Library formats (see Section 23.27).
-
CoNLL “IOB” format (see Section 5.5.10).
-
MediaWiki markup, both plain text and XML dump files such as those from Wikipedia (see Section 23.28).
In addition, “ready-made applications” have been added to many existing plugins (notably the Lang_* non-English language plugins) to make it easier to experiment with their PRs.
A.13.2 Library updates
Updated the Stanford Parser plugin (see Section 18.2) to version 2.0.4 of the parser itself, and added run-time parameters to the PR to control the parser’s dependency options.
The Measurement and Number taggers have been upgraded to use JAPE+ instead of JAPE. This should result in faster processing, and also allows for more memory efficient duplication of PR instances, i.e. when a pool of applications is created.
The OpenNLP plugin has been completely revised to use Apache OpenNLP 1.5.2 and the corresponding set of models. See Section 23.21 for details.
The native launcher for GATE on Mac OS X now works with Oracle Java 7 as well as Apple Java 6.
A.13.3 GATE Embedded API changes
Some of the most significant changes in this version are “under the bonnet” in GATE Embedded:
-
The class loading architecture underlying the loading of plugins and the generation of code from JAPE grammars has been re-worked. The new version allows for the complete unloading of plugins and for better memory handling of generated classes. Different plugins can now also use different versions of the same 3rd party libraries. There have also been a number of changes to the way plugins are (un)loaded which should provide for more consistent behaviour.
-
The GATE XML format has been updated to handle more value types: essentially every data type supported by XStream (http://xstream.codehaus.org/faq.html) should be usable as a feature name or value. Files in the new format can be opened without error by older GATE versions, but the data for the previously-unsupported types will be interpreted as a String containing an XML fragment.
-
The PRs defined in the ANNIE plugin are now described by annotations on the Java classes rather than explicitly inside creole.xml. The main reason for this change is to enable the definitions to be inherited by any subclasses of these PRs. Creating an empty subclass is a common way of providing a PR with a different set of default parameters (this is used extensively in the language plugins to provide custom gazetteers and named entity transducers). This has the added benefit of ensuring that new features also automatically percolate down to these subclasses. If you have developed your own PR that extends one of the ANNIE ones you may find it has acquired new parameters that were not there previously; if so, you may need to use the @HiddenCreoleParameter annotation to suppress them.
-
The corpus parameter of LanguageAnalyser (an interface most, if not all, PRs implement) is now annotated as @Optional as most implementations do not actually require the parameter to be set.
-
When saving an application the plugins are now saved in the same order in which they were originally loaded into GATE. This ensures that dependencies between plugins are correctly maintained when applications are restored.
-
API support for working with relations between annotations was added. See Section 7.7 for more details.
-
The method of populating a corpus from a single file has been updated to allow any mime type to be used when creating the new documents.
And numerous smaller bug fixes and performance improvements…
A.14 Version 7.0 (February 2012)
A.14.1 Major new features
The CREOLE Plugin Manager has been completely re-written and now includes support for installing new plugins from remote update sites. See section 3.6 for more details. In addition, plugins can now contribute additional “ready-made applications” to the GATE Developer menus alongside the standard applications (ANNIE, etc.). Details can be found in section 12.3.4.
A new plugin named JAPE_Plus has been added. It contains a new JAPE execution engine that includes various optimisations and should be significantly faster than the standard engine. JAPE_Plus has not yet been comprehensively tested, so it should be considered beta software, and used with caution. See Section 8.11 for more details.
A new Java-based launcher has been implemented which now replaces the use of Apache ANT for starting-up GATE Developer. The GATE Developer application now behaves in a more natural way in dock-based desktop environments such as Mac OS X and Ubuntu Unity.
Improved the support for processing biomedical text by adding new PRs to incorporate the following tools: AbGene, the NormaGene tagger, the GENIA sentence splitter, MutationFinder and the Penn BioTagger (contains a tokenizer and three taggers for gene, malignancy and variation). For full details of these new resources see section 16.1.
The Flexible Gazetteer PR has been rewritten to provide a better and faster implementation. The two parameters inputAnnotationSetName and outputAnnotationSetName have been renamed to inputASName and outputASName, however old applications with the old parameters should still work. Please see Section 13.6 for more details.
A.14.2 Removal of deprecated functionality
Various components were removed in this release as they have been unsupported and deprecated in previous releases:
-
the GATE Unicode Kit (GUK), which has been superseded by improved native support for localisation in the various target operating systems. If you still require GUK it is available as a separate software project at http://gate.svn.sourceforge.net/viewvc/gate/guk/trunk.
-
the database-backed datastore implementation.
-
the plugins Jape_Compiler (superseded by JAPE_Plus) and Ontology_OWLIM2.
In addition the Web_Search_Google, Web_Search_Yahoo and Web_Translate_Google plugins have been removed as the underlying web services on which they depend are no longer available. Documentation for obsolete plugins can be found in appendix C, and if you require any of them for your application please see plugins/Obsolete/README.TXT in the GATE Developer distribution.
A.14.3 Other enhancements and bug fixes
CREOLE plugins can now use Apache Ivy to include third-party dependencies. See section 4.7 for details.
The Default ANNIE Gazetteer now allows a user to specify different annotation types to be used for annotating entries from different lists. For example, a user may want to find city names mentioned in a gazetteer list (e.g. city.lst) and annotate the matching strings as City. Please see section 6.3 for more details.
The Segment Processing PR has two additional run-time parameters called segmentAnnotationFeatureName and segmentAnnotationFeatureValue. These parameters allow users to specify a constraint on feature name and feature value. If the user has provided values for these parameters, only the annotations with the specified feature name and feature value are processed with the Segment Processing PR. Also, the parameter controller has been renamed to analyser, which means the Segment Processing PR can now also run an individual PR on the specified segments. See 20.2.10 for more information on section-by-section processing.
The Hash Gazetteer (section 13.5) now properly supports the caseSensitive parameter (previously the parameter could be set but had no effect).
The Document Reset PR (Section 6.1) now defaults to keeping the Key set as well as Original markups. This makes working with pre-annotated gold standard documents less dangerous (assuming you put the gold standard annotations in a set called Key).
Updated Stanford Parser plugin (see Section 18.2) to version 1.6.8.
The TextCat based Language Identification PR now supports generating new language fingerprints. See section 15.1 for full details.
Added support for reading XCAS and XMI-format documents created by UIMA. See section 5.5.9 for details.
Various improvements to the GATE Developer GUI:
-
added support in the document editor to switch the principal text orientation, to better support documents written in right-to-left languages such as Arabic, Hebrew or Urdu (section 3.2).
-
added new mouse shortcuts to the Annotation Stack view in the document editor to speed up the curation process (section 3.4.3).
-
the document editor layout is now saved to the user preferences file, gate.xml. It means that you can give this file to a new user so s/he will have a preconfigured document editor (section 3.2).
-
the script behind an instance of the Groovy Scripting PR (section 7.16.2) can now be edited from within GATE Developer through a new visual resource which supports syntax highlighting.
The rule and phase names are now accessible in a JAPE Java RHS by the ruleName() and phaseName() methods and the name of the JAPE processing resource executing the JAPE transducer is accessible through the action context getPRName() method. See section 8.6.5.
A.15 Version 6.1 (April 2011)
A.15.1 New CREOLE Plugins
Tagger_Numbers to annotate many kinds of numbers in documents and determine their numeric values. The tagger can annotate numbers expressed in many forms including Arabic and Roman numerals, words (in English, French, German and Spanish) and scientific notation (4.3e6 = 4300000). See section 23.6 for full details.
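As a rough illustration of what "determining the numeric value" involves for two of the forms mentioned above, the following Python sketch converts Roman numerals and scientific notation to numbers. It is not the Tagger_Numbers implementation; the function names roman_to_int and numeric_value are hypothetical and chosen only for this example.

```python
# Illustrative only: not the Tagger_Numbers implementation.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the largest one seen so far
    # (to its right) is subtracted, otherwise it is added.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def numeric_value(text: str) -> float:
    """Return the value of a Roman numeral or a decimal/scientific literal."""
    if text and all(ch in ROMAN_VALUES for ch in text.upper()):
        return float(roman_to_int(text))
    # float() accepts scientific notation such as "4.3e6".
    return float(text)
```

With this sketch, numeric_value("4.3e6") yields 4300000.0, matching the example in the entry above.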
Tagger_Measurements to annotate many different forms of measurement expressions (“5.5 metres”, “1 minute 30 seconds”, “10 to 15 pounds”, etc.) along with their normalized values in SI units. See section 23.7 for full details.
Tagger_Boilerpipe, which contains a boilerpipe-based PR for performing content detection. See section 23.23 for full details.
Tagger_DateNormalizer to annotate and normalize dates within a document. See section 23.8 for full details.
Schema_Tools providing a “Schema Enforcer” PR that can be used to create a clean output annotation set based on a set of annotation schemas. See section 23.14 for full details.
Teamware_Tools providing a new PR called QA Summariser for Teamware. When documents are annotated using GATE Teamware, this PR can be used for generating a summary of agreements among annotators. See section 10.7 for full details.
Tagger_MetaMap has been rewritten to make use of the new MetaMap Java API features. There are numerous performance enhancements and bug fixes detailed in section 16.1.2. Note that this version of the plugin is not compatible with the version provided in GATE 6.0, though this earlier version is still available in the Obsolete directory if required.
A.15.2 Other new features and improvements
Added support for handling controller events to JAPE by making it possible to define ControllerStarted, ControllerFinished, and ControllerAborted code blocks in a JAPE file (see section 8.6.5).
JAPE Java right-hand-side code can now access an ActionContext object through the predefined field ctx which allows access to the corpus LR and the transducer PR and their features (see section 8.6.5).
Three new optional attributes can be specified in the <GATECONFIG> element of gate.xml or a local configuration file:
-
addNamespaceFeatures - set to “true” to deserialize namespace prefix and URI information as features.
-
namespaceURI - The feature name to use that will hold the namespace URI of the element, e.g. “namespace”
-
namespacePrefix - The feature name to use that will hold the namespace prefix of the element, e.g. “prefix”
Setting these attributes will alter GATE’s default namespace deserialization behaviour to remove the namespace prefix and add it as a feature, along with the namespace URI. This allows namespace-prefixed elements in the Original markups annotation set to be matched with JAPE expressions, and also allows namespace scope to be added to new annotations when serialized to XML. See 5.5.2 for details.
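The effect described above can be pictured with a small Python sketch. It is not GATE's deserialization code: the function split_namespace and its arguments are hypothetical, and the feature names simply reuse the example values “namespace” and “prefix” given above.

```python
# Illustrative only: mimics the described behaviour, not GATE's own code.
def split_namespace(element_name, namespaces,
                    uri_feature="namespace", prefix_feature="prefix"):
    """Split 'prefix:local' into a local name plus namespace features.

    `namespaces` maps prefixes to namespace URIs (assumed to be known
    from the xmlns declarations of the document being read).
    """
    prefix, separator, local = element_name.partition(":")
    if not separator:
        # No prefix present: keep the name and add no features.
        return element_name, {}
    features = {prefix_feature: prefix}
    if prefix in namespaces:
        features[uri_feature] = namespaces[prefix]
    return local, features
```

For example, an element named dc:title would become an annotation of type title carrying the prefix dc and the corresponding URI as features, which is what lets a JAPE pattern refer to the plain type name.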
Searchable Serial Datastores (Lucene-based) are now portable and can be moved across different systems. Also, several GUI improvements have been made to ease the creation of Lucene datastores. See chapter 9 for details.
The populate method that allowed populating a corpus from a trecweb file has been made more generic to accept a tag. The method extracts content between the start and end of this tag to create new documents. In GATE Developer, right-clicking on an instance of the Corpus and choosing the option “Populate from Single Concatenated File” allows users to populate the corpus using this functionality. See Section 7.4.5 for more details.
Fixed a regression in the JAPE parser that prevented the use of RHS macros that refer to a LHS label (named blocks :label { ... } and assignments :label.Type = {}).
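The idea of splitting one concatenated file into documents by tag can be sketched in a few lines of Python. This is only an approximation of the behaviour described above, not the GATE populate method; the function name split_concatenated is hypothetical, and whether GATE keeps or trims surrounding whitespace is not stated here, so the stripping is an assumption.

```python
import re


# Illustrative only: not the GATE populate implementation.
def split_concatenated(text, tag):
    """Return the content found between each <tag ...> and </tag> pair."""
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    # Each match would become the content of one new document.
    return [match.group(1).strip() for match in pattern.finditer(text)]
```

Called with the tag DOC on a trecweb-style file, each returned string corresponds to one document; nested elements with longer names such as DOCNO are left inside the extracted content.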
Enhanced the Groovy scriptable controller with some features inspired by the realtime controller, in particular the ability to ignore exceptions thrown by PRs and the ability to limit the running time of certain PRs. See section 7.16.3 for details.
The Ontology and Gazetteer_LKB plugins have been upgraded to use Sesame 3.2.3 and OWLIM 3.5.
The Websphinx Crawler PR has new runtime parameters for controlling the maximum page size and spoofing the user-agent.
A few bug fixes and improvements to the “recover” logic of the packagegapp Ant task (see section E.2).
…and many other smaller bugfixes.
Note: As of version 6.1, GATE Developer and Embedded require Java 6 or later and will no longer run on Java 5. If you require Java 5 compatibility you should use GATE 6.0.
A.16 Version 6.0 (November 2010)
A.16.1 Major new features
Added an annotation tool for the document editor: the Relation Annotation Tool (RAT). It is designed to annotate a document with ontology instances and to create relations between annotations with ontology object properties. It is similar to and compatible with the Ontology Annotation Tool (OAT) but focuses on relations between annotations. See section 14.6 for details.
Added a new scriptable controller to the Groovy plugin, whose execution strategy is controlled by a simple Groovy DSL. This supports more powerful conditional execution than is possible with the standard conditional controllers (for example, based on the presence or absence of a particular annotation, or a combination of several document feature values), rich flow control using Groovy loops, etc. See section 7.16.3 for details.
A new version of the Alignment Editor has been added to the GATE distribution. It includes several new features such as the new alignment viewer, the ability to create alignment tasks and store them in XML files, three different views to align the text (links view and matrix view, suitable for character, word and phrase alignments, and parallel view, suitable for sentence or long text alignment), an alignment exporter and many more. See chapter 20 for more information.
MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus. The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features. See section 16.1.2 for details.
A new plugin called Web_Translate_Google has been added with a PR called Google Translator PR in it. It allows users to translate text using the Google translation services. See section C.5 for more information.
New Gazetteer Editor for ANNIE Gazetteer that can be used instead of Gaze. It uses tables instead of a text area to display the gazetteer definition and lists, allows sorting on any column, filtering of the lists, reloading a list, etc. See section 13.2.2.
A.16.2 Breaking changes
This release contains a few small changes that are not backwards-compatible:
-
Changed the semantics of the ontology-aware matching mode in JAPE to take account of the default namespace in an ontology. Now class feature values that are not complete URIs will be treated as naming classes within the default namespace of the target ontology only, and not (as previously) any class whose URI ends with the specified name. This is more consistent with the way OWL normally works, as well as being much more efficient to execute. See section 14.8 for more details.
-
Updated the WordNet plugin to support more recent releases of WordNet than 1.6. The format of the configuration file has changed; if you are using the previous WordNet 1.6 support you will need to update your configuration. See section 23.16 for details.
-
The deprecated Tagger_TreeTagger plugin has been removed; applications that used it will need to be updated to use the Tagger_Framework plugin instead. See section 23.3 for details of how to do this.
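The change to ontology-aware matching in the first item of this list can be illustrated with a short Python sketch. It models only the name resolution step (real matching also involves subsumption), and all function names, the test for a "complete URI" and the example URIs are hypothetical.

```python
# Illustrative only: models name resolution, not JAPE's matching code.
def resolve_class(name, default_namespace):
    """Treat `name` as a complete URI if it has a scheme, otherwise
    place it in the ontology's default namespace (assumed rule)."""
    return name if "://" in name else default_namespace + name


def matches_old(class_uri, name):
    """Previous behaviour: any class whose URI ends with the name."""
    return class_uri.endswith(name)


def matches_new(class_uri, name, default_namespace):
    """New behaviour: only the class in the default namespace."""
    return class_uri == resolve_class(name, default_namespace)
```

With a default namespace of http://example.org/onto#, the name Person previously matched http://other.example/onto#Person as well; under the new semantics it matches only http://example.org/onto#Person.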
A.16.3 Other new features and bugfixes
The concept of templates has been introduced to JAPE. This is a way to declare named “variables” in a JAPE grammar that can contain placeholders that are filled in when the template is referenced. See section 8.1.6 for full details.
Added a JAPE operator to get the string covered by a left-hand-side label and assign it to a feature of a new annotation on the right hand side (see section 8.1.3).
Added a new API to the CREOLE registry to permit plugins that live entirely on the classpath. CreoleRegister.registerComponent instructs the registry to scan a single java Class for annotations, adding it to the set of registered plugins. See section 7.3 for details.
Maven artifacts for GATE are now published to the central Maven repository. See section 2.6.1 for details.
Bugfix: DocumentImpl no longer changes its stringContent parameter value whenever the document’s content changes. Among other things, this means that saved application states will no longer contain the full text of the documents in their corpus, and documents containing XML or HTML tags that were originally created from string content (rather than a URL) can now safely be stored in saved application states and the GATE Developer saved session.
A processing resource called Quality Assurance PR has been added in the Tools plugin. The PR wraps the functionality of the Quality Assurance Tool (section 10.3).
A new section for using the Corpus Quality Assurance from GATE Embedded has been written. See section 10.3.
The Generic Tagger PR (in the Tagger_Framework plugin) now allows more flexible specification of the input to the tagger, and is no longer limited to passing just the “string” feature from the input annotations. See section 23.3 for details.
Added new parameters and options to the LingPipe Language Identifier PR (section 23.20.5), and corrected the documentation for the LingPipe POS Tagger (section 23.20.3).
In the document editor, fixed several exceptions so that editing text while annotations are highlighted now works. You should now be able to edit the text and the annotations should behave correctly, that is to say move, expand or disappear according to the text insertions and deletions.
Options for the document editor: read-only and insert append/prepend have been moved from the options dialogue to the document editor toolbar, under the triangle icon at the top right that displays a menu with the options. See section 3.2.
Added new parameters and options to the Crawl PR and document features to its output.
Fixed a bug where ontology-aware JAPE rules worked correctly when the target annotation’s class was a subclass of the class specified in the rule, but failed when the two class names matched exactly.
Improved support for conditional pipelines containing non-LanguageAnalyser processing resources.
Added the current Corpus to the script binding for the Groovy Script PR, allowing a Groovy script to access and set corpus-level features. Also added callbacks that a Groovy script can implement to do additional pre- or post-processing before the first and after the last document in a corpus. See section 7.16 for details.
A.17 Version 5.2.1 (May 2010)
This is a bugfix release to resolve several bugs that were reported shortly after the release of version 5.2:
-
Fixed some bugs with the automatic “create instance” feature in OAT (the ontology annotation tool) when used with the new Ontology plugin.
-
Added validation to datatype property values of the date, time and datetime types.
-
Fixed a bug with Gazetteer_LKB that prevented it working when the dictionaryPath contained spaces.
-
Added a utility class to handle common cases of encoding URIs for use in ontologies, and fixed the example code to show how to make use of this. See chapter 14 for details.
-
The annotation set transfer PR now copies the feature map of each annotation it transfers, rather than re-using the same FeatureMap (this means that when used to copy annotations rather than move them, the copied annotation is independent from the original and modifying the features of one does not modify the other). See section 23.13 for details.
-
The Log4J log files are now created by default in the .gate directory under the user’s home directory, rather than being created in the current directory when GATE starts, to be more friendly when GATE is installed in a shared location where the user does not have write permission.
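The annotation set transfer fix listed above comes down to the difference between sharing a feature map and copying it. The Python sketch below shows that difference using plain dictionaries; it is not GATE code, the transfer function is hypothetical, and the use of a shallow copy is an assumption made for the example.

```python
# Illustrative only: plain dictionaries stand in for GATE annotations.
def transfer(annotation, copy_features=True):
    """Return a transferred annotation.

    With copy_features=True the feature map is copied (shallowly, by
    assumption); with False the same map object is re-used, which is
    the behaviour the fix replaced.
    """
    features = annotation["features"]
    if copy_features:
        features = dict(features)
    return {"type": annotation["type"], "features": features}


original = {"type": "Person", "features": {"gender": "female"}}

shared = transfer(original, copy_features=False)
shared["features"]["gender"] = "unknown"    # also changes the original

original["features"]["gender"] = "female"   # reset for the second case
copied = transfer(original, copy_features=True)
copied["features"]["gender"] = "unknown"    # original is left untouched
```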
This release also fixes some shortcomings in the Groovy support added by 5.2, in particular:
-
The corpora variable in the console now includes persistent corpora (loaded from a datastore) as well as transient corpora.
-
The subscript notation for annotation sets works with long values as well as ints, so someAS[annotation.start()..annotation.end()] works as expected.
A.18 Version 5.2 (April 2010)
A.18.1 JAPE and JAPE-related
Introduced a utility class gate.Utils containing static utility methods for frequently-used idioms such as getting the string covered by an annotation, finding the start and end offsets of annotations and sets, etc. This class is particularly useful on the right hand side of JAPE rules (section 8.6.5).
Added type parameters to the bindings map available on the RHS of JAPE rules, so you can now do AnnotationSet as = bindings.get("label") without a cast (see section 8.6.5).
Fixed a bug with JAPE’s handling of features called “class” in non-ontology-aware mode. Previously JAPE would always match such features using an equality test, even if a different operator was used in the grammar, i.e. {SomeType.class != "foo"} was matched as {SomeType.class == "foo"}. The correct operator is now used. Note that this does not affect the ontology-aware behaviour: when an ontology parameter is specified, “class” features are always matched using ontology subsumption.
Custom JAPE operators and annotation accessors can now be loaded from plugins as well as from the lib directory (see section 8.2.5).
A.18.2 Other Changes
Added a mechanism to allow plugins to contribute menu items to the “Tools” menu in GATE Developer. See section 4.8 for details.
Enhanced Groovy support in GATE: the Groovy console and Groovy Script PR (in the Groovy plugin) now import many GATE classes by default, and a number of utility methods are mixed in to some of the core GATE API classes to make them more natural to use in Groovy. See section 7.16 for details.
Modified the batch learning PR (in the Learning plugin) to make it safe to use several instances in APPLICATION mode with the same configuration file and the same learned model at the same time (e.g. in a multithreaded application). The other modes (including training and evaluation) are unchanged, and thus are still not safe for use in this way. Also fixed a bug that prevented APPLICATION mode from working anywhere other than as the last PR in a pipeline when running over a corpus in a datastore.
Introduced a simple way to create duplicate copies of an existing resource instance, with a way for individual resource types to override the default duplication algorithm if they know a better way to deal with duplicating themselves. See section 7.8.
Enhanced the Spring support in GATE to provide easy access to the new duplication API, and to simplify the configuration of the built-in Spring pooling mechanisms when writing multi-threaded Spring-based applications. See section 7.15.
The GAPP packager Ant task now respects the ordering of mapping hints, with earlier hints taking precedence over later ones (see section E.2.3).
Bug fix in the UIMA plugin from Roland Cornelissen - AnalysisEnginePR now properly shuts down the wrapped AnalysisEngine when the PR is deleted.
Patch from Matt Nathan to allow several instances of a gazetteer PR in an embedded application to share a single copy of their internal data structures, saving considerable memory compared to loading several complete copies of the same gazetteer lists (see section 13.10).
In the corpus quality assurance tool, measures for classification tasks have been added. You can also now set the beta for the F-score. This tool has been optimised to work with datastores so that it doesn’t need to read all the documents before comparing them.
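For readers unfamiliar with the beta parameter, the following Python sketch computes the standard F-beta measure from precision and recall. It uses the textbook formula and is not taken from the quality assurance tool itself; the function name f_score is hypothetical.

```python
# Standard F-beta formula; not the quality assurance tool's own code.
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 gives more weight to recall, beta < 1 to precision,
    and beta == 1 gives the usual F1 measure.
    """
    beta_squared = beta * beta
    denominator = beta_squared * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_squared) * precision * recall / denominator
```

For instance, with precision 0.8 and recall 0.4, beta = 1 gives about 0.533, whereas beta = 2 gives about 0.444 because recall is weighted more heavily.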
A.19 Version 5.1 (December 2009)
Version 5.1 is a major increment with lots of new features and integration of a number of important systems from 3rd parties (e.g. LingPipe, OpenNLP, OpenCalais, a revised UIMA connector). We’ve stuck with the 5 series (instead of jumping to 6.0) because the core remains stable and backwards compatible.
Other highlights include:
-
an entirely new ontology API from Johann Petrak of OFAI (the old one is still available but as a plugin)
-
new benchmarking facilities for JAPE from Andrew Borthwick and colleagues at Intelius
-
new quality assurance tools from Thomas Heitz and colleagues at Ontotext and Sheffield
-
a generic tagger integration framework from René Witte of Concordia University
-
several new code contributions from Ontotext, including a large knowledge-based gazetteer and various plugin wrappers from Marin Nozchev, Georgi Georgiev and colleagues
-
a revised and reordered user guide, amalgamated with the programmers’ guide and other materials
-
Groovy support, application composition, section-by-section processing and lots of other bits and pieces
A.19.1 New Features
LingPipe Support
LingPipe is a suite of Java libraries for the linguistic analysis of human language. We have provided a plugin called ‘LingPipe’ with wrappers for some of the resources available in the LingPipe library. For more details, see section 23.20.
OpenNLP Support
OpenNLP provides tools for sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference. The tools use Maximum Entropy modelling. We have provided a plugin called ‘OpenNLP’ with wrappers for some of the resources available in the OpenNLP Tools library. For more details, see section 23.21.
OpenCalais Support
We added a new PR called ‘OpenCalais PR’. This will process a document through the OpenCalais service, and add OpenCalais entity annotations to the document. (This plugin was subsequently removed in GATE 8.4)
Ontology API
The ontology API (package gate.creole.ontology) has been changed. The existing ontology implementation based on Sesame1 and OWLIM2 (package gate.creole.ontology.owlim) has been moved into the plugin Ontology_OWLIM2. An upgraded implementation based on Sesame2 and OWLIM3 that also provides a number of new features has been added as plugin Ontology.
Benchmarking Improvements
A number of improvements to the benchmarking support in GATE. JAPE transducers now log the time spent in individual phases of a multi-phase grammar and by individual rules within each phase. Other PRs that use JAPE grammars internally (the pronominal coreferencer, English tokeniser) log the time taken by their internal transducers. A reporting tool, called ‘Profiling Reports’ under the ‘Tools’ menu makes summary information easily available. For more details, see chapter 11.
GUI improvements
To deal with quality assurance of annotations, one component has been updated and two new components have been added. The annotation diff tool has a new mode to copy annotations to a consensus set, see section 10.2.1. An annotation stack view has been added in the document editor and it allows annotations to be copied to a consensus set, see section 3.4.3. A corpus view has been added for all corpora to get statistics like precision, recall and F-measure, see section 10.3.
An annotation stack view has been added in the document editor to make it easier to see overlapping annotations, see section 3.4.3.
ABNER Support
ABNER is A Biomedical Named Entity Recogniser, for finding entities such as genes in text. We have provided a plugin called ‘AbnerTagger’ with a wrapper for ABNER. For more details, see section 16.1.1.
Generic Tagger Support
A new plugin has been added to provide an easy route to integrate taggers with GATE. The Tagger_Framework plugin provides examples of incorporating a number of external taggers which should serve as a starting point for using other taggers. See Section 23.3 for more details.
Section-by-Section Processing
We have added a new PR called ‘Segment Processing PR’. As the name suggests, this PR allows processing individual segments of a document independently of one another. For more details, see section 20.2.10.
Application Composition
The gate.Controller implementations provided with the main GATE distribution now also implement the gate.ProcessingResource interface. This means that an application can now contain another application as one of its components.
Groovy Support
Groovy is a dynamic programming language based on Java. You can now use it as a scripting language for GATE, via the Groovy Console. For more details, see Section 7.16.
A.19.2 JAPE improvements
GATE now produces a warning when any Java right-hand-sides in JAPE rules make use of the deprecated annotations parameter. All bundled JAPE grammars have been updated to use the replacement inputAS and outputAS parameters as appropriate.
The new Imports: statement at the beginning of a JAPE grammar file can now be used to make additional Java import statements available to the Java RHS code, see 8.6.5.
The JAPE debugger has been removed. Debugging of JAPE has been made easier as stack traces now refer to the JAPE source file and line numbers instead of the generated Java source code.
The Montreal Transducer has been made obsolete.
A.19.3 Other improvements and bug fixes
Plugin names have been rationalised. Mappings exist so that existing applications will continue to work, but the new names should be used in the future. Plugin name mappings are given in Appendix B. Also, the Segmenter_Chinese plugin (formerly known as the chineseSegmenter plugin) is now part of the Lang_Chinese plugin.
The User Guide has been amalgamated with the Programmer’s Guide; all material can now be found in the User Guide. The ‘How-To’ chapter has been converted into separate chapters for installation, GATE Developer and GATE Embedded. Other material has been relocated to the appropriate specialist chapter.
Made Mac OS launcher 64-bit compatible. See section 2.2.1 for details.
The UIMA integration layer (Chapter 22) has been upgraded to work with Apache UIMA 2.2.2.
Oracle and PostgreSQL are no longer supported.
The MIAKT Natural Language Generation plugin has been removed.
The Minorthird plugin has been removed. Minorthird has changed significantly since this plugin was written. We will consider writing an up-to-date Minorthird plugin in the future.
A new gazetteer, Large KB Gazetteer (in the plugin ‘Gazetteer_LKB’) has been added, see Section 13.9 for details.
gate.creole.tokeniser.chinesetokeniser.ChineseTokeniser and related resources under the plugins/ANNIE/tokeniser/chinesetokeniser folder have been removed. Please refer to the Lang_Chinese plugin for resources related to the Chinese language in GATE.
Added an isInitialised() method to gate.Gate.
Added a parameter to the chemistry tagger PR (section 23.4) to allow it to operate on annotation sets other than the default one.
Plus many more smaller bugfixes...
A.20 Version 5.0 (May 2009)
Note: existing users – if you delete your user configuration file for any reason you will find that GATE Developer no longer loads the ANNIE plugin by default. You will need to manually select ‘load always’ in the plugin manager to get the old behaviour.
A.20.1 Major New Features
JAPE Language Improvements
Several new extensions to the JAPE language to support more flexible pattern matching. Full details are in Chapter 8 but briefly:
-
Negative constraints, that prevent a rule from matching if certain other annotations are present (Section 8.1.11).
-
Additional matching operators for feature values, so you can now look for {Token.length < 5}, {Lookup.minorType != "ignore"}, etc. as well as simple equality (Section 8.2).
-
‘Meta-property’ accessors (see Section 8.1.3) that permit access to the string covered by an annotation, the length of the annotation, etc., e.g. {Lookup@length > 4}.
-
Contextual operators, allowing you to search for one annotation contained within (or containing) another, e.g. {Sentence contains {Lookup.majorType == "location"}} (see Section 8.2.4).
-
Additional Kleene operator for ranges, e.g. ({Token})[2,5] matches between 2 and 5 consecutive tokens, see Section 8.1.4.
-
Additional operators can be added via runtime configuration (see Section 8.2.5).
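To make the range operator in the list above more concrete, the Python sketch below mimics what ({Token})[2,5] asks for: between 2 and 5 consecutive items that satisfy a test. It is a simplified, greedy model and not the JAPE matcher (JAPE's actual behaviour also depends on the matching style in use); the names match_range and is_token are hypothetical.

```python
# Illustrative only: a greedy model of a [min,max] range, not JAPE itself.
def match_range(items, start, predicate, minimum, maximum):
    """Consume up to `maximum` consecutive items satisfying `predicate`.

    Returns the number of items consumed starting at index `start`,
    or None if fewer than `minimum` consecutive items match.
    """
    count = 0
    while (count < maximum
           and start + count < len(items)
           and predicate(items[start + count])):
        count += 1
    return count if count >= minimum else None


def is_token(annotation_type):
    """Stand-in for the JAPE constraint {Token}."""
    return annotation_type == "Token"
```

Given three Token items followed by a SpaceToken, match_range(..., 2, 5) consumes three items; a single Token fails to match, and a run of seven is capped at five.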
Some of these extensions are similar to, but not the same as, those provided by the Montreal Transducer plugin. If you are already familiar with the Montreal Transducer, you should first look at Section 8.10 which summarises the differences.
Resource Configuration via Java 5 Annotations
Introduced an alternative style for supplying resource configuration information via Java 5 annotations rather than in creole.xml. The previous approach is still fully supported as well, and the two styles can be freely mixed. See Section 4.7 for full details.
Ontology-Based Gazetteer
Added a new plugin ‘Gazetteer_Ontology_Based’, which contains OntoRoot Gazetteer – a dynamically created gazetteer which is, in combination with a few other generic resources, capable of producing ontology-aware annotations over the given content with regard to the given ontology. For more details see Section 13.8.
Inter-Annotator Agreement and Merging
New plugins to support tasks involving several annotators working on the same annotation task on the same documents. The plugin ‘Inter_Annotator_Agreement’ (Section 10.5) computes inter-annotator agreement scores between the annotators, the ‘Copy_Annots_Between_Docs’ plugin (Section 23.19) copies annotations from several parallel documents into a single master document, and the ‘Annotation_Merging’ plugin (Section 23.18) merges annotations from multiple annotators into a single ‘consensus’ annotation set.
Packaging Self-Contained Applications for GATE Teamware
Added a mechanism to assemble a saved GATE application along with all the resource files it uses into a single self-contained package to run on another machine (e.g. as a service in GATE Teamware). This is available as a menu option (Section 3.9.4) which will work for most common cases, but for complex cases you can use the underlying Ant task described in Section E.2.
GUI Improvements
-
A new schema-driven tool to streamline manual annotation tasks (see Section 3.4.6).
-
Context-sensitive help on elements in the resource tree and when pressing the F1 key. Search in the mailing list from the Help menu. Help is displayed in your browser or in a Java browser if you don’t have one.
-
Improved search function inside documents with a regular expression builder. Search and replace annotation function in all annotation editors.
-
Remember for each resource type the last path used when loading/saving a resource.
-
Remember the last annotations selected in the annotation set view when you shift click on the annotation set view button.
-
Improved context menu and, where possible, added drag and drop in: resource tree, annotation set view, annotation list view, corpus view, controller view. The context menu key can now be used if you have Java 1.6.
-
New dialog box for error messages with user-oriented messages, optional display of the configuration and suggestions for useful actions. This will progressively replace the old stack trace dump in the message panel, which is still present for the moment but should be hidden by default in the future.
-
Add a read-only document mode that can be enabled from the Options menu.
-
Add a selection filter in the status bar of the annotations list table to easily select rows based on the text you enter.
-
Add the last five applications loaded/saved in the context menu of the language resources in the resources tree.
-
Display more information about what is going on in the waiting dialog box when running an application. The goal is to extend this with a global progress bar and an estimated time.
A.20.2 Other New Features and Improvements
-
New parser plugins: A new plugin for the Stanford Parser (see Section 18.2) and a rewritten plugin for the RASP NLP tools.
-
A new sentence splitter, based on regular expressions, has been added to the ANNIE plugin. More details in Section 6.5.
-
‘Real-time’ corpus controller (Section 4.4), which terminates processing of a document if it takes longer than a configurable timeout.
-
Major update to Annie OrthoMatcher coreference engine. Now correctly matches the sequence ‘David Jones ... David ... David Smith ... David’ as referring to two people. Also handles nicknames (David = Dave) via a new nickname list. Added optional parameter ‘highPrecisionOrgs’, which if set to true turns off riskier org matching rules. Many misc. bug fixes.
-
Improved alignment editor (Chapter 20) with several advanced features and an API for adding your own actions to the editor.
-
A new plugin for Chinese word segmentation, which is based on our work using machine learning algorithms for the Sighan-05 Chinese word segmentation task. It can learn a model from manually segmented text, and apply a learned model to segment Chinese text. In addition several learned models are available with the plugin, which can be used to segment text. For details about the plugin and those learned models see Section 15.6.1.
-
New features in the ML API to produce an n-gram based language model from a corpus and a so-called ‘document-term matrix’ (see Section 23.15). Also introduced features to support active learning, a new learning algorithm (PAUM) and various optimisations including the ability to use an external executable for SVM training. Full details in Chapter 19.
-
A new plugin to compute BDM scores for an ontology. The BDM score can be used to evaluate ontology based information extraction and classification. For details about the plugin see Section 10.6.
-
Added new ‘getCovering’ method to AnnotationSet. This method returns annotations that completely span the provided range. An optional annotation type parameter can be provided to further limit the returned set.
-
Complete redesign of ANNIC GUI. More details in Section 9.
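The ‘getCovering’ addition listed above returns annotations that completely span a given range. The Python sketch below shows that selection rule on simple tuples; it is not the GATE AnnotationSet API, and the Annotation type and get_covering function are hypothetical stand-ins.

```python
from typing import List, NamedTuple, Optional


# Illustrative only: simple stand-ins for GATE annotations and sets.
class Annotation(NamedTuple):
    type: str
    start: int
    end: int


def get_covering(annotations: List[Annotation], start: int, end: int,
                 annotation_type: Optional[str] = None) -> List[Annotation]:
    """Return annotations that completely span the range [start, end].

    If `annotation_type` is given, only annotations of that type are kept.
    """
    return [
        a for a in annotations
        if a.start <= start and a.end >= end
        and (annotation_type is None or a.type == annotation_type)
    ]
```

An annotation that merely overlaps the range (for example one that starts inside it) is not returned, which is what distinguishes a covering query from an overlap query.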
A.20.3 Specific Bug Fixes
-
HTML document format parser: several bugs fixed, including a null pointer exception if the document contained certain characters illegal in HTML (#1754749). Also, the HTML parser now respects the ‘Add space on markup unpack’ configuration option – previously it would always add space, even if the option was set to false.
-
Fixed a severe performance bug in the Annie Pronominal Coreferencer resulting in a 50X speed improvement.
-
JAPE did not always correctly handle the case when the input and output annotation sets for a transducer were different. This has now been fixed.
-
‘Save Preserving Format’ was not correctly escaping ampersands and less-than signs when two HTML entities are close together. Only the first one was replaced: A & B & C was output as A &amp; B & C instead of A &amp; B &amp; C. This has now been fixed, and the fix is also valid for the flexible exporter but only if the standoff annotations parameter is set to false.
Plus many more minor bug fixes
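The escaping fix for ‘Save Preserving Format’ in this section concerns replacing every ampersand and less-than sign, not just the first. A minimal Python sketch of the intended result follows; it is not the GATE exporter code, and the function name escape_markup is hypothetical.

```python
# Illustrative only: shows the intended output, not GATE's exporter.
def escape_markup(text):
    """Escape every ampersand and less-than sign in `text`.

    Ampersands are handled first so that the '&' introduced by '&lt;'
    is not escaped a second time.
    """
    return text.replace("&", "&amp;").replace("<", "&lt;")
```

Applied to the text A & B & C, this produces A &amp;amp; B &amp;amp; C when viewed as raw characters, that is, both ampersands are written as the five-character entity rather than only the first.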
A.21 Version 4.0 (July 2007)
A.21.1 Major New Features
ANNIC
ANNotations In Context: a full-featured annotation indexing and retrieval system designed to support corpus querying and JAPE rule authoring. It is provided as part of an extension of the Serial Datastores, called Searchable Serial Datastore (SSD). See Section 9 for more details.
New Machine Learning API
A brand new machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. See Chapter 19 for more details.
Ontology API
A new ontology API, based on OWL In Memory (OWLIM), which offers a better API, a revised ontology event model and an improved ontology editor, to name but a few. See Chapter 14 for more details.
OCAT
Ontology-based Corpus Annotation Tool to help annotators to manually annotate documents using ontologies. For more details please see Section 14.5.
Alignment Tools
A new set of components (e.g. CompoundDocument, AlignmentEditor etc.) that help in building alignment tools and in carrying out cross-document processing. See Chapter 20 for more details.
New HTML Parser
A new HTML document format parser, based on Andy Clark’s NekoHTML. This parser is much better than the old one at handling modern HTML and XHTML constructs, JavaScript blocks, etc., though the old parser is still available for existing applications that depend on its behaviour.
Java 5.0 Support
GATE now requires Java 5.0 or later to compile and run. This brings a number of benefits:
-
Java 5.0 syntax is now available on the right hand side of JAPE rules with the default Eclipse compiler. See Section 8.6 for details.
-
enum types are now supported for resource parameters. See Section 7.12 for details on defining the parameters of a resource.
-
AnnotationSet and the CreoleRegister take advantage of generic types. The AnnotationSet interface is now an extension of Set<Annotation> rather than just Set, which should make for cleaner and more type-safe code when programming to the API, and the CreoleRegister now uses parameterized types, which are backwards-compatible but provide better type-safety for new code.
A.21.2 Other New Features and Improvements
-
Hiding the view for a particular resource (by right clicking on its tab and selecting ‘Hide this view’) will now completely close the associated viewers and dispose them. Re-selecting the same resource at a later time will lead to re-creating the necessary viewers and displaying them. This has two advantages: firstly it offers a mechanism for disposing views that are not needed any more without actually closing the resource and secondly it provides a way to refresh the view of a resource in the situations where it becomes corrupted.
-
The DataStore viewer now allows multiple selections. This lets users load or delete an arbitrarily large number of resources in one operation.
-
The Corpus editor has been completely overhauled. It now allows re-ordering of documents as well as sorting the document list by either index or document name.
-
Support has been added for resource parameters of type gate.FeatureMap, and it is also possible to specify a default value for parameters whose type is Collection, List or Set. See Section 7.3 for details.
-
(Feature Request #1446642) After several requests, a mechanism has been added to allow overriding of GATE’s document format detection routine. A new creation-time parameter mimeType has been added to the standard document implementation, which forces a document to be interpreted as a specific MIME type and prevents the usual detection based on file name extension and other information. See Section 5.5.1 for details.
-
A capability has been added to specify arbitrary sets of additional features on individual gazetteer entries. These features are passed forward into the Lookup annotations generated by the gazetteer. See Section 6.3 for details.
-
As an alternative to the Google plugin, a new plugin called yahoo has been added to allow users to submit their query to the Yahoo search engine and to load the found pages as GATE documents. See Section C.3 for more details.
-
It is now easier to run a corpus pipeline over a single document in the GATE Developer GUI – documents now provide a right-click menu item to create a singleton corpus containing just this document. See Section 3.3 for details.
-
A new interface has been added that lets PRs receive notification at the start and end of execution of their containing controller. This is useful for PRs that need to do cleanup or other processing after a whole corpus has been processed. See Section 4.4 for details.
-
The GATE Developer GUI does not call System.exit() any more when it is closed. Instead an effort is made to stop all active threads and to release all GUI resources, which leads to the JVM exiting gracefully. This is particularly useful when GATE is embedded in other systems as closing the main GATE window will not kill the JVM process any more.
-
The set of AnnotationSchemas that used to be included in the core gate.jar and loaded as builtins have now been moved to the ANNIE plugin. When the plugin is loaded, the default annotation schemas are instantiated automatically and are available when doing manual annotation.
-
There is now support in creole.xml files for automatically creating instances of a resource that are hidden (i.e. do not show in the GUI). One example of this can be seen in the creole.xml file of the ANNIE plugin where the default annotation schemas are defined.
-
A couple of helper classes have been added to assist in using GATE within a Spring application. Section 7.15 explains the details.
-
Improvements have been made to the thread-safety of some internal components, which means that it is now safe to create resources in multiple threads (though it is not safe to use the same resource instance in more than one thread). This is a big advantage when using GATE in a multithreaded environment, such as a web application. See Section 7.14 for details.
-
Plugins can now provide custom icons for their PRs and LRs in the plugin JAR file. See Section 7.12 for details.
-
It is now possible to override the default location for the saved session file using a system property. See Section 2.3 for details.
-
The TreeTagger plugin (‘Tagger_TreeTagger’) supports a system property to specify the location of the shell interpreter used for the tagger shell script. In combination with Cygwin this makes it much easier to use the tagger on Windows.
-
The Buchart plugin has been removed. It is superseded by SUPPLE, and instructions on how to upgrade your applications from Buchart to SUPPLE are given in Section 18.1. The probability finder plugin has also been removed, as it is no longer maintained.
-
The bootstrap wizard now creates a basic plugin that builds with Ant. Since a Unix-style make command is no longer required this means that the generated plugin will build on Windows without needing Cygwin or MinGW.
-
The GATE source code has moved from CVS into Subversion. See Section 2.2.3 for details of how to check out the code from the new repository.
-
An optional parameter, keepOriginalMarkupsAS, has been added to the DocumentReset PR which allows users to decide whether to keep the Original Markups AS or not while resetting the document. See Section 6.1 for more details.
A.21.3 Bug Fixes and Optimizations
-
The Morphological Analyser has been optimized. A new FSM-based implementation, with minor alterations to the basic FSM algorithm, has replaced the previous one. Earlier profiling figures showed that the morpher, when integrated with the ANNIE application, used to take up to 60% of the overall processing time. The optimized version takes only 7.6% of the total processing time. See Section 23.10 for more details on the morpher.
-
The ANNIE Sentence Splitter was optimised. The new version is about twice as fast as the previous one. The actual speed increase varies widely depending on the nature of the document.
-
The implementation of the OrthoMatcher component has been improved. This resource now takes significantly less time on large documents.
-
The implementation of AnnotationSets has been improved. GATE now requires up to 40% less memory to run and is also 20% faster on average. The get methods of AnnotationSet return instances of ImmutableAnnotationSet. Any attempt at modifying the content of these objects will trigger an Exception. An empty ImmutableAnnotationSet is returned instead of null.
-
The Chemistry tagger (Section 23.4) has been updated with a number of bugfixes and improvements.
-
The Document user interface has been optimised to deal better with large bursts of events which tend to occur when the document that is currently displayed gets modified. The main advantages brought by this new implementation are:
-
The document UI refreshes faster than before.
-
The presence of the GUI for a document induces a smaller performance penalty than it used to. Due to a better threading implementation, machines benefiting from multiple CPUs (e.g. dual CPU, dual core or hyperthreading machines) should only see a negligible increase in processing time when a document is displayed compared to the situations where the document view is not shown. In the previous version, displaying a document while it was processed used to increase execution time by an order of magnitude.
-
The GUI is more responsive now when a large number of annotations are displayed, hidden or deleted.
-
The strange exceptions that used to occur occasionally while working with the document GUI should not happen any more.
-
And as always there are many smaller bugfixes too numerous to list here...
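The practical consequence of the AnnotationSet change listed above (get methods return immutable results, and an empty set rather than null) can be illustrated with plain JDK collections. The sketch below uses Collections.unmodifiableSet purely as an analogy; it does not use GATE's ImmutableAnnotationSet, the class and its methods are hypothetical, and the exception type shown is the JDK's, which may differ from the one GATE throws. The pattern for client code is: do not null-check the result, and copy it before modifying it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ImmutableResultSketch {
    private final Map<String, Set<String>> byType = new HashMap<String, Set<String>>();

    void add(String type, String value) {
        Set<String> values = byType.get(type);
        if (values == null) {
            values = new HashSet<String>();
            byType.put(type, values);
        }
        values.add(value);
    }

    /** Returns a read-only view; an empty set (never null) if nothing matches. */
    Set<String> get(String type) {
        Set<String> found = byType.get(type);
        if (found == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(found);
    }

    public static void main(String[] args) {
        ImmutableResultSketch store = new ImmutableResultSketch();
        store.add("Person", "Alice");

        // No match yields an empty result rather than null.
        System.out.println(store.get("Location").isEmpty());

        try {
            store.get("Person").add("Bob"); // modifying the returned view fails
        } catch (UnsupportedOperationException e) {
            System.out.println("read-only result");
        }

        // To modify, copy the result first.
        Set<String> copy = new HashSet<String>(store.get("Person"));
        copy.add("Bob");
        System.out.println(copy.size());
    }
}
```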
A.22 Version 3.1 (April 2006)
A.22.1 Major New Features
Support for UIMA
UIMA (http://www.research.ibm.com/UIMA/) is a language processing framework developed by IBM. UIMA and GATE share some functionality but are complementary in most respects. GATE now provides an interoperability layer to allow UIMA applications to include GATE components in their processing and vice-versa. For full information, see Chapter 22.
New Ontology API
The ontology layer has been rewritten in order to provide an abstraction layer between the model representation and the tools used for input and output of the various representation formats. An implementation that uses Jena 2 (http://jena.sourceforge.net/ontology) for reading and writing OWL and RDF(S) is provided.
Ontotext Japec Compiler
Japec is a compiler for JAPE grammars developed by Ontotext Lab. It has some limitations compared to the standard JAPE transducer implementation, but can run JAPE grammars up to five times as fast. By default, GATE still uses the stable JAPE implementation, but if you want to experiment with Japec, see Section C.1.
A.22.2 Other New Features and Improvements
-
Addition of a new JAPE matching style ‘all’. This is similar to Brill, but once all rules from a given start point have matched, the matching will continue from the next offset to the current one, rather than from the position in the document where the longest match finishes. More details can be found in Section 8.4.
-
Limited support for loading PDF and Microsoft Word document formats. Only the text is extracted from the documents, no formatting information is preserved.
-
The Buchart parser has been deprecated and replaced by a new plugin called SUPPLE - the Sheffield University Prolog Parser for Language Engineering. Full details, including information on how to move your application from Buchart to SUPPLE, are in Section 18.1.
-
The Hepple POS Tagger is now open-source. The source code has been included in the GATE Developer/Embedded distribution, under src/hepple/postag. More information about the POS Tagger can be found in Section 6.6.
-
Minipar is now supported on Windows. minipar-windows.exe, a modified version of pdemo.cpp, has been added under the gate/plugins/Parser_Minipar directory to allow users to run Minipar on the Windows platform. When using Minipar on Windows, this binary should be provided as the value of the miniparBinary parameter. (The Minipar plugin has been subsequently retired.)
-
The XmlGateFormat writer (Save As Xml from the GATE Developer GUI, gate.Document.toXml() from the GATE Embedded API) and reader have been modified to write and read GATE annotation IDs. For backward compatibility reasons the old reader has been kept. This change fixes a bug which manifested in the following situation: if a GATE document had annotations carrying features whose values were numbers representing other GATE annotation IDs, then after saving the document to XML and reloading it, those feature values could have become invalid by pointing to other annotations. By saving and restoring the GATE annotation IDs, the consistency of the GATE document is maintained. For more information, see Section 1.
-
The NP chunker and chemistry tagger plugins have been updated. Mark A. Greenwood has relicenced them under the LGPL, so their source code has been moved into the GATE Developer/Embedded distribution. See Sections 23.2 and 23.4 for details.
-
The Tree Tagger wrapper has been updated with an option to be less strict when characters that cannot be represented in the tagger’s encoding are encountered in the document.
-
JAPE Transducers can be serialized into binary files. An option to load the serialized version of a JAPE Transducer (the init-time parameter binaryGrammarURL) has also been implemented, and can be used as an alternative to the grammarURL parameter. More information can be found in Section 8.9.
-
On Mac OS, GATE Developer now behaves more ‘naturally’. The application menu items and keyboard shortcuts for About and Preferences now do what you would expect, and exiting GATE Developer with command-Q or the Quit menu item properly saves your options and current session.
-
Updated versions of Weka (3.4.6) and Maxent (2.4.0).
-
Optimisation in gate.creole.ml: the conversion of AnnotationSet into ML examples is now faster.
-
It is now possible to create your own implementation of Annotation, and have GATE use this instead of the default implementation. See AnnotationFactory and AnnotationSetImpl in the gate.annotation package for details.
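To illustrate the difference between the Brill and ‘all’ matching styles described in the list above, here is a deliberately simplified sketch. It is not the JAPE implementation: real JAPE rules match over annotations, whereas this toy version uses literal strings as "rules" and character offsets as start points, and the class and method names are hypothetical. In both styles every rule that matches at a start point fires; the only difference is where scanning resumes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MatchingStyleSketch {
    /**
     * Scans the text and records "start:rule" for every rule matching at a start point.
     * If allStyle is false (Brill-like), scanning resumes where the longest match ended.
     * If allStyle is true ('all'-like), scanning resumes at the next offset.
     */
    static List<String> run(String text, List<String> rules, boolean allStyle) {
        List<String> fired = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int longest = 0;
            for (String rule : rules) {
                if (text.startsWith(rule, pos)) {
                    fired.add(pos + ":" + rule);
                    longest = Math.max(longest, rule.length());
                }
            }
            if (allStyle || longest == 0) {
                pos += 1;        // next offset
            } else {
                pos += longest;  // end of the longest match
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("ab", "abc", "bc");
        System.out.println(run("abcd", rules, false)); // prints: [0:ab, 0:abc]
        System.out.println(run("abcd", rules, true));  // prints: [0:ab, 0:abc, 1:bc]
    }
}
```

In the Brill-like run the match for "bc" at offset 1 is never tried, because scanning jumps past the longest match starting at offset 0; in the ‘all’-like run it is found.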
A.22.3 Bug Fixes
-
The Tree Tagger wrapper has been updated in order to run under Windows.
-
The SUPPLE parser has been made more user-friendly. It now produces more helpful error messages if things go wrong. Note that you will need to update any saved applications that include SUPPLE to work with this version - see Section 18.1 for details.
-
Miscellaneous fixes in the Ontotext Japec compiler.
-
Optimization: the creation of a Document is now much faster.
-
Google plugin: The optional pagesToExclude parameter was causing a NullPointerException when left empty at run time. Full details about the plugin functionality can be found in Section C.2.
-
Minipar, SUPPLE, TreeTagger: These plugins, which call external processes, have been fixed to cope better with path names that contain spaces. Note that some of the external tools themselves still have problems handling spaces in file names, but these are beyond our control to fix. If you want to use any of these plugins, be sure to read the documentation to see if they have any such restrictions. (The Minipar plugin has been subsequently retired.)
-
When using a non-default location for GATE configuration files, the configuration data is saved back to the correct location when GATE exits. Previously the default locations were always used.
-
Jape Debugger: the JAPE debugger was generating a ConcurrentModificationException during an attempt to run ANNIE. There was no exception when running without the debugger enabled. As a result of the fix, one unnecessary and incorrect callback to the debugger was removed from the SinglePhaseTransducer class.
-
Plus many other small bugfixes...
A.23 January 2005
Release of version 3.
New plugins for processing in various languages (see 15). These are not full IE systems but are designed as starting points for further development (French, German, Spanish, etc.), or as sample or toy applications (Cebuano, Hindi, etc.).
Other new plugins:
-
Chemistry Tagger 23.4
-
Montreal Transducer (since retired)
-
RASP Parser
-
MiniPar (since retired)
-
Buchart Parser 18.1
-
MinorThird (Version 5.1: removed)
-
NP Chunker 23.2
-
Stemmer 23.9
-
TreeTagger
-
Probability Finder
-
Crawler
-
Google PR C.2
Support for SVM Light, a support vector machine implementation, has been added to the machine learning plugin ‘Learning’.
A.24 December 2004
GATE no longer depends on the Sun Java compiler to run, which means it will now work on any Java runtime environment of at least version 1.4. JAPE grammars are now compiled using the Eclipse JDT Java compiler by default.
A welcome side-effect of this change is that it is now much easier to integrate GATE-based processing into web applications in Tomcat.
A.25 September 2004
GATE applications are now saved in XML format using the XStream library, rather than by using native Java serialization. On loading an application, GATE will automatically detect whether it is in the old or the new format, and so applications in both formats can be loaded. However, older versions of GATE will be unable to load applications saved in the XML format. (A java.io.StreamCorruptedException: invalid stream header exception will occur.) It is possible to get new versions of GATE to use the old format by setting a flag in the source code. (See the Gate.java file for details.) This change has been made because it allows the details of an application to be viewed and edited in a text editor, which is sometimes easier than loading the application into GATE.
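The StreamCorruptedException mentioned above arises because a Java serialization stream always begins with a fixed magic number, which an XML file does not have. The following sketch shows this using only the JDK. It is not GATE's format detection code, and the class and method names are hypothetical; it simply demonstrates one way the two formats can be told apart and why a serialization-only reader rejects XML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.Serializable;
import java.io.StreamCorruptedException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class FormatSniffSketch {
    /** True if the data starts with the Java serialization magic number (0xACED). */
    static boolean looksLikeJavaSerialization(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int magic = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        return magic == (ObjectStreamConstants.STREAM_MAGIC & 0xFFFF);
    }

    /** Serializes a value with native Java serialization. */
    static byte[] serialize(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(value);
            out.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Mimics a reader that only understands native serialization.
     * Returns null if the stream header is accepted, otherwise the error message.
     */
    static String openWithOldReader(byte[] data) {
        try {
            // The constructor reads and validates the stream header.
            new ObjectInputStream(new ByteArrayInputStream(data)).close();
            return null;
        } catch (StreamCorruptedException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] oldFormat = serialize("a saved application");
        byte[] xmlFormat = "<?xml version=\"1.0\"?><app/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeJavaSerialization(oldFormat));
        System.out.println(looksLikeJavaSerialization(xmlFormat));
        System.out.println(openWithOldReader(xmlFormat));
    }
}
```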
A.26 Version 3 Beta 1 (August 2004)
Version 3 incorporates a lot of new functionality and some reorganisation of existing components.
Note that Beta 1 is feature-complete but needs further debugging (please send us bug reports!).
Highlights include: completely rewritten document viewer/editor; extensive ontology support; a new plugin management system; separate .jar files and a Tomcat classloading fix; lots more CREOLE components (and some more to come soon).
Almost all the changes are backwards-compatible; some recent classes have been renamed (particularly the ontologies support classes) and a few events added (see below); datastores created by version 3 will probably not read properly in version 2. If you have problems use the mailing list and we’ll help you fix your code!
The gory details:
-
Anonymous CVS is now available. See Section 2.2.3 for details.
-
CREOLE repositories and the components they contain are now managed as plugins. You can select the plugins the system knows about (and add new ones) by going to ‘Manage CREOLE Plugins’ on the file menu.
-
The gate.jar file no longer contains all the subsidiary libraries and CREOLE component resources. This makes it easier to replace library versions and/or not load them when not required (libraries used by CREOLE builtins will now not be loaded unless you ask for them from the plugins manager console).
-
ANNIE and other bundled components now have their resource files (e.g. pattern files, gazetteer lists) in a separate directory in the distribution – gate/plugins.
-
Some testing with Sun’s JDK 1.5 pre-releases has been done and no problems reported.
-
The gate:// URL system used to load CREOLE and ANNIE resources in past releases is no longer needed. This means that loading in systems like Tomcat is now much easier.
-
Mac OS X is now properly supported by the installer and the runtime.
-
An Ontology-based Corpus Annotation Tool (OCAT) has been implemented as a plugin. Documentation of its functionality is in Section 14.5.
-
The NLG Lexical tools from the MIAKT project have now been released.
-
The Features viewer/editor has been completely updated – see Section 3.4.5 for details.
-
The Document editor has been completely rewritten – see Section 3.2 for more information.
-
The datastore viewer is now a full-size VR – see Section 3.9.2 for more information.
A.27 July 2004
GATE documents now fire events when the document content is edited. This was added in order to support the new facility of editing documents from the GUI. This change will break backwards compatibility by requiring all DocumentListener implementations to implement a new method:
public void contentEdited(DocumentEvent e);
A.28 June 2004
A new algorithm has been implemented for the AnnotationDiff function. A new, more usable, GUI is included, and an ‘Export to HTML’ option added. More details about the AnnotationDiff tool are in Section 10.2.1.
A new build process, based on ANT (http://ant.apache.org/) is now available. The old build process, based on make, is now unsupported. See Section 2.6 for details of the new build process.
A Jape Debugger from Ontos AG has been integrated. You can turn integration ON with command line option ‘-j’. If you run GATE Developer with this option, the new menu item for Jape Debugger GUI will appear in the Tools menu. The default value of integration is OFF. We are currently awaiting documentation for this.
NOTE! Keep in mind that a ClassCastException is thrown if you try to debug a ConditionalCorpusPipeline. The Jape Debugger is designed for Corpus Pipeline only. The Ontos code needs to be changed to allow debugging of ConditionalCorpusPipeline.
A.29 April 2004
There are now two alternative strategies for ontology-aware grammar transduction:
-
using the [ontology] feature both in grammars and annotations; with the default Transducer.
-
using the ontology-aware transducer – passing an ontology LR to a new subsume method in the SimpleFeatureMapImpl. The latter strategy does not check for ontology features (this will make the writing of grammars easier – no need to specify ontology).
The changes are in:
-
SinglePhaseTransducer (always call subsume with ontology – if null then the ordinary subsumption takes place)
-
SimpleFeatureMapImpl (new subsume method using an ontology LR)
More information about the ontology-aware transducer can be found in Section 14.8.
A morphological analyser PR has been added. This finds the root and affix values of a token and adds them as features to that token.
A flexible gazetteer PR has been added. This performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. See Section 13.6 for details.
A.30 March 2004
Support was added for the MAXENT machine learning library.
A.31 Version 2.2 – August 2003
Note that GATE 2.2 works with JDK 1.4.0 or above. Version 1.4.2 is recommended, and is the one included with the latest installers.
GATE has been adapted to work with PostgreSQL 7.3. Compatibility with PostgreSQL 7.2 has been preserved.
Note that as of Version 5.1 PostgreSQL is no longer supported.
New library version – Lucene 1.3 (rc1)
A bug in gate.util.Javac has been fixed in order to account for situations when String literals require an encoding different from the platform default.
Temporary .java files used to compile JAPE RHS actions are now saved using UTF-8 and the ‘-encoding UTF-8’ option is passed to the javac compiler.
A custom tools.jar is no longer necessary.
Minor changes have been made to the look and feel of GATE Developer to improve its appearance with JDK 1.4.2.
A.32 Version 2.1 – February 2003
Integration of Machine Learning PR and WEKA wrapper.
Addition of DAML+OIL exporter.
Integration of WordNet (see Section 23.16).
The syntax tree viewer has been updated to fix some bugs.
A.33 June 2002
Conditional versions of the controllers are now available (see Section 3.8.2). These allow processing resources to be run conditionally on document features.
PostgreSQL Datastores are now supported.
These store data into a PostgreSQL RDBMS.
(As of Version 5.1 PostgreSQL is no longer supported.)
Addition of OntoGazetteer (see Section 13.3), an interface which makes ontologies visible within GATE Developer, and supports basic methods for hierarchy management and traversal.
Integration of Protégé, so that people with developed Protégé ontologies can use them within GATE.
Addition of IR facilities in GATE (see Section 23.15).
Modification of the corpus benchmark tool (see Section 10.4.3), which now takes an application as a parameter.
See also for details of other recent bug fixes.
2 Existing saved applications using the controller parameter will still work provided the controller in question implements the LanguageAnalyser interface. The CorpusController implementations supplied as standard with GATE all implement this interface.