Appendix A
Change Log

This chapter lists major changes to GATE (currently Developer and Embedded only) in roughly chronological order by release. Changes in the documentation are also referenced here.

A.1 Next Release

A.1.1 April 2014

GATE now requires Java 7. If you are stuck on Java 6 then you will need to continue to use GATE 7.1.

The Lucene-specific implementation of the Information Retrieval plugin (Section 22.17) has been moved out of the core and into the plugin. This means that Lucene is no longer a core dependency, allowing individual plugins to specify any version of Lucene they may require.

A.1.2 March 2014

A new plugin containing a number of developer-oriented tools has been added to the core distribution. See Section 22.38 for more details.

Fixed a bug in the ANNIC query parser related to the escaping of reserved characters within feature values.

Added a PR to access the TextRazor online annotation service. See section 22.7 for details.

Updated GATE to support Java 8.

A.1.3 February 2014

The Twitter JSON document format and corpus population tool are now documented in the Twitter plugin (Section 22.35).

There has been a fairly extensive clean-out of old deprecated and unsupported code from the main code base. In general this should not cause any problems, as the code has been deprecated for a long time, with sensible replacements provided. The one change that might cause problems is the removal of the annotations parameter from the right-hand side of JAPE rules. If you try loading an old JAPE grammar and it fails with an error along the lines of "Error: annotations cannot be resolved at line X", then you will need to fix the JAPE files before they will load. Either update the code to correctly use the inputAS and outputAS parameters or, if this looks too complex, simply add the following line to the beginning of the right-hand-side code block: AnnotationSet annotations = outputAS;

A.1.4 January 2014

A new plugin that allows for document normalization has been added. This plugin is predominantly aimed at normalizing punctuation symbols (i.e. replacing Word-style apostrophes and hyphens with their ASCII equivalents) to provide a common baseline for further components. See Section 22.37 for further details.
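As a rough illustration of this kind of normalization, the following sketch maps a handful of typographic characters to ASCII. It is not the plugin's code: the class name PunctuationNormalizer and the replacement table are assumptions made for this example, and the real plugin's mappings are described in Section 22.37.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; the replacement table is an assumption, not the plugin's. */
final class PunctuationNormalizer {
    // Hypothetical table of typographic characters and their ASCII equivalents.
    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\u2018', "'");  // left single quotation mark
        REPLACEMENTS.put('\u2019', "'");  // right single quotation mark (Word-style apostrophe)
        REPLACEMENTS.put('\u201C', "\""); // left double quotation mark
        REPLACEMENTS.put('\u201D', "\""); // right double quotation mark
        REPLACEMENTS.put('\u2010', "-");  // Unicode hyphen
        REPLACEMENTS.put('\u2013', "-");  // en dash
    }

    /** Returns a copy of the text with every listed character replaced. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("It\u2019s a well\u2010known \u201Cfact\u201D"));
    }
}
```

Every replacement in this sketch is a single character, so character offsets in the text are unchanged, which matters when annotations already refer to those offsets.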

A.1.5 December 2013

The Relations API (Section 7.7) has been updated as relations are now treated as “first-class citizens” in a similar way to annotations. The major changes are that relation sets are now directly associated with an annotation set and can only contain relations between members of that set or other relations within the same set, and that relations are now handled separately when stored in a GATE XML document rather than being serialized as a document feature. There is currently no support for directly editing relations within GATE Developer (a simple viewer is provided as a tab in the document viewer), so relations can only be created via the API.

The GATE XML version number has been pushed up to 3 (it was previously 2) as the saving of relations into the GATE XML files means they cannot be opened with previous versions of GATE. Documents without relations are still saved in a backwards compatible format, but the change in version number will help with diagnosing bug reports etc.

A new plugin to support processing Bulgarian text has been added. Currently this consists of a PR that integrates the BulStem stemmer using code kindly donated by Ivelina Nikolova. See Section 15.9 for full details.

A.1.6 November 2013

A new plugin, Format_CSV, provides support for populating a corpus from one or more CSV files. See Section 22.34 for details.

There have been a number of changes to the support for Annotation Schemas:

The GATE source code has been restructured slightly to make the separation of the main and test code explicit. This ensures that no code can ever end up relying upon test code accidentally. Changes have also been made to the Eclipse project files to separate the core code from the plugins.

A.1.7 September 2013

A number of changes have been made to the Groovy plugin:

A new plugin that wraps the Stanford Part-of-Speech tagger. See Section 22.26 for details.

A.1.8 August 2013

Added support for Resource Helpers, which can add new features to existing resource types without requiring a Visual Resource and which can also be accessed via the embedded API. See Sections 4.8.2 and 7.20 for details.

Added support for populating a corpus from a MediaWiki XML dump file. Previously loading an XML file containing multiple pages resulted in only the last page being used to create the document. See Section 22.32 for full details.

A new plugin to support the reading and writing of compressed XML files in the Fast Infoset format. This format gives space savings of around 80% when used to store GATE XML documents. For full details see Section 22.33.

New support for processing Russian including a part-of-speech tagger, morphological analyser and a gazetteer. See Section 15.8 for more details.

A.1.9 March 2013

Fixed a bug which caused the 7.1 version of the OntoRootGazetteer to produce no Lookup annotations in its default configuration.

A.2 Version 7.1 (November 2012)

A.2.1 New plugins

The TermRaider plugin (see Section 22.36) provides a toolkit and sample application for term extraction.

Two new plugins, Tagger_Zemanta (see Section 22.5) and Tagger_Lupedia (see Section 22.6) provide PRs that wrap online annotation services provided by Zemanta and Ontotext.

A new plugin named Coref_Tools includes a framework for fast co-reference processing, and one PR that performs orthographical co-reference in the style of the ANNIE Orthomatcher. See Section 22.30 for full details.

A new Configurable Exporter PR in the Tools plugin, allowing annotations and features to be exported in formats specified by the user (e.g. for use with external machine learning tools). See Section 22.14 for details.

Support for reading a number of new document formats has also been added:

In addition, “ready-made applications” have been added to many existing plugins (notably the Lang_* non-English language plugins) to make it easier to experiment with their PRs.

A.2.2 Library updates

Updated the Stanford Parser plugin (see Section 17.4) to version 2.0.4 of the parser itself, and added run-time parameters to the PR to control the parser’s dependency options.

The Measurement and Number taggers have been upgraded to use JAPE+ instead of JAPE. This should result in faster processing, and also allows for more memory efficient duplication of PR instances, i.e. when a pool of applications is created.

The OpenNLP plugin has been completely revised to use Apache OpenNLP 1.5.2 and the corresponding set of models. See Section 22.25 for details.

The native launcher for GATE on Mac OS X now works with Oracle Java 7 as well as Apple Java 6.

A.2.3 GATE Embedded API changes

Some of the most significant changes in this version are “under the bonnet” in GATE Embedded:

And numerous smaller bug fixes and performance improvements…

A.3 Version 7.0 (February 2012)

A.3.1 Major new features

The CREOLE Plugin Manager has been completely re-written and now includes support for installing new plugins from remote update sites. See sections 3.6 and 12.3.5 for more details. In addition, plugins can now contribute additional “ready-made applications” to the GATE Developer menus alongside the standard applications (ANNIE, etc.). Details can be found in section 12.3.4.

A new plugin named JAPE_Plus has been added. It contains a new JAPE execution engine that includes various optimisations and should be significantly faster than the standard engine. JAPE_Plus has not yet been comprehensively tested, so it should be considered beta software, and used with caution. See Section 8.11 for more details.

A new Java-based launcher has been implemented which now replaces the use of Apache ANT for starting-up GATE Developer. The GATE Developer application now behaves in a more natural way in dock-based desktop environments such as Mac OS X and Ubuntu Unity.

Improved the support for processing biomedical text by adding new PRs to incorporate the following tools: AbGene, the NormaGene tagger, the GENIA sentence splitter, MutationFinder and the Penn BioTagger (contains a tokenizer and three taggers for gene, malignancy and variation). For full details of these new resources see section 16.1.

The Flexible Gazetteer PR has been rewritten to provide a better and faster implementation. The two parameters inputAnnotationSetName and outputAnnotationSetName have been renamed to inputASName and outputASName; however, old applications with the old parameters should still work. Please see Section 13.6 for more details.

A.3.2 Removal of deprecated functionality

Various components were removed in this release as they have been unsupported and deprecated in previous releases:

In addition the Web_Search_Google, Web_Search_Yahoo and Web_Translate_Google plugins have been removed as the underlying web services on which they depend are no longer available. Documentation for obsolete plugins can be found in appendix C, and if you require any of them for your application please see plugins/Obsolete/README.TXT in the GATE Developer distribution.

A.3.3 Other enhancements and bug fixes

CREOLE plugins can now use Apache Ivy to include third-party dependencies. See section 4.7.4 for details.

The Default ANNIE Gazetteer now allows a user to specify different annotation types to be used for annotating entries from different lists. For example, a user may want to find city names mentioned in a gazetteer list (e.g. city.lst) and annotate the matching strings as City. Please see section 6.3 for more details.

The Segment Processing PR has two additional run-time parameters called segmentAnnotationFeatureName and segmentAnnotationFeatureValue. These parameters allow users to specify a constraint on feature name and feature value. If the user has provided values for these parameters, only the annotations with the specified feature name and feature value are processed with the Segment Processing PR. Also, the parameter controller has been renamed to analyser, which means the Segment Processing PR can now also run an individual PR on the specified segments. See 19.2.10 for more information on section-by-section processing.
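The effect of the feature name and value constraint can be pictured with the sketch below. It uses plain Java collections rather than the GATE API: the Segment class and the select method are hypothetical stand-ins, and treating a missing constraint as "process every segment" is an assumption based on the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these names come from the GATE API. */
final class SegmentFilter {
    /** Hypothetical stand-in for a segment annotation: a span plus a feature map. */
    static final class Segment {
        final long start;
        final long end;
        final Map<String, Object> features;

        Segment(long start, long end, Map<String, Object> features) {
            this.start = start;
            this.end = end;
            this.features = new HashMap<>(features);
        }
    }

    /**
     * Returns the segments that would be processed. When either constraint is
     * null, every segment is returned (assumption); otherwise only segments
     * whose named feature has the given value are kept.
     */
    static List<Segment> select(List<Segment> segments, String featureName, String featureValue) {
        if (featureName == null || featureValue == null) {
            return new ArrayList<>(segments);
        }
        List<Segment> selected = new ArrayList<>();
        for (Segment segment : segments) {
            Object value = segment.features.get(featureName);
            if (value != null && featureValue.equals(value.toString())) {
                selected.add(segment);
            }
        }
        return selected;
    }
}
```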

The Hash Gazetteer (section 13.5) now properly supports the caseSensitive parameter (previously the parameter could be set but had no effect).
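To show what a case-sensitivity switch means for a hash-based lookup, here is a minimal sketch. It is not the Hash Gazetteer's implementation: SimpleHashGazetteer and its methods are hypothetical names, and folding keys to lower case is just one assumed way of getting case-insensitive matching.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; not the Hash Gazetteer's actual data structure. */
final class SimpleHashGazetteer {
    private final boolean caseSensitive;
    private final Map<String, String> entries = new HashMap<>();

    SimpleHashGazetteer(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Keys are folded to lower case when matching is case-insensitive (assumed strategy).
    private String key(String text) {
        return caseSensitive ? text : text.toLowerCase(Locale.ROOT);
    }

    /** Adds a list entry together with the type it should be looked up as. */
    void add(String entry, String type) {
        entries.put(key(entry), type);
    }

    /** Returns the type of an exactly matching entry, or null if there is none. */
    String lookup(String text) {
        return entries.get(key(text));
    }
}
```

With caseSensitive set to false, an entry added as "London" is also found for "LONDON"; with it set to true, only the exact spelling matches.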

The Document Reset PR (Section 6.1) now defaults to keeping the Key set as well as Original markups. This makes working with pre-annotated gold standard documents less dangerous (assuming you put the gold standard annotations in a set called Key).

Updated Stanford Parser plugin (see Section 17.4) to version 1.6.8.

The TextCat based Language Identification PR now supports generating new language fingerprints. See section 15.1 for full details.

Added support for reading XCAS and XMI-format documents created by UIMA. See section 5.5.9 for details.

Various improvements to the GATE Developer GUI:

The rule and phase names are now accessible in a JAPE Java RHS by the ruleName() and phaseName() methods and the name of the JAPE processing resource executing the JAPE transducer is accessible through the action context getPRName() method. See section 8.6.5.

A.4 Version 6.1 (April 2011)

A.4.1 New CREOLE Plugins

Tagger_Numbers to annotate many kinds of numbers in documents and determine their numeric values. The tagger can annotate numbers expressed in many forms including Arabic and Roman numerals, words (in English, French, German and Spanish) and scientific notation (4.3e6 = 4300000). See section 22.8 for full details.
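The idea of mapping different surface forms to one numeric value can be sketched in plain Java as follows. This is not how Tagger_Numbers is implemented: the class and method names are invented for the example, only Roman numerals and scientific notation are covered, and the Roman numeral routine does not validate that its input is well formed.

```java
import java.math.BigDecimal;

/** Illustrative sketch only; unrelated to the plugin's own code. */
final class NumberValues {
    /** Value of a single Roman numeral symbol. */
    private static int symbolValue(char symbol) {
        switch (Character.toUpperCase(symbol)) {
            case 'I': return 1;
            case 'V': return 5;
            case 'X': return 10;
            case 'L': return 50;
            case 'C': return 100;
            case 'D': return 500;
            case 'M': return 1000;
            default: throw new IllegalArgumentException("Not a Roman numeral symbol: " + symbol);
        }
    }

    /** Converts a Roman numeral by scanning right to left, subtracting smaller symbols. */
    static long romanToLong(String roman) {
        long total = 0;
        int previous = 0;
        for (int i = roman.length() - 1; i >= 0; i--) {
            int value = symbolValue(roman.charAt(i));
            total += (value < previous) ? -value : value;
            previous = value;
        }
        return total;
    }

    /** Parses scientific notation such as "4.3e6" into an exact decimal value. */
    static BigDecimal parseScientific(String text) {
        return new BigDecimal(text);
    }

    public static void main(String[] args) {
        System.out.println(romanToLong("MCMXCIV"));
        System.out.println(parseScientific("4.3e6").toPlainString());
    }
}
```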

Tagger_Measurements to annotate many different forms of measurement expressions (“5.5 metres”, “1 minute 30 seconds”, “10 to 15 pounds”, etc.) along with their normalized values in SI units. See section 22.9 for full details.
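As a small worked example of normalization, a compound expression such as "1 minute 30 seconds" corresponds to 90 seconds. The sketch below handles only time expressions and is not taken from the plugin: the unit table, the regular expression and the class name are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; the unit table below is hypothetical and deliberately tiny. */
final class MeasurementNormalizer {
    // Conversion factors from each unit name to seconds.
    private static final Map<String, Double> TO_SECONDS = new HashMap<>();
    static {
        TO_SECONDS.put("second", 1.0);
        TO_SECONDS.put("seconds", 1.0);
        TO_SECONDS.put("minute", 60.0);
        TO_SECONDS.put("minutes", 60.0);
        TO_SECONDS.put("hour", 3600.0);
        TO_SECONDS.put("hours", 3600.0);
    }

    // A number followed by a unit name, e.g. "30 seconds".
    private static final Pattern PART = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*([a-z]+)");

    /** Sums every "number unit" pair in the expression, converted to seconds. */
    static double toSeconds(String expression) {
        Matcher matcher = PART.matcher(expression.toLowerCase(Locale.ROOT));
        double total = 0.0;
        boolean found = false;
        while (matcher.find()) {
            Double factor = TO_SECONDS.get(matcher.group(2));
            if (factor == null) {
                throw new IllegalArgumentException("Unknown unit: " + matcher.group(2));
            }
            total += Double.parseDouble(matcher.group(1)) * factor;
            found = true;
        }
        if (!found) {
            throw new IllegalArgumentException("No measurement found in: " + expression);
        }
        return total;
    }
}
```

Range expressions such as "10 to 15 pounds" are outside the scope of this sketch and are rejected with an exception.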

Tagger_Boilerpipe, which contains a boilerpipe-based PR for performing content detection. See section 22.27 for full details.

Tagger_DateNormalizer to annotate and normalize dates within a document. See section 22.10 for full details.

Schema_Tools providing a “Schema Enforcer” PR that can be used to create a clean output annotation set based on a set of annotation schemas. See section 22.16 for full details.

Teamware_Tools providing a new PR called QA Summariser for Teamware. When documents are annotated using GATE Teamware, this PR can be used for generating a summary of agreements among annotators. See section 10.7 for full details.

Tagger_MetaMap has been rewritten to make use of the new MetaMap Java API features. There are numerous performance enhancements and bug fixes detailed in section 16.1.2. Note that this version of the plugin is not compatible with the version provided in GATE 6.0, though this earlier version is still available in the Obsolete directory if required.

A.4.2 Other new features and improvements

Added support for handling controller events to JAPE by making it possible to define ControllerStarted, ControllerFinished, and ControllerAborted code blocks in a JAPE file (see section 8.6.5).

JAPE Java right-hand-side code can now access an ActionContext object through the predefined field ctx which allows access to the corpus LR and the transducer PR and their features (see section 8.6.5).

Three new optional attributes can be specified in <GATECONFIG> element of gate.xml or local configuration file:

Setting these attributes will alter GATE’s default namespace deserialization behaviour to remove the namespace prefix and add it as a feature, along with the namespace URI. This allows namespace-prefixed elements in the Original markups annotation set to be matched with JAPE expressions, and also allows namespace scope to be added to new annotations when serialized to XML. See 5.5.2 for details.

Searchable Serial Datastores (Lucene-based) are now portable and can be moved across different systems. Also, several GUI improvements have been made to ease the creation of Lucene datastores. See chapter 9 for details.

The populate method that allowed populating a corpus from a trecweb file has been made more generic to accept a tag. The method extracts content between the start and end of this tag to create new documents. In GATE Developer, right-clicking on an instance of the Corpus and choosing the option “Populate from Single Concatenated File” allows users to populate the corpus using this functionality. See Section 7.4.5 for more details.
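The splitting step described above can be sketched as follows. This is not the populate method itself (see Section 7.4.5 for that): the class name is hypothetical, and the sketch assumes that the tag carries no attributes and that the extracted text excludes the tags themselves.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the GATE populate implementation. */
final class ConcatenatedFileSplitter {
    /**
     * Returns the text found between each opening and closing occurrence of
     * the given tag, in document order. Assumes tags have no attributes.
     */
    static List<String> split(String content, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> documents = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = content.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            int end = content.indexOf(close, start + open.length());
            if (end < 0) {
                break; // unterminated element: the remainder is ignored
            }
            documents.add(content.substring(start + open.length(), end));
            from = end + close.length();
        }
        return documents;
    }
}
```

Each returned string would then become the content of one new document in the corpus.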

Fixed a regression in the JAPE parser that prevented the use of RHS macros that refer to a LHS label (named blocks :label { ... } and assignments :label.Type = {}).

Enhanced the Groovy scriptable controller with some features inspired by the realtime controller, in particular the ability to ignore exceptions thrown by PRs and the ability to limit the running time of certain PRs. See section 7.17.3 for details.

The Ontology and Gazetteer_LKB plugins have been upgraded to use Sesame 3.2.3 and OWLIM 3.5.

The Websphinx Crawler PR (section 22.18) has new runtime parameters for controlling the maximum page size and spoofing the user-agent.

A few bug fixes and improvements to the “recover” logic of the packagegapp Ant task (see section E.2).

…and many other smaller bugfixes.

Note: As of version 6.1, GATE Developer and Embedded require Java 6 or later and will no longer run on Java 5. If you require Java 5 compatibility you should use GATE 6.0.

A.5 Version 6.0 (November 2010)

A.5.1 Major new features

Added an annotation tool for the document editor: the Relation Annotation Tool (RAT). It is designed to annotate a document with ontology instances and to create relations between annotations with ontology object properties. It is similar to and compatible with the Ontology Annotation Tool (OAT) but focuses on relations between annotations. See section 14.7 for details.

Added a new scriptable controller to the Groovy plugin, whose execution strategy is controlled by a simple Groovy DSL. This supports more powerful conditional execution than is possible with the standard conditional controllers (for example, based on the presence or absence of a particular annotation, or a combination of several document feature values), rich flow control using Groovy loops, etc. See section 7.17.3 for details.

A new version of the Alignment Editor has been added to the GATE distribution. It includes several new features such as the new alignment viewer, the ability to create alignment tasks and store them in XML files, three different views to align the text (links view and matrix view, suitable for character, word and phrase alignments; parallel view, suitable for sentence or long text alignment), an alignment exporter and many more. See chapter 19 for more information.

MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus. The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features. See section 16.1.2 for details.

A new plugin called Web_Translate_Google has been added with a PR called Google Translator PR in it. It allows users to translate text using the Google translation services. See section C.5 for more information.

New Gazetteer Editor for ANNIE Gazetteer that can be used instead of Gaze. It uses tables instead of a text area to display the gazetteer definition and lists, allows sorting on any column, filtering of the lists, reloading a list, etc. See section 13.2.2.

A.5.2 Breaking changes

This release contains a few small changes that are not backwards-compatible:

A.5.3 Other new features and bugfixes

The concept of templates has been introduced to JAPE. This is a way to declare named “variables” in a JAPE grammar that can contain placeholders that are filled in when the template is referenced. See section 8.1.6 for full details.

Added a JAPE operator to get the string covered by a left-hand-side label and assign it to a feature of a new annotation on the right hand side (see section 8.1.3).

Added a new API to the CREOLE registry to permit plugins that live entirely on the classpath. CreoleRegister.registerComponent instructs the registry to scan a single Java class for annotations, adding it to the set of registered plugins. See section 7.3 for details.

Maven artifacts for GATE are now published to the central Maven repository. See section 2.6.1 for details.

Bugfix: DocumentImpl no longer changes its stringContent parameter value whenever the document’s content changes. Among other things, this means that saved application states will no longer contain the full text of the documents in their corpus, and documents containing XML or HTML tags that were originally created from string content (rather than a URL) can now safely be stored in saved application states and the GATE Developer saved session.

A processing resource called Quality Assurance PR has been added in the Tools plugin. The PR wraps the functionality of the Quality Assurance Tool (section 10.3).

A new section for using the Corpus Quality Assurance from GATE Embedded has been written. See section 10.3.

The Generic Tagger PR (in the Tagger_Framework plugin) now allows more flexible specification of the input to the tagger, and is no longer limited to passing just the “string” feature from the input annotations. See section 22.3 for details.

Added new parameters and options to the LingPipe Language Identifier PR (section 22.24.5), and corrected the documentation for the LingPipe POS Tagger (section 22.24.3).

In the document editor, fixed several exceptions so that text can be edited while annotations are highlighted. You should now be able to edit the text and have the annotations behave correctly, that is, move, expand or disappear according to the text insertions and deletions.

Options for the document editor: the read-only and insert append/prepend options have been moved from the options dialogue to the document editor toolbar, in the menu displayed by the triangle icon at the top right. See section 3.2.

Added new parameters and options to the Crawl PR and document features to its output; see section 22.18 for details.

Fixed a bug where ontology-aware JAPE rules worked correctly when the target annotation’s class was a subclass of the class specified in the rule, but failed when the two class names matched exactly.

Improved support for conditional pipelines containing non-LanguageAnalyser processing resources.

Added the current Corpus to the script binding for the Groovy Script PR, allowing a Groovy script to access and set corpus-level features. Also added callbacks that a Groovy script can implement to do additional pre- or post-processing before the first and after the last document in a corpus. See section 7.17 for details.

A.6 Version 5.2.1 (May 2010)

This is a bugfix release to resolve several bugs that were reported shortly after the release of version 5.2:

This release also fixes some shortcomings in the Groovy support added by 5.2, in particular:

A.7 Version 5.2 (April 2010)

A.7.1 JAPE and JAPE-related

Introduced a utility class gate.Utils containing static utility methods for frequently-used idioms such as getting the string covered by an annotation, finding the start and end offsets of annotations and sets, etc. This class is particularly useful on the right hand side of JAPE rules (section 8.6.5).

Added type parameters to the bindings map available on the RHS of JAPE rules, so you can now do AnnotationSet as = bindings.get("label") without a cast (see section 8.6.5).

Fixed a bug with JAPE’s handling of features called “class” in non-ontology-aware mode. Previously JAPE would always match such features using an equality test, even if a different operator was used in the grammar, i.e. {SomeType.class != "foo"} was matched as {SomeType.class == "foo"}. The correct operator is now used. Note that this does not affect the ontology-aware behaviour: when an ontology parameter is specified, “class” features are always matched using ontology subsumption.

Custom JAPE operators and annotation accessors can now be loaded from plugins as well as from the lib directory (see section 8.2.5).

A.7.2 Other Changes

Added a mechanism to allow plugins to contribute menu items to the “Tools” menu in GATE Developer. See section 4.8 for details.

Enhanced Groovy support in GATE: the Groovy console and Groovy Script PR (in the Groovy plugin) now import many GATE classes by default, and a number of utility methods are mixed in to some of the core GATE API classes to make them more natural to use in Groovy. See section 7.17 for details.

Modified the batch learning PR (in the Learning plugin) to make it safe to use several instances in APPLICATION mode with the same configuration file and the same learned model at the same time (e.g. in a multithreaded application). The other modes (including training and evaluation) are unchanged, and thus are still not safe for use in this way. Also fixed a bug that prevented APPLICATION mode from working anywhere other than as the last PR in a pipeline when running over a corpus in a datastore.

Introduced a simple way to create duplicate copies of an existing resource instance, with a way for individual resource types to override the default duplication algorithm if they know a better way to deal with duplicating themselves. See section 7.8.

Enhanced the Spring support in GATE to provide easy access to the new duplication API, and to simplify the configuration of the built-in Spring pooling mechanisms when writing multi-threaded Spring-based applications. See section 7.15.

The GAPP packager Ant task now respects the ordering of mapping hints, with earlier hints taking precedence over later ones (see section E.2.3).

Bug fix in the UIMA plugin from Roland Cornelissen - AnalysisEnginePR now properly shuts down the wrapped AnalysisEngine when the PR is deleted.

Patch from Matt Nathan to allow several instances of a gazetteer PR in an embedded application to share a single copy of their internal data structures, saving considerable memory compared to loading several complete copies of the same gazetteer lists (see section 13.10).

In the corpus quality assurance, measures for classification tasks have been added. You can also now set the beta for the fscore. This tool has been optimised to work with datastores so that it doesn’t need to read all the documents before comparing them.
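For reference, the role of beta can be seen in the standard weighted F-measure, F = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below computes it in plain Java. The class and method names are made up for this example and are not the quality assurance tool's API, and returning 0 when both inputs are 0 is a convention chosen here.

```java
/** Illustrative sketch only; not the corpus quality assurance code. */
final class FMeasure {
    /**
     * Weighted F-measure. A beta above 1 gives more weight to recall, a beta
     * below 1 gives more weight to precision, and beta = 1 gives the
     * harmonic mean of the two.
     */
    static double fScore(double precision, double recall, double beta) {
        double betaSquared = beta * beta;
        double denominator = betaSquared * precision + recall;
        if (denominator == 0.0) {
            return 0.0; // convention used in this sketch when precision and recall are both 0
        }
        return (1.0 + betaSquared) * precision * recall / denominator;
    }

    public static void main(String[] args) {
        System.out.println(fScore(0.8, 0.5, 1.0));
        System.out.println(fScore(0.8, 0.5, 2.0));
    }
}
```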

A.8 Version 5.1 (December 2009)

Version 5.1 is a major increment with lots of new features and integration of a number of important systems from 3rd parties (e.g. LingPipe, OpenNLP, OpenCalais, a revised UIMA connector). We’ve stuck with the 5 series (instead of jumping to 6.0) because the core remains stable and backwards compatible.

Other highlights include:

A.8.1 New Features

LingPipe Support

LingPipe is a suite of Java libraries for the linguistic analysis of human language. We have provided a plugin called ‘LingPipe’ with wrappers for some of the resources available in the LingPipe library. For more details, see section 22.24.

OpenNLP Support

OpenNLP provides tools for sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference. The tools use Maximum Entropy modelling. We have provided a plugin called ‘OpenNLP’ with wrappers for some of the resources available in the OpenNLP Tools library. For more details, see section 22.25.

OpenCalais Support

We added a new PR called ‘OpenCalais PR’. This will process a document through the OpenCalais service, and add OpenCalais entity annotations to the document. For more details, see Section 22.23.

Ontology API

The ontology API (package gate.creole.ontology) has been changed. The existing ontology implementation based on Sesame1 and OWLIM2 (package gate.creole.ontology.owlim) has been moved into the plugin Ontology_OWLIM2. An upgraded implementation based on Sesame2 and OWLIM3 that also provides a number of new features has been added as plugin Ontology. See Section 14.13 for a detailed description of all changes.

Benchmarking Improvements

A number of improvements have been made to the benchmarking support in GATE. JAPE transducers now log the time spent in individual phases of a multi-phase grammar and by individual rules within each phase. Other PRs that use JAPE grammars internally (the pronominal coreferencer, English tokeniser) log the time taken by their internal transducers. A reporting tool called ‘Profiling Reports’, under the ‘Tools’ menu, makes summary information easily available. For more details, see chapter 11.

GUI improvements

To deal with quality assurance of annotations, one component has been updated and two new components have been added. The annotation diff tool has a new mode to copy annotations to a consensus set, see section 10.2.1. An annotation stack view has been added in the document editor and it allows annotations to be copied to a consensus set, see section 3.4.3. A corpus view has been added for all corpora to get statistics like precision, recall and F-measure, see section 10.3.

An annotation stack view has been added in the document editor to make it easier to see overlapping annotations, see section 3.4.3.

ABNER Support

ABNER is A Biomedical Named Entity Recogniser, for finding entities such as genes in text. We have provided a plugin called ‘AbnerTagger’ with a wrapper for ABNER. For more details, see section 16.1.1.

Generic Tagger Support

A new plugin has been added to provide an easy route to integrate taggers with GATE. The Tagger_Framework plugin provides examples of incorporating a number of external taggers which should serve as a starting point for using other taggers. See Section 22.3 for more details.

Section-by-Section Processing

We have added a new PR called ‘Segment Processing PR’. As the name suggests, this PR allows processing individual segments of a document independently of one another. For more details, see section 19.2.10.

Application Composition

The gate.Controller implementations provided with the main GATE distribution now also implement the gate.ProcessingResource interface. This means that an application can now contain another application as one of its components.
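The design idea here is the composite pattern: a container of components is itself a component. The sketch below illustrates the pattern with hypothetical Processor and Pipeline types. They are stand-ins invented for this example and do not reproduce the gate.ProcessingResource or gate.Controller interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; Processor and Pipeline are not GATE types. */
final class CompositionSketch {
    /** Hypothetical stand-in for a processing resource. */
    interface Processor {
        void execute(List<String> trace);
    }

    /** Hypothetical stand-in for a controller that is itself a Processor. */
    static final class Pipeline implements Processor {
        private final List<Processor> components = new ArrayList<>();

        Pipeline add(Processor component) {
            components.add(component);
            return this;
        }

        @Override
        public void execute(List<String> trace) {
            // Run each component in order; a component may be another Pipeline.
            for (Processor component : components) {
                component.execute(trace);
            }
        }
    }

    /** A leaf component that records its name when it runs. */
    static Processor step(String name) {
        return trace -> trace.add(name);
    }

    /** Builds an application that contains another application and runs it. */
    static List<String> demo() {
        Pipeline inner = new Pipeline().add(step("tokeniser")).add(step("splitter"));
        Pipeline outer = new Pipeline().add(inner).add(step("tagger"));
        List<String> trace = new ArrayList<>();
        outer.execute(trace);
        return trace;
    }
}
```

Because the inner pipeline is added like any other component, the outer pipeline needs no special handling for nested applications.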

Groovy Support

Groovy is a dynamic programming language based on Java. You can now use it as a scripting language for GATE, via the Groovy Console. For more details, see Section 7.17.

A.8.2 JAPE improvements

GATE now produces a warning when any Java right-hand-sides in JAPE rules make use of the deprecated annotations parameter. All bundled JAPE grammars have been updated to use the replacement inputAS and outputAS parameters as appropriate.

The new Imports: statement at the beginning of a JAPE grammar file can now be used to make additional Java import statements available to the Java RHS code, see 8.6.5.

The JAPE debugger has been removed. Debugging of JAPE has been made easier as stack traces now refer to the JAPE source file and line numbers instead of the generated Java source code.

The Montreal Transducer has been made obsolete.

A.8.3 Other improvements and bug fixes

Plugin names have been rationalised. Mappings exist so that existing applications will continue to work, but the new names should be used in the future. Plugin name mappings are given in Appendix B. Also, the Segmenter_Chinese plugin (formerly known as the chineseSegmenter plugin) is now part of the Lang_Chinese plugin.

The User Guide has been amalgamated with the Programmer’s Guide; all material can now be found in the User Guide. The ‘How-To’ chapter has been converted into separate chapters for installation, GATE Developer and GATE Embedded. Other material has been relocated to the appropriate specialist chapter.

Made Mac OS launcher 64-bit compatible. See section 2.2.1 for details.

The UIMA integration layer (Chapter 21) has been upgraded to work with Apache UIMA 2.2.2.

Oracle and PostgreSQL are no longer supported.

The MIAKT Natural Language Generation plugin has been removed.

The Minorthird plugin has been removed. Minorthird has changed significantly since this plugin was written. We will consider writing an up-to-date Minorthird plugin in the future.

A new gazetteer, Large KB Gazetteer (in the plugin ‘Gazetteer_LKB’) has been added, see Section 13.9 for details.

gate.creole.tokeniser.chinesetokeniser.ChineseTokeniser and related resources under the plugins/ANNIE/tokeniser/chinesetokeniser folder have been removed. Please refer to the Lang_Chinese plugin for resources related to the Chinese language in GATE.

Added an isInitialised() method to gate.Gate.

Added a parameter to the chemistry tagger PR (section 22.4) to allow it to operate on annotation sets other than the default one.

Plus many more smaller bugfixes...

A.9 Version 5.0 (May 2009)

Note: existing users – if you delete your user configuration file for any reason you will find that GATE Developer no longer loads the ANNIE plugin by default. You will need to manually select ‘load always’ in the plugin manager to get the old behaviour.

A.9.1 Major New Features

JAPE Language Improvements

Several new extensions to the JAPE language to support more flexible pattern matching. Full details are in Chapter 8 but briefly:

Some of these extensions are similar to, but not the same as, those provided by the Montreal Transducer plugin. If you are already familiar with the Montreal Transducer, you should first look at Section 8.10 which summarises the differences.

Resource Configuration via Java 5 Annotations

Introduced an alternative style for supplying resource configuration information via Java 5 annotations rather than in creole.xml. The previous approach is still fully supported as well, and the two styles can be freely mixed. See Section 4.7 for full details.
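As a rough illustration of the annotation-driven style, the sketch below declares two stand-in annotations and reads them back by reflection. The names ResourceInfo, ParamInfo, DemoTokeniser and AnnotationConfigSketch are hypothetical and are not GATE's own types (the real annotations are described in Section 4.7). The sketch only shows how metadata that would otherwise sit in an XML descriptor can be attached to the class itself.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a resource-level annotation (not a GATE type).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ResourceInfo {
  String name();
  String comment() default "";
}

// Hypothetical stand-in for a parameter-level annotation placed on setters.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ParamInfo {
  String defaultValue() default "";
}

// A toy resource whose configuration lives on the class itself.
@ResourceInfo(name = "Demo Tokeniser", comment = "Illustration only")
class DemoTokeniser {
  private String encoding;

  @ParamInfo(defaultValue = "UTF-8")
  public void setEncoding(String encoding) { this.encoding = encoding; }

  public String getEncoding() { return encoding; }
}

class AnnotationConfigSketch {
  // Returns the resource name declared on the class, or null if none is present.
  static String resourceName(Class<?> type) {
    ResourceInfo info = type.getAnnotation(ResourceInfo.class);
    return info == null ? null : info.name();
  }

  // Collects "parameter name -> default value" from annotated public setters.
  static Map<String, String> parameterDefaults(Class<?> type) {
    Map<String, String> defaults = new TreeMap<>();
    for (Method m : type.getMethods()) {
      ParamInfo p = m.getAnnotation(ParamInfo.class);
      if (p != null && m.getName().startsWith("set") && m.getName().length() > 3) {
        String raw = m.getName().substring(3);
        String name = Character.toLowerCase(raw.charAt(0)) + raw.substring(1);
        defaults.put(name, p.defaultValue());
      }
    }
    return defaults;
  }

  public static void main(String[] args) {
    System.out.println(resourceName(DemoTokeniser.class));
    System.out.println(parameterDefaults(DemoTokeniser.class));
  }
}
```

Because the metadata sits next to the code it describes, renaming a setter and updating its configuration happen in one place, which is one plausible reason for preferring this style over a separate descriptor file.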

Ontology-Based Gazetteer

Added a new plugin ‘Gazetteer_Ontology_Based’, which contains the OntoRoot Gazetteer – a dynamically created gazetteer which, in combination with a few other generic resources, can produce ontology-aware annotations over the given content with regard to the given ontology. For more details see Section 13.8.

Inter-Annotator Agreement and Merging

New plugins to support tasks involving several annotators working on the same annotation task on the same documents. The plugin ‘Inter_Annotator_Agreement’ (Section 10.5) computes inter-annotator agreement scores between the annotators, the ‘Copy_Annots_Between_Docs’ plugin (Section 22.22) copies annotations from several parallel documents into a single master document, and the ‘Annotation_Merging’ plugin (Section 22.21) merges annotations from multiple annotators into a single ‘consensus’ annotation set.

Packaging Self-Contained Applications for GATE Teamware

Added a mechanism to assemble a saved GATE application along with all the resource files it uses into a single self-contained package to run on another machine (e.g. as a service in GATE Teamware). This is available as a menu option (Section 3.9.4) which will work for most common cases, but for complex cases you can use the underlying Ant task described in Section E.2.

GUI Improvements

A.9.2 Other New Features and Improvements

A.9.3 Specific Bug Fixes

Plus many more minor bug fixes

A.10 Version 4.0 (July 2007) [#]

A.10.1 Major New Features

ANNIC

ANNotations In Context: a full-featured annotation indexing and retrieval system designed to support corpus querying and JAPE rule authoring. It is provided as part of an extension of the Serial Datastores, called Searchable Serial Datastore (SSD). See Chapter 9 for more details.

New Machine Learning API

A brand new machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. See Chapter 18 for more details.

Ontology API

A new ontology API, based on OWL In Memory (OWLIM), which offers a better API, a revised ontology event model and an improved ontology editor, to name but a few. See Chapter 14 for more details.

OCAT

Ontology-based Corpus Annotation Tool to help annotators to manually annotate documents using ontologies. For more details please see Section 14.6.

Alignment Tools

A new set of components (e.g. CompoundDocument, AlignmentEditor etc.) that help in building alignment tools and in carrying out cross-document processing. See Chapter 19 for more details.

New HTML Parser

A new HTML document format parser, based on Andy Clark’s NekoHTML. This parser is much better than the old one at handling modern HTML and XHTML constructs, JavaScript blocks, etc., though the old parser is still available for existing applications that depend on its behaviour.

Java 5.0 Support

GATE now requires Java 5.0 or later to compile and run. This brings a number of benefits:

A.10.2 Other New Features and Improvements

A.10.3 Bug Fixes and Optimizations

And as always there are many smaller bugfixes too numerous to list here...

A.11 Version 3.1 (April 2006)

A.11.1 Major New Features

Support for UIMA

UIMA (http://www.research.ibm.com/UIMA/) is a language processing framework developed by IBM. UIMA and GATE share some functionality but are complementary in most respects. GATE now provides an interoperability layer to allow UIMA applications to include GATE components in their processing and vice-versa. For full information, see Chapter 21.

New Ontology API

The ontology layer has been rewritten in order to provide an abstraction layer between the model representation and the tools used for input and output of the various representation formats. An implementation that uses Jena 2 (http://jena.sourceforge.net/ontology) for reading and writing OWL and RDF(S) is provided.

Ontotext Japec Compiler

Japec is a compiler for JAPE grammars developed by Ontotext Lab. It has some limitations compared to the standard JAPE transducer implementation, but can run JAPE grammars up to five times as fast. By default, GATE still uses the stable JAPE implementation, but if you want to experiment with Japec, see Section C.1.

A.11.2 Other New Features and Improvements

A.11.3 Bug Fixes

A.12 January 2005

Release of version 3.

New plugins for processing in various languages (see Chapter 15). These are not full IE systems but are designed as starting points for further development (French, German, Spanish, etc.), or as sample or toy applications (Cebuano, Hindi, etc.).

Other new plugins:

Support for SVM Light, a support vector machine implementation, has been added to the machine learning plugin ‘Learning’ (see section 18.3.5).

A.13 December 2004

GATE no longer depends on the Sun Java compiler to run, which means it will now work on any Java runtime environment of at least version 1.4. JAPE grammars are now compiled using the Eclipse JDT Java compiler by default.

A welcome side-effect of this change is that it is now much easier to integrate GATE-based processing into web applications in Tomcat. See Section 7.16 for details.

A.14 September 2004

GATE applications are now saved in XML format using the XStream library, rather than by using native Java serialization. On loading an application, GATE will automatically detect whether it is in the old or the new format, and so applications in both formats can be loaded. However, older versions of GATE will be unable to load applications saved in the XML format. (A java.io.StreamCorruptedException: invalid stream header exception will occur.) It is possible to get new versions of GATE to use the old format by setting a flag in the source code. (See the Gate.java file for details.) This change has been made because it allows the details of an application to be viewed and edited in a text editor, which is sometimes easier than loading the application into GATE.
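One way such format detection can work is by inspecting the first bytes of the file: a Java serialization stream always starts with the magic number 0xACED, whereas an XML file starts with ‘<’ (possibly after a byte order mark or whitespace). The sketch below illustrates that idea using only the standard library. It is an assumption about a plausible approach, not GATE's actual loading code, and the class name SavedFormatSniffer is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.nio.charset.StandardCharsets;

class SavedFormatSniffer {
  enum Format { JAVA_SERIALIZATION, XML, UNKNOWN }

  // Classifies saved data by looking at its leading bytes only.
  static Format detect(byte[] head) {
    if (head.length >= 2) {
      // Serialization streams begin with the two-byte magic number 0xACED.
      short magic = (short) (((head[0] & 0xFF) << 8) | (head[1] & 0xFF));
      if (magic == ObjectStreamConstants.STREAM_MAGIC) return Format.JAVA_SERIALIZATION;
    }
    int i = 0;
    // Skip a UTF-8 byte order mark if one is present.
    if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
        && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
      i = 3;
    }
    // Skip leading ASCII whitespace.
    while (i < head.length
        && (head[i] == ' ' || head[i] == '\t' || head[i] == '\r' || head[i] == '\n')) {
      i++;
    }
    return (i < head.length && head[i] == '<') ? Format.XML : Format.UNKNOWN;
  }

  public static void main(String[] args) throws IOException {
    // Produce a genuine serialization stream to show the header being recognised.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject("a serialized value");
    }
    System.out.println(detect(buffer.toByteArray()));
    System.out.println(detect("<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8)));
  }
}
```

The same header check explains the exception mentioned above: a reader that expects a serialization stream and finds XML instead rejects the file as soon as it reads the first bytes.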

A.15 Version 3 Beta 1 (August 2004)

Version 3 incorporates a lot of new functionality and some reorganisation of existing components.

Note that Beta 1 is feature-complete but needs further debugging (please send us bug reports!).

Highlights include: completely rewritten document viewer/editor; extensive ontology support; a new plugin management system; separate .jar files and a Tomcat classloading fix; lots more CREOLE components (and some more to come soon).

Almost all the changes are backwards-compatible; some recent classes have been renamed (particularly the ontologies support classes) and a few events added (see below); datastores created by version 3 will probably not read properly in version 2. If you have problems use the mailing list and we’ll help you fix your code!

The gory details:

A.16 July 2004

GATE documents now fire events when the document content is edited. This was added in order to support the new facility of editing documents from the GUI. This change will break backwards compatibility by requiring all DocumentListener implementations to implement a new method:
public void contentEdited(DocumentEvent e);
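A minimal sketch of what this means for an existing listener follows. So that it compiles on its own, DocumentEvent and DocumentListener are simplified stand-ins declared locally rather than the real gate.event types. The two other method names are our reading of the GATE interface and should be treated as an assumption, and LoggingDocumentListener is a hypothetical example class.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the GATE event type (illustration only).
class DocumentEvent {
  final String description;
  DocumentEvent(String description) { this.description = description; }
}

// Simplified stand-in for the listener interface after the change.
interface DocumentListener {
  void annotationSetAdded(DocumentEvent e);
  void annotationSetRemoved(DocumentEvent e);
  void contentEdited(DocumentEvent e); // the newly required method
}

// An existing implementation needs only one extra method to compile again.
class LoggingDocumentListener implements DocumentListener {
  final List<String> log = new ArrayList<>();

  @Override
  public void annotationSetAdded(DocumentEvent e) { log.add("added: " + e.description); }

  @Override
  public void annotationSetRemoved(DocumentEvent e) { log.add("removed: " + e.description); }

  // New: a listener that does not care about edits could leave this body empty.
  @Override
  public void contentEdited(DocumentEvent e) { log.add("edited: " + e.description); }

  public static void main(String[] args) {
    LoggingDocumentListener listener = new LoggingDocumentListener();
    listener.contentEdited(new DocumentEvent("offsets 0-5 replaced"));
    System.out.println(listener.log);
  }
}
```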

A.17 June 2004

A new algorithm has been implemented for the AnnotationDiff function. A new, more usable, GUI is included, and an ‘Export to HTML’ option added. More details about the AnnotationDiff tool are in Section 10.2.1.

A new build process, based on ANT (http://ant.apache.org/) is now available. The old build process, based on make, is now unsupported. See Section 2.6 for details of the new build process.

A Jape Debugger from Ontos AG has been integrated. The integration is OFF by default; to turn it ON, start GATE Developer with the command line option ‘-j’, and a new menu item for the Jape Debugger GUI will appear in the Tools menu. We are currently awaiting documentation for this.

NOTE! A ClassCastException is thrown if you try to debug a ConditionalCorpusPipeline. The Jape Debugger is designed for Corpus Pipeline only. The Ontos code needs to be changed to allow debugging of ConditionalCorpusPipeline.

A.18 April 2004

There are now two alternative strategies for ontology-aware grammar transduction:

The changes are in:

More information about the ontology-aware transducer can be found in Section 14.10.

A morphological analyser PR has been added. This finds the root and affix values of a token and adds them as features to that token.

A flexible gazetteer PR has been added. This performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. See Section 13.6 for details.

A.19 March 2004

Support was added for the MAXENT machine learning library. (See Section 18.3.4 for details.)

A.20 Version 2.2 – August 2003

Note that GATE 2.2 works with JDK 1.4.0 or above. Version 1.4.2 is recommended, and is the one included with the latest installers.

GATE has been adapted to work with PostgreSQL 7.3. Compatibility with PostgreSQL 7.2 has been preserved.

Note that as of Version 5.1 PostgreSQL is no longer supported.

New library version – Lucene 1.3 (rc1)

A bug in gate.util.Javac has been fixed in order to account for situations when String literals require an encoding different from the platform default.

Temporary .java files used to compile JAPE RHS actions are now saved using UTF-8 and the ‘-encoding UTF-8’ option is passed to the javac compiler.

A custom tools.jar is no longer necessary.

Minor changes have been made to the look and feel of GATE Developer to improve its appearance with JDK 1.4.2.
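The encoding changes noted earlier in this section matter because a source file is only a sequence of bytes: if it is written in one encoding and read in another, non-ASCII string literals are silently corrupted. The sketch below shows the effect using the standard library only. The class Utf8SourceSketch and its methods are hypothetical and do not reproduce the gate.util.Javac code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class Utf8SourceSketch {
  // Saves source text to a temporary .java file as UTF-8, then reads the
  // raw bytes back and decodes them with the charset given in readAs.
  static String saveAndReload(String source, Charset readAs) {
    try {
      Path file = Files.createTempFile("GeneratedAction", ".java");
      try {
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));
        return new String(Files.readAllBytes(file), readAs);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Compiler arguments that state the file encoding explicitly instead of
  // relying on the platform default.
  static List<String> compilerArgs(String fileName) {
    return Arrays.asList("-encoding", "UTF-8", fileName);
  }

  public static void main(String[] args) {
    // The escape \u00e9 is the letter e with an acute accent.
    String source = "class A { String s = \"caf\u00e9\"; }";
    System.out.println(saveAndReload(source, StandardCharsets.UTF_8).equals(source));
    System.out.println(saveAndReload(source, StandardCharsets.ISO_8859_1).equals(source));
    System.out.println(compilerArgs("A.java"));
  }
}
```

Reading the file back as UTF-8 reproduces the literal exactly, while decoding it as ISO-8859-1 does not, which is why the writer and the compiler need to agree on the encoding.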

A.21 Version 2.1 – February 2003

Integration of Machine Learning PR and WEKA wrapper (see Section 18.3).

Addition of DAML+OIL exporter.

Integration of WordNet (see Section 22.19).

The syntax tree viewer has been updated to fix some bugs.

A.22 June 2002

Conditional versions of the controllers are now available (see Section 3.8.2). These allow processing resources to be run conditionally on document features.

PostgreSQL Datastores are now supported.

These store data into a PostgreSQL RDBMS.

(As of Version 5.1 PostgreSQL is no longer supported.)

Addition of OntoGazetteer (see Section 13.3), an interface which makes ontologies visible within GATE Developer, and supports basic methods for hierarchy management and traversal.

Integration of Protégé, so that people with developed Protégé ontologies can use them within GATE.

Addition of IR facilities in GATE (see Section 22.17).

Modification of the corpus benchmark tool (see Section 10.4.3), which now takes an application as a parameter.

See also for details of other recent bug fixes.

1 Existing saved applications using the controller parameter will still work provided the controller in question implements the LanguageAnalyser interface. The CorpusController implementations supplied as standard with GATE all implement this interface.

2 http://code.google.com/p/boilerpipe/