Developing Language Processing
Components with GATE
Version 5 (a User Guide)
For GATE version 6.0-snapshot (development builds)
(built July 27, 2010)
Hamish Cunningham
Diana Maynard
Kalina Bontcheva
Valentin Tablan
Niraj Aswani
Ian Roberts
Genevieve Gorrell
Adam Funk
Angus Roberts
Danica Damljanovic
Thomas Heitz
Mark Greenwood
Horacio Saggion
Johann Petrak
Yaoyong Li
Wim Peters
et al
ⒸThe University of Sheffield 2001-2010
http://gate.ac.uk/
Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Matrixware, the Information Retrieval Facility and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, MUSING, KnowledgeWeb, PrestoSpace, h-TechSight, enIRaF).
Developing Language Processing Components with GATE Version 5.2
Ⓒ2010 The University of Sheffield
Department of Computer Science
Regent Court
211 Portobello
Sheffield
S1 4DP
http://gate.ac.uk
This work is licenced under the Creative Commons Attribution-No Derivative Licence. You are free to copy, distribute, display, and perform the work under the following conditions:
With the understanding that:
For more information about the Creative Commons Attribution-No Derivative License, please visit this web address: http://creativecommons.org/licenses/by-nd/2.0/uk/
ISBN: TBA
Software documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing. (Anonymous.)
There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies; the other way is to make it so complicated that there are no obvious deficiencies. (C.A.R. Hoare)
A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. (The Structure and Interpretation of Computer Programs, H. Abelson, G. Sussman and J. Sussman, 1985.)
If you try to make something beautiful, it is often ugly. If you try to make something useful, it is often beautiful. (Oscar Wilde)1
GATE2 is an infrastructure for developing and deploying software components that process human language. It is nearly 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from €multi-million research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents3.
GATE is open source free software; users can obtain free support from the user and developer community via GATE.ac.uk or on a commercial basis from our industrial partners. We are the biggest open source language processing project with a development team more than double the size of the largest comparable projects (many of which are integrated with GATE4). More than €5 million has been invested in GATE development5; our objective is to make sure that this continues to be money well spent for all GATE’s users.
GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE is:
We also develop:
For more information see the family pages.
One of our original motivations was to remove the necessity for solving common engineering problems before doing useful research, or re-engineering before deploying research results into applications. Core functions of GATE take care of the lion’s share of the engineering:
On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).
GATE version 1 was written in the mid-1990s; at the turn of the new millennium we completely rewrote the system in Java; version 5 was released in June 2009. We believe that GATE is the leading system of its type, but as scientists we have to advise you not to take our word for it; that’s why we’ve measured our software in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE, DUC and more; see Section 1.4 for details). We invite you to give it a try, to get involved with the GATE community, and to contribute to human language science, engineering and development.
This book describes how to use GATE to develop language processing components, test their performance and deploy them as parts of other applications. In the rest of this chapter:
Note: if you don’t see the component you need in this document, or if we mention a component that you can’t see in the software, contact gate-users@lists.sourceforge.net7 – various components are developed by our collaborators, who we will be happy to put you in contact with. (Often the process of getting a new component is as simple as typing the URL into GATE Developer; the system will do the rest.)
The material presented in this book ranges from the conceptual (e.g. ‘what is software architecture?’) to practical instructions for programmers (e.g. how to deal with GATE exceptions) and linguists (e.g. how to write a pattern grammar). Furthermore, GATE’s highly extensible nature means that new functionality is constantly being added in the form of new plugins. Important functionality is as likely to be located in a plugin as it is to be integrated into the GATE core. This presents something of an organisational challenge. Our (no doubt imperfect) solution is to divide this book into three parts. Part I covers installation, using the GATE Developer GUI and using ANNIE, as well as providing some background and theory. We recommend the new user to begin with Part I. Part II covers the more advanced of the core GATE functionality; the GATE Embedded API and JAPE pattern language among other things. Part III provides a reference for the numerous plugins that have been created for GATE. Although ANNIE provides a good starting point, the user will soon wish to explore other resources, and so will need to consult this part of the text. We recommend that Part III be used as a reference, to be dipped into as necessary. In Part III, plugins are grouped into broad areas of functionality.
GATE can be thought of as a Software Architecture for Language Engineering [Cunningham 00].
‘Software Architecture’ is used rather loosely here to mean computer infrastructure for software development, including development environments and frameworks, as well as the more usual use of the term to denote a macro-level organisational structure for software systems [Shaw & Garlan 96].
Language Engineering (LE) may be defined as:
…the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice. [Cunningham 99a]
The relevant scientific results in this case are the outputs of Computational Linguistics, Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines, LE, as an engineering discipline, entails predictability, both of the process of constructing LE-based software and of the performance of that software after its completion and deployment in applications.
Some working definitions:
(Of course the practice of these fields is broader and more complex than these definitions.)
In the scientific endeavours of NLP and CL, GATE’s role is to support experimentation. In this context GATE’s significant features include support for automated measurement (see Chapter 10), providing a ‘level playing field’ where results can easily be repeated across different sites and environments, and reducing research overheads in various ways.
GATE as an architecture suggests that the elements of software systems that process natural language can usefully be broken down into various types of component, known as resources8. Components are reusable software chunks with well-defined interfaces, and are a popular architectural form, used in Sun’s Java Beans and Microsoft’s .Net, for example. GATE components are specialised types of Java Bean, and come in three flavours:
These definitions can be blurred in practice as necessary.
Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering. All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data. The JAR and XML files are made available to GATE by putting them on a web server, or simply placing them in the local file space. Section 1.3.2 introduces GATE’s built-in resource set.
When using GATE to develop language processing functionality for an application, the developer uses GATE Developer and GATE Embedded to construct resources of the three types. This may involve programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both. GATE Developer is used for visualisation of the data structures produced and consumed during processing, and for debugging, performance measurement and so on. For example, figure 1.1 is a screenshot of one of the visualisation tools.
GATE Developer is analogous to systems like Mathematica for Mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.
When an appropriate set of resources have been developed, they can then be embedded in the target client application using GATE Embedded. GATE Embedded is supplied as a series of JAR files.9 To embed GATE-based language processing facilities in an application, these JAR files are all that is needed, along with JAR files and XML configuration files for the various resources that make up the new facilities.
GATE includes resources for common LE data structures and algorithms, including documents, corpora and various annotation types, a set of language analysis components for Information Extraction and a range of data visualisation and editing components.
GATE supports documents in a variety of formats including XML, RTF, email, HTML, SGML and plain text. In all cases the format is analysed and converted into a single unified model of annotation. The annotation format is a modified form of the TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format [Bird & Liberman 99], and uses the now standard mechanism of ‘stand-off markup’. GATE documents, corpora and annotations are stored in databases of various sorts, visualised via the development environment, and accessed at code level via the framework. See Chapter 5 for more details of corpora etc.
A family of Processing Resources for language analysis is included in the shape of ANNIE, A Nearly-New Information Extraction system. These components use finite state techniques to implement various tasks from tokenisation to semantic tagging or verb phrase chunking. All ANNIE components communicate exclusively via GATE’s document and annotation resources. See Chapter 6 for more details. Other CREOLE resources are described in Part III.
Three other facilities in GATE deserve special mention:
And by version 4 it will make a mean cup of tea.
This section gives a very brief example of a typical use of GATE to develop and deploy language processing capabilities in an application, and to generate quantitative results for scientific publication.
Let’s imagine that a developer called Fatima is building an email client11 for Cyberdyne Systems’ large corporate Intranet. In this application she would like to have a language processing system that automatically spots the names of people in the corporation and transforms them into mailto hyperlinks.
A little investigation shows that GATE’s existing components can be tailored to this purpose. Fatima starts up GATE Developer, and creates a new document containing some example emails. She then loads some processing resources that will do named-entity recognition (a tokeniser, gazetteer and semantic tagger), and creates an application to run these components on the document in sequence. Having processed the emails, she can see the results in one of several viewers for annotations.
The GATE components are a decent start, but they need to be altered to deal specially with people from Cyberdyne’s personnel database. Therefore Fatima creates new ‘cyber-’ versions of the gazetteer and semantic tagger resources, using the ‘bootstrap’ tool. This tool creates a directory structure on disk that has some Java stub code, a Makefile and an XML configuration file. After several hours struggling with badly written documentation, Fatima manages to compile the stubs and create a JAR file containing the new resources. She tells GATE Developer the URL of these files12, and the system then allows her to load them in the same way that she loaded the built-in resources earlier on.
Fatima then creates a second copy of the email document, and uses the annotation editing facilities to mark up the results that she would like to see her system producing. She saves this and the version that she ran GATE on into her serial datastore. From now on she can follow this routine:
To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that it regenerates itself from the local personnel data. She then alters the pattern grammar in the semantic tagger to prioritise recognition of names from that source. This latter job involves learning the JAPE language (see Chapter 8), but as this is based on regular expressions it isn’t too difficult.
Eventually the system is running nicely, and her accuracy is 93% (there are still some problem cases, e.g. when people use nicknames, but the performance is good enough for production use). Now Fatima stops using GATE Developer and works instead on embedding the new components in her email application using GATE Embedded. This application is written in Java, so embedding is very easy13: the GATE JAR files are added to the project CLASSPATH, the new components are placed on a web server, and with a little code to do initialisation, loading of components and so on, the job is finished in half a day – the code to talk to GATE takes up only around 150 lines of the eventual application, most of which is just copied from the example in the sheffield.examples.StandAloneAnnie class.
Because Fatima is worried about Cyberdyne’s unethical policy of developing Skynet to help the large corporates of the West strengthen their strangle-hold over the World, she wants to get a job as an academic instead (so that her conscience will only have to cope with the torture of students, as opposed to humanity). She takes the accuracy measures that she has attained for her system and writes a paper for the Journal of Nasturtium Logarithm Incitement describing the approach used and the results obtained. Because she used GATE for development, she can cite the repeatability of her experiments and offer access to example binary versions of her software by putting them on an external web server.
And everybody lived happily ever after.
This section contains an incomplete list of publications describing systems that used GATE in competitive quantitative evaluation programmes. These programmes have had a significant impact on the language processing field and the widespread presence of GATE is some measure of the maturity of the system and of our understanding of its likely performance on diverse text processing tasks.
This section logs changes in the latest version of GATE. Appendix A provides a complete change log.
Added a new scriptable controller to the Groovy plugin, whose execution strategy is controlled by a simple Groovy DSL. This supports more powerful conditional execution than is possible with the standard conditional controllers (for example, based on the presence or absence of a particular annotation, or a combination of several document feature values), rich flow control using Groovy loops, etc. See section 7.16.3 for details.
A new version of Alignment Editor has been added to the GATE distribution. It consists of several new features such as the new alignment viewer, ability to create alignment tasks and store in xml files, three different views to align the text (links view and matrix view - suitable for character, word and phrase alignments, parallel view - suitable for sentence or long text alignment), an alignment exporter and many more. See chapter 16 for more information.
A processing resource called Quality Assurance PR has been added in the Tools plugin. The PR wraps the functionality of the Quality Assurance Tool (section 10.3).
A new section for using the Corpus Quality Assurance from GATE Embedded has been written. See section 10.3.
A new plugin called Web_Translate_Google has been added with a PR called Google Translator PR in it. It allows users to translate text using the Google translation services. See section 19.8 for more information.
Added new parameters and options to the LingPipe Language Identifier PR. (section 19.16.5).
In the document editor, fixed several exceptions to make editing text with annotations highlighted working. So you should now be able to edit the text and the annotations should behave correctly that is to say move, expand or disappear according to the text insertions and deletions.
Options for document editor: read-only and insert append/prepend have been moved from the options dialogue to the document editor toolbar at the top right on the triangle icon that display a menu with the options. See section 3.2.
New Gazetteer Editor for ANNIE Gazetteer that can be used instead of Gaze. It uses tables instead of text area to display the gazetteer definition and lists, allows sorting on any column, filtering of the lists, reloading a list, etc. See section 13.2.2.
Added new parameters and options to the Crawl PR and document features to its output; see section 19.5 for details.
Fixed a bug where ontology-aware JAPE rules worked correctly when the target annotation’s class was a subclass of the class specified in the rul, but failed when the two class names matched exactly.
Added the current Corpus to the script binding for the Groovy Script PR, allowing a Groovy script to access and set corpus-level features. Also added callbacks that a Groovy script can implement to do additional pre- or post-processing before the first and after the last document in a corpus. See section 7.16 for details.
Corrected the documentation for the LingPipe POS Tagger (section 19.16.3).
MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus. The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features. See section 19.18 for details.
This is a bugfix release to resolve several bugs that were reported shortly after the release of version 5.2:
This release also fixes some shortcomings in the Groovy support added by 5.2, in particular:
Introduced a utility class gate.Utils containing static utility methods for frequently-used idioms such as getting the string covered by an annotation, finding the start and end offsets of annotations and sets, etc. This class is particularly useful on the right hand side of JAPE rules (section 8.6.5).
Added type parameters to the bindings map available on the RHS of JAPE rules, so you can now do AnnotationSet as = bindings.get("label") without a cast (see section 8.6.5).
Fixed a bug with JAPE’s handling of features called “class” in non-ontology-aware mode. Previously JAPE would always match such features using an equality test, even if a different operator was used in the grammar, i.e. {SomeType.class != "foo"} was matched as {SomeType.class == "foo"}. The correct operator is now used. Note that this does not affect the ontology-aware behaviour: when an ontology parameter is specified, “class” features are always matched using ontology subsumption.
Custom JAPE operators and annotation accessors can now be loaded from plugins as well as from the lib directory (see section 8.2.2).
Added a mechanism to allow plugins to contribute menu items to the “Tools” menu in GATE Developer. See section 4.8 for details.
Enhanced Groovy support in GATE: the Groovy console and Groovy Script PR (in the Groovy plugin) now import many GATE classes by default, and a number of utility methods are mixed in to some of the core GATE API classes to make them more natural to use in Groovy. See section 7.16 for details.
Modified the batch learning PR (in the Learning plugin) to make it safe to use several instances in APPLICATION mode with the same configuration file and the same learned model at the same time (e.g. in a multithreaded application). The other modes (including training and evaluation) are unchanged, and thus are still not safe for use in this way. Also fixed a bug that prevented APPLICATION mode from working anywhere other than as the last PR in a pipeline when running over a corpus in a datastore.
Introduced a simple way to create duplicate copies of an existing resource instance, with a way for individual resource types to override the default duplication algorithm if they know a better way to deal with duplicating themselves. See section 7.7.
Enhanced the Spring support in GATE to provide easy access to the new duplication API, and to simplify the configuration of the built-in Spring pooling mechanisms when writing multi-threaded Spring-based applications. See section 7.14.
The GAPP packager Ant task now respects the ordering of mapping hints, with earlier hints taking precedence over later ones (see section E.2.3).
Bug fix in the UIMA plugin from Roland Cornelissen - AnalysisEnginePR now properly shuts down the wrapped AnalysisEngine when the PR is deleted.
Patch from Matt Nathan to allow several instances of a gazetteer PR in an embedded application to share a single copy of their internal data structures, saving considerable memory compared to loading several complete copies of the same gazetteer lists (see section 13.11).
In the corpus quality assurance, measures for classification tasks have been added. You can also now set the beta for the fscore. This tool has been optimised to work with datastores so that it doesn’t need to read all the documents before comparing them.
Lots of documentation lives on the GATE web server, including:
For more details about Sheffield University’s work in human language processing see the NLP group pages or A Definition and Short History of Language Engineering ([Cunningham 99a]). For more details about Information Extraction see IE, a User Guide or the GATE IE pages.
A list of publications on GATE and projects that use it (some of which are available on-line):
2009
2008
2007
2006
2005
2004
2003
2002
Older than 2002
To download GATE point your web browser at http://gate.ac.uk/download/.
GATE will run anywhere that supports Java 5 or later, including Solaris, Linux, Mac OS X and Windows platforms. We don’t run tests on other platforms, but have had reports of successful installs elsewhere.
The easy way to install is to use one of the platform-specific installers (created using the excellent IzPack). Download a ‘platform-specific installer’ and follow the instructions it gives you. Once the installation is complete, you can start GATE Developer using gate.exe (Windows) or GATE.app (Mac) in the top-level installation directory, or gate.sh in the bin directory (other platforms).
Note for Mac users: on 64-bit-capable systems, GATE.app will run as a 64-bit application. It will use the first listed 64-bit JVM in your Java Preferences, even if your highest priority JVM is a 32-bit one. Thus if you want to run using Java 5 rather than 6 you must ensure that “J2SE 5.0 64-bit” is listed ahead of “Java SE 6 64-bit”.
Download the Java-only release package or the binary build snapshot, and follow the instructions below.
Prerequisites:
available free from Sun Microsystems or from your UNIX supplier. (We test on various Sun JDKs on Solaris, Linux and Windows XP.)
You will also need the lib directory, containing various libraries that GATE depends on.
Using the binary distribution:
The Ant scripts that start GATE Developer (ant.bat or ant) require you to set the JAVA_HOME environment variable to point to the top level directory of your JAVA installation. The value of GATE_CONFIG is passed to the system by the scripts using either a -i command-line option, or the Java property gate.config.
The GATE code is maintained in a Subversion repository. You can use a Subversion
client to check out the source code – the most up-to-date version of GATE is the trunk:
svn checkout https://gate.svn.sourceforge.net/svnroot/gate/gate/trunk gate
Once you have checked out the code you can build GATE using Ant (see Section 2.5)
You can browse the complete Subversion repository online at http://gate.svn.sourceforge.net/.
During initialisation, GATE reads several Java system properties in order to decide where to find its configuration files.
Here is a list of the properties used, their default values and their meanings:
When using GATE Embedded, you can set the values for these properties before you call Gate.init(). Alternatively, you can set the values programmatically using the static methods setGateHome(), setPluginsHome(), setSiteConfigFile(), etc. before calling Gate.init(). See the Javadoc documentation for details. If you want to set these values from the command line you can use the following syntax for setting gate.home for example:
java -Dgate.home=/my/new/gate/home/directory -cp... gate.Main
When running GATE Developer, you can set the properties by creating a file build.properties in the top level GATE directory. In this file, any system properties which are prefixed with ‘run.’ will be passed to GATE. For example, to set an alternative user config file, put the following line in build.properties1:
run.gate.user.config=${user.home}/alternative-gate.xml
This facility is not limited to the GATE-specific properties listed above, for example the following line changes the default temporary directory for GATE (note the use of forward slashes, even on Windows platforms):
run.java.io.tmpdir=d:/bigtmp
When GATE Developer is started, or when Gate.init() is called from GATE Embedded, GATE loads various sorts of configuration data stored as XML in files generally called something like gate.xml or .gate.xml. This data holds information such as:
This type of data is stored at two levels (in order from general to specific):
Where configuration data appears on several different levels, the more specific ones overwrite the more general. This means that you can set defaults for all GATE users on your system, for example, and allow individual users to override those defaults without interfering with others.
Configuration data can be set from the GATE Developer GUI via the ‘Options’ menu then ‘Configuration’. The user can change the appearance of the GUI in the ‘Appearance’ tab, which includes the options of font and the ‘look and feel’. The ‘Advanced’ tab enables the user to include annotation features when saving the document and preserving its format, to save the selected Options automatically on exit, and to save the session automatically on exit. The ‘Input Methods’ submenu from the ‘Options’ menu enables the user to change the default language for input. These options are all stored in the user’s .gate.xml file.
When using GATE Embedded, you can also set the site config location using Gate.setSiteConfigFile(File) prior to calling Gate.init().
Note that you don’t need to build GATE unless you’re doing development on the system itself.
Prerequisites:
GATE now includes a copy of the ANT build tool which can be accessed through the scripts included in the bin directory (use ant.bat for Windows 98 or ME, ant.cmd for Windows NT, 2000 or XP, and ant.sh for Unix platforms).
To build gate, cd to gate and:
(The details of the build process are all specified by the build.xml file in the gate directory.)
You can also use a development environment like Borland JBuilder (click on the gate.jpx file), but note that it’s still advisable to use ant to generate documentation, the jar file and so on. Also note that the run configurations have the location of a gate.xml site configuration file hard-coded into them, so you may need to change these for your site.
This section is based on contributions by Georg �ttl and William Oberman.
To use GATE with Maven you need a definition of the dependencies in POM format. There’s an example POM here.
To use GATE with JPF (a Java plugin framework) you need a plugin definition like this one.
If you have used the installer, run:
or just delete the whole of the installation directory (the one containing bin, lib, Uninstaller, etc.). The installer doesn’t install anything outside this directory, but for completeness you might also want to delete the settings files GATE creates in your home directory (.gate.xml and .gate.session).
Note that the gate.bat script uses javaw.exe to run GATE which means that you will see no console for the java process. If you have problems starting GATE and you would like to be able to see the console to check for messages then you should edit the gate.bat script and replace javaw.exe with java.exe in the definition of the JAVA environment variable.
You might get some clues if you start GATE from the command line, using:
which will allow you to see all error messages GATE generates.
GATE and many other Java applications are known not to work with GCJ, the open-source Java SDK or others non SUN Java SDK.
Make sure you have the official version of Java installed. Provided by Sun, the package is named ‘sun-java6-jdk’ in Synaptic. GATE also works with Java version 5 so ‘sun-java5-jdk’.
To install it, run in a terminal:
Make sure that your default Java version is the one from SUN. You can do this by running:
This will list the installed Java VMs. You should see ‘java-6-sun’ as one of the options.
Then you should run :
to set the ‘java-6-sun’ as your default.
Finally, try GATE again.
32-bit vs. 64-bit is a matter of the JVM rather than the build of GATE -
For example, on Mac OS X, either use Applications/Utilities/Java Preferences and put one of the 64-bit options at the top of the list, or run GATE from the terminal using Java 1.6.0 (which is 64-bit only on Mac OS):
GATE doesn’t use the JAVA_OPTS variable. The default memory allocations are defined in the gate/build.xml file but you can override them by creating a file called build.properties in the same directory containing
If you don’t use ant to start GATE but your own application directly with the ‘java’ executable then you must use something like:
Configuring xms and xmx parameters in eclipse.ini file just adds memory to your Eclipse process. If you start a Java application from within Eclipse, that will run in a different process.
To give more memory to your application, as opposed to just to Eclipse, you need to add those values in the ‘VM Arguments’ section of the run application dialog: lower pane, in the second tab of ‘Run Configurations’ dialog.
You can try to set the environment variable ANT_OPTS to allow for more memory as follows:
Another cause can be when compiling with Java 6 on Mac. It builds OK using Java 5 with
and once built it runs fine with Java 6. Adding
or similar to the <javac> in build.xml might fix it but if you’re making changes that are to be committed to subversion you really ought to be building with Java 5 anyway :-)
You need to copy the ‘gate/bin/log4j.properties’ file to the directory from which you execute your project.
Change the look and feel used in GATE with menu ‘Options’ then ‘Configuration’. Restart GATE and try again. We use mainly ‘Metal’ and ‘Nimbus’ without problem.
Change the video driver you use.
Update Java.
‘The law of evolution is that the strongest survives!’
‘Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. …There is no strength to be gained from hurting one another. Only weakness.’
The Dispossessed [p.183], Ursula K. le Guin, 1974.
This chapter introduces GATE Developer, which is the GATE graphical user interface. It is analogous to systems like Mathematica for mathematicians, or Eclipse for Java programmers, providing a convenient graphical environment for research and development of language processing software. As well as being a powerful research tool in its own right, it is also very useful in conjunction with GATE Embedded (the GATE API by which GATE functionality can be included in your own applications); for example, GATE Developer can be used to create applications that can then be embedded via the API. This chapter describes how to complete common tasks using GATE Developer. It is intended to provide a good entry point to GATE functionality, and so explanations are given assuming only basic knowledge of GATE. However, probably the best way to learn how to use GATE Developer is to use this chapter in conjunction with the demonstrations and tutorials movies. There are specific links to them throughout the chapter. There is also a complete new set of video tutorials here.
The basic business of GATE is annotating documents, and all the functionality we will introduce relates to that. Core concepts are;
What is considered to be the end result of the process varies depending on the task, but for the purposes of this chapter, output takes the form of the annotated document/corpus. Researchers might be more interested in figures demonstrating how successfully their application compares to a ‘gold standard’ annotation set; Chapter 10 in Part II will cover ways of comparing annotation sets to each other and obtaining measures such as F1. Implementers might be more interested in using the annotations programmatically; Chapter 7, also in Part II, talks about working with annotations from GATE Embedded. For the purposes of this chapter, however, we will focus only on creating the annotated documents themselves, and creating GATE applications for future use.
GATE includes a complete information extraction system that you are free to use, called ANNIE (a Nearly-New Information Extraction System). Many users find this is a good starting point for their own application, and so we will cover it in this chapter. Chapter 6 talks in a lot more detail about the inner workings of ANNIE, but we aim to get you started using ANNIE from inside of GATE Developer in this chapter.
We start the chapter with an exploration of the GATE Developer GUI, in Section 3.1. We describe how to create documents (Section 3.2) and corpora (Section 3.3). We talk about viewing and manually creating annotations (Section 3.4).
We then talk about loading the plugins that contain the processing resources you will use to construct your application, in Section 3.5. We then talk about instantiating processing resources (Section 3.6). Section 3.7 covers applications, including using ANNIE (Section 3.7.3). Saving applications and language resources (documents and corpora) is covered in Section 3.8. We conclude with a few assorted topics that might be useful to the GATE Developer user, in Section 3.10.
Figure 3.1 shows the main window of GATE Developer, as you will see it when you first run it. There are five main areas:
The menu and the messages bar do the usual things. Longer messages are displayed in the messages tab in the main resource viewer area.
The resource tree and resource viewer areas work together to allow the system to display diverse resources in various ways. The many resources integrated with GATE can have either a small view, a large view, or both.
At any time, the main viewer can also be used to display other information, such as messages, by clicking on the appropriate tab at the top of the main window. If an error occurs in processing, the messages tab will flash red, and an additional popup error message may also occur.
In the options dialogue from the Options menu you can choose if you want to link the selection in the resources tree and the selected main view.
If you right-click on ‘Language Resources’ in the resources pane, select “New’ then ‘GATE Document’, the window ‘Parameters for the new GATE Document’ will appear as shown in figure 3.2. Here, you can specify the GATE document to be created. Required parameters are indicated with a tick. The name of the document will be created for you if you do not specify it. Enter the URL of your document or use the file browser to indicate the file you wish to use for your document source. For example, you might use ‘http://www.gate.ac.uk’, or browse to a text or XML file you have on disk. Click on ‘OK’ and a GATE document will be created from the source you specified.
See also the movie for creating documents.
The document editor is contained in the central tabbed pane in GATE Developer. Double-click on your document in the resources pane to view the document editor. The document editor consists of a top panel with buttons and icons that control the display of different views and the search box. Initially, you will see just the text of your document, as shown in figure 3.3. Click on ‘Annotation Sets’ and ‘Annotations List’ to view the annotation sets to the right and the annotations list at the bottom. You will see a view similar to figure 3.4. In place of the annotations list, you can also choose to see the annotations stack. In place of the annotation sets, you can also choose to view the co-reference editor. More information about this functionality is given in Section 3.4.
Several options can be set from the triangle icon at the top right corner.
With ‘Save Current Layout’ you store the way the different views are shown and the annotation types highlighted in the document. Then if you set ‘Restore Layout Automatically’ you will get the same views and annotation types each time you open a document.
Another setting make the document editor ‘Read-only’. If enabled, you won’t be able to edit the text but you will still be able to edit annotations. It is useful to avoid to involuntarily modify the original text.
Finally you can choose between ‘Insert Append’ and ‘Insert Prepend’. That setting is only relevant when you’re inserting text at the very border of an annotation.
If you place the cursor at the start of an annotation, in one case the newly entered text will become part of the annotation, in the other case it will stay outside. If you place the cursor at the end of an annotation, the opposite will happen.
Let use this sentence: ‘This is an [annotation].’ with the square brackets [] denoting the boundaries of the annotation. If we insert a ‘x’ just before the ‘a’ or just after the ‘n’ of ‘annotation’, here’s what we get:
Append
Prepend
Text in a loaded document can be edited in the document viewer. The usual platform specific cut, copy and paste keyboard shortcuts should also work, depending on your operating system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a magnifying glass, at the top of the document editor is for searching in the document. To prevent the new annotation windows popping up when a piece of text is selected, hold down the CTRL key. Alternatively, you can hide the annotation sets view by clicking on its button at the top of the document view; this will also cause the highlighted portions of the text to become un-highlighted.
See also Section 16.2.3 for the compound document editor.
You can create a new corpus in a similar manner to creating a new document; simply right-click on ‘Language Resources’ in the resources pane, select ‘New’ then ‘GATE corpus’. A brief dialogue box will appear in which you can optionally give a name for your corpus (if you leave this blank, a corpus name will be created for you) and optionally add documents to the corpus from those already loaded into GATE.
There are three ways of adding documents to a corpus:
Additionally, right-clicking on a loaded document in the tree and selecting the ‘New corpus with this document’ option creates a new transient corpus named Corpus for document name containing just this document.
See also the movie for creating and populating corpora.
Double click on your corpus in the resources pane to see the corpus editor, shown in figure 3.5. You will see a list of the documents contained within the corpus.
In the top left of the corpus editor, plus and minus buttons allow you to add documents to the corpus from those already loaded into GATE and remove documents from the corpus (note that removing a document from a corpus does not remove it from GATE).
Up and down arrows at the top of the view allow you to reorder the documents in the corpus. The rightmost button in the view opens the currently selected document in a document editor.
At the bottom, you will see that tabs entitled ‘Initialisation Parameters’ and ‘Corpus Quality Assurance’ are also available in addition to the corpus editor tab you are currently looking at. Clicking on the ‘Initialisation Parameters’ tab allows you to view the initialisation parameters for the corpus. The ‘Corpus Quality Assurance’ tab allows you to calculate agreement measures between the annotations in your corpus. Agreement measures are discussed in depth in Chapter 10. The use of corpus quality assurance is discussed in Section 10.3.
In this section, we will talk in more detail about viewing annotations, as well as creating and editing them manually. As discussed in at the start of the chapter, the main purpose of GATE is annotating documents. Whilst applications can be used to annotate the documents entirely automatically, annotation can also be done manually, e.g. by the user, or semi-automatically, by running an application over the corpus and then correcting/adding new annotations manually. Section 3.4.5 focuses on manual annotation. In Section 3.6 we talk about running processing resources on our documents. We begin by outlining the functionality around viewing annotations, organised by the GUI area to which the functionality pertains.
To view the annotation sets, click on the ‘Annotation Sets’ button at the top of the document editor, or use the F3 key (see Section 3.9 for more keyboard shortcuts). This will bring up the annotation sets viewer, which displays the annotation sets available and their corresponding annotation types.
The annotation sets view is displayed on the left part of the document editor. It’s a tree-like view with a root for each annotation set. The first annotation set in the list is always a nameless set. This is the default annotation set. You can see in figure 3.4 that there is a drop-down arrow with no name beside it. Other annotation sets on the document shown in figure 3.4 are ‘Key’ and ‘Original markups’. Because the document is an XML document, the original XML markup is retained in the form of an annotation set. This annotation set is expanded, and you can see that there are annotations for ‘TEXT’, ‘body’, ‘font’, ‘html’, ‘p’, ‘table’, ‘td’ and ‘tr’.
To display all the annotations of one type, tick its checkbox or use the space key. The text segments corresponding to these annotations will be highlighted in the main text window. To delete an annotation type, use the delete key. To change the color, use the enter key. There is a context menu for all these actions that you can display by right-clicking on one annotation type, a selection or an annotation set.
If you keep shift key pressed when you open the annotation sets view, GATE Developer will try to select any annotations that were selected in the previous document viewed (if any); otherwise no annotation will be selected.
Having selected an annotation type in the annotation sets view, hovering over an annotation in the main resource viewer or right-clicking on it will bring up a popup box containing a list of the annotations associated with it, from which one can select an annotation to view in the annotation editor, or if there is only one, the annotation editor for that annotation. Figure 3.6 shows the annotation editor.
To view the list of annotations and their features, click on the ‘Annotations list’ button at the top of the main window or use F4 key. The annotation list view will appear below the main text. It will only contain the annotations selected from the annotation sets view. These lists can be sorted in ascending and descending order for any column, by clicking on the corresponding column heading. Moreover you can hide a column by using the context menu by right-clicking on the column headings. Selecting rows in the table will blink the respective annotations in the document. Right-click on a row or selection in this view to delete or edit an annotation. Delete key is a shortcut to delete selected annotations.
This view is similar to the ANNIC view described in section 9.2. It displays annotations at the document caret position with some context before and after. The annotations are stacked from top to bottom, which gives a clear view when they are overlapping.
As the view is centred on the document caret, you can use the conventional keypresses to move it and update the view: notably the keys left and right to skip one letter; control + left/right to skip one word; up and down to go one line up or down; and use the document scrollbar then click in the document to move further. There are also two buttons at the top of the view that centre the view on the closest previous/next annotation boundary among all displayed. This is useful when you want to skip a region without annotation or when you want to reach the beginning or end of a very long annotation.
The annotation types displayed correspond to those selected in the annotation sets view. You can display feature values for an annotation rectangle by hovering the mouse on it or select only one feature to display by double-clicking on the annotation type in the first column.
Right-clicking on an annotation in the annotations stack view gives the option to edit that annotation.

The co-reference editor allows co-reference chains (see Section 6.9) to be displayed and edited in GATE Developer. To display the co-reference editor, first open a document in GATE Developer, and then click on the Co-reference Editor button in the document viewer.
The combo box at the top of the co-reference editor allows you to choose which annotation set to display co-references for. If an annotation set contains no co-reference data, then the tree below the combo box will just show ‘Coreference Data’ and the name of the annotation set. However, when co-reference data does exist, a list of all the co-reference chains that are based on annotations in the currently selected set is displayed. The name of each co-reference chain in this list is the same as the text of whichever element in the chain is the longest. It is possible to highlight all the member annotations of any chain by selecting it in the list.
When a co-reference chain is selected, if the mouse is placed over one of its member annotations, then a pop-up box appears, giving the user the option of deleting the item from the chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it will be removed from the list of chains. If the name of the chain was derived from the item that was deleted, then the chain will be given a new name based on the next longest item in the chain.
A combo box near the top of the co-reference editor allows the user to select an annotation type from the current set. When the Show button is selected all the annotations of the selected type will be highlighted. Now when the mouse pointer is placed over one of those annotations, a pop-up box will appear giving the user the option of adding the annotation to a co-reference chain. The annotation can be added to an existing chain by typing the name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if the user presses the down cursor key, a list of all the existing annotations appears, together with the option [New Chain]. Selecting the [New Chain] option will cause a new chain to be created containing the selected annotation as its only element.
Each annotation can only be added to a single chain, but annotations of different types can be added to the same chain, and the same text can appear in more than one chain if it is referenced by two or more annotations.
The movie for inspecting results is also useful for learning about viewing annotations.
To create annotations manually, select the text you want to annotate and hover the mouse on the selection or use control+E keys. A popup will appear, allowing you to create an annotation, as shown in figure 3.9
The type of the annotation, by default, will be the same as the last annotation you created, unless there is none, in which case it will be ‘_New_’. You can enter any annotation type name you wish in the text box, unless you are using schema-driven annotation (see Section 3.4.6). You can add or change features and their values in the table below.
To delete an annotation, click on the red X icon at the top of the popup window. To grow/shrink the span of the annotation at its start use the two arrow icons on the left or right and left keys. Use the two arrow icons next on the right to change the annotation end or alt+right and alt+left keys. Add shift and control+shift keys to make the span increment bigger. The red X icon is for removing the annotation.
The pin icon is to pin the window so that it remains where it is. If you drag and drop the window, this automatically pins it too. Pinning it means that even if you select another annotation (by hovering over it in the main resource viewer) it will still stay in the same position.
The popup menu only contains annotation types present in the Annotation Schema and those already listed in the relevant Annotation Set. To create a new Annotation Schema, see Section 3.4.6. The popup menu can be edited to add a new annotation type, however.
The new annotation created will automatically be placed in the annotation set that has been selected (highlighted) by the user. To create a new annotation set, type the name of the new set to be created in the box below the list of annotation sets, and click on ‘New’.
Figure 3.10 demonstrates adding a ‘Organization’ annotation for the string ‘EPSRC’ (highlighted in green) to the default annotation set (blank name in the annotation set view on the right) and a feature name ‘type’ with a value about to be added.
To add a second annotation to a selected piece of text, or to add an overlapping annotation to an existing one, press the CTRL key to avoid the existing annotation popup appearing, and then select the text and create the new annotation. Again by default the last annotation type to have been used will be displayed; change this to the new annotation type. When a piece of text has more than one annotation associated with it, on mouseover all the annotations will be displayed. Selecting one of them will bring up the relevant annotation popup.
To search and annotate the document automatically, use the search and annotate function as shown in figure 3.11:
Note that after using the [First] button you can move the caret in the document and use the [Next] button to avoid continuing the search from the beginning of the document. The [?] button at the end of the search text field will help you to build powerful regular expressions to search.
Annotation schemas allow annotation types and features to be pre-specified, so that during manual annotation, the relevant options appear on the drop-down lists in the annotation editor. You can see some example annotation schemas in Section 5.4.1. Annotation schemas provide a means to define types of annotations in GATE Developer. Basically this means that GATE Developer ‘knows about’ annotations defined in a schema.
Annotation schemas are supported by the ‘Annotation schema’ language resource in ANNIE, so to use them you must first ensure that the ‘ANNIE’ plugin is loaded (see Section 3.5). This will load a set of default schemas, as well as allowing you to load schemas of your own.
The default annotation schemas contain common named entities such as Person, Organisation, Location, etc. You can modify the existing schema or create a new one, in order to tell GATE Developer about other kinds of annotations you frequently use. You can still create annotations in GATE Developer without having specified them in an annotation schema, but you may then need to tell GATE Developer about the properties of that annotation type each time you create an annotation for it.
To load a schema of your own, right-click on ‘Language Resources’ in the resources pane. Select ‘New’ then ‘Annotation schema’. A popup box will appear in which you can browse to your annotation schema XML file.
An alternative annotation editor component is available which constrains the available annotation types and features much more tightly, based on the annotation schemas that are currently loaded. This is particularly useful when annotating large quantities of data or for use by less skilled users.
To use this, you must load the Schema_Annotation_Editor plugin. With this plugin loaded, the annotation editor will only offer the annotation types permitted by the currently loaded set of schemas, and when you select an annotation type only the features permitted by the schema are available to edit1. Where a feature is declared as having an enumerated type the available enumeration values are presented as an array of buttons, making it easy to select the required value quickly.
We suggest you to use your browser to print a document as GATE don’t propose a printing facility for the moment.
First save your document by right clicking on the document in the left resources tree then choose ‘Save Preserving Format’. You will get an XML file with all the annotations highlighted as XML tags plus the ‘Original markups’ annotations set.
Then add a stylesheet processing instruction at the beginning of the XML file:
And create a file ‘gate.css’ in the same directory:
Finally open the XML file in your browser and print it.
Note that overlapping annotations, cannot be expressed correctly with inline XML tags and thus won’t be displayed correctly.
In GATE, processing resources are used to automatically create and manipulate annotations on documents. We will talk about processing resources in the next section. However, we must first introduce CREOLE plugins. In most cases, in order to use a particular processing resource (and certain language resources) you must first load the CREOLE plugin that contains it. This section talks about using CREOLE plugins. Then, in Section 3.6, we will talk about creating and using processing resources.
The definitions of CREOLE resources (e.g. processing resources such as taggers and parsers, see Chapter 4) are stored in CREOLE directories (directories containing an XML file describing the resources, the Java archive with the compiled executable code and whatever libraries are required by the resources).
Starting with version 3, CREOLE directories are called ‘CREOLE plugins’ or simply ‘plugins’. In previous versions, the CREOLE resources distributed with GATE used to be included in the monolithic gate.jar archive. Version 3 includes them as separate directories under the plugins directory of the distribution. This allows easy access to the linguistic resources used without the requirement to unpack the gate.jar file.
Plugins can have one or more of the following states in relation with GATE:
The default location for installed plugins can be modified using the gate.plugins.home system property while the list of auto-loadable plugins can be set using the load.plugin.path property, see Section 2.3 above.
The CREOLE plugins can be managed through the graphical user interface which can be activated by selecting ‘Manage CREOLE Plugins’ from the ‘File’ menu. This will bring up a window listing all the known plugins. For each plugin there are two check-boxes – one labelled ‘Load now’, which will load the plugin, and the other labelled ‘Load always’ which will add the plugin to the list of auto-loadable plugins. A ‘Delete’ button is also provided – which will remove the plugin from the list of known plugins. This operation does not delete the actual plugin directory. Installed plugins are found automatically when GATE is started; if an installed plugin is deleted from the list, it will re-appear next time GATE is launched.
If you select a plugin, you will see in the pane on the right the list of resources that plugin contains. For example, in figure 3.12, the ‘Alignment’ plugin is selected, and you can see that it contains ten processing resources; ‘Compound Document’, ‘Compound Document From Xml’, ‘Compound Document Editor’, ‘GATE Composite document’ etc. If you wish to use a particular resource you will have to ascertain which plugin contains it. This list can be useful for that. Alternatively, the GATE website provides a directory of plugins and their processing resources.
Having loaded the plugins you need, the resources they define will be available for use. Typically, to the GATE Developer user, this means that they will appear on the ‘New’ menu when you right-click on ‘Processing Resources’ in the resources pane, although some special plugins have different effects; for example, the Schema_Annotation_Editor (see Section 3.4.6).
This section describes how to load and run CREOLE resources not present in ANNIE. To load ANNIE, see Section 3.7.3. For technical descriptions of these resources, see the appropriate chapter in Part III (e.g. Chapter 19). First ensure that the necessary plugins have been loaded (see Section 3.5). If the resource you require does not appear in the list of Processing Resources, then you probably do not have the necessary plugin loaded. Processing resources are loaded by selecting them from the set of Processing Resources: right click on Processing Resources or select ‘New Processing Resource’ from the File menu.
For example, use the Plugin Console Manager to load the ‘Tools’ plugin. When you right click on ‘Processing Resources’ in the resources pane and select ‘New’ you have the option to create any of the processing resources that plugin provides. You may choose to create a ‘GATE Morphological Analyser’, with the default parameters. Having done this, an instance of the GATE Morphological Analyser appears under ‘Processing Resources’. This processing resource, or PR, is now available to use. Double-clicking on it in the resources pane reveals its initialisation parameters, see figure 3.13.
This processing resource is now available to be added to applications. It must be added to an application before it can be applied to documents. You may create as many of a particular processing resource as you wish, for example with different initialisation parameters. Section 3.7 talks about creating and running applications.
See also the movie for loading processing resources.
Once all the resources you need have been loaded, an application can be created from them, and run on your corpus. Right click on ‘Applications’ and select ‘New’ and then either ‘Corpus Pipeline’ or ‘Pipeline’. A pipeline application can only be run over a single document, while a corpus pipeline can be run over a whole corpus.
To build the pipeline, double click on it, and select the resources needed to run the application (you may not necessarily wish to use all those which have been loaded).
Transfer the necessary components from the set of ‘loaded components’ displayed on the left hand side of the main window to the set of ‘selected components’ on the right, by selecting each component and clicking on the left and right arrows, or by double-clicking on each component.
Ensure that the components selected are listed in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the left side of the pane.
Ensure that any parameters necessary are set for each processing resource (by clicking on the resource from the list of selected resources and checking the relevant parameters from the pane below). For example, if you wish to use annotation sets other than the Default one, these must be defined for each processing resource.
Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-down menu beside the ‘corpus’ box. If a pipeline is used, the document must be selected for each processing resource used.
Finally, click on ‘Run’ to run the application on the document or corpus.
See also the movie for loading and running processing resources.
For how to use the conditional versions of the pipelines see Section 3.7.2 and for saving/restoring the configuration of an application see Section 3.8.3.
To avoid loading all your documents at the same time you can run an application on a datastore corpus.
To do this you need to load your datastore, see section 3.8.2, and to load the corpus from the datastore by double clicking on it in the datastore viewer.
Then, in the application viewer, you need to select this corpus in the drop down list of corpora.
When you run the application on the corpus datastore, each document will be loaded, processed, saved then unloaded. So at any time there will be only one document from the datastore corpus loaded. This prevent memory shortage but is also a little bit slower than if all your documents were already loaded.
The processed documents are automatically saved back to the datastore so you may want to use a copy of the datastore to experiment.
Be very careful that if you have some documents from the datastore corpus already loaded before running the application then they will not be unloaded nor saved. To save such document you have to right click on it in the resources tree view and save it to the datastore.
The ‘Conditional Pipeline’ and ‘Conditional Corpus Pipeline’ application types are conditional versions of the pipelines mentioned in Section 3.7 and allow processing resources to be run or not according to the value of a feature on the document. In terms of graphical interface, the only addition brought by the conditional versions of the applications is a box situated underneath the lists of available and selected resources which allows the user to choose whether the currently selected processing resource will run always, never or only on the documents that have a particular value for a named feature.
If the Yes option is selected then the corresponding resource will be run on all the documents processed by the application as in the case of non-conditional applications. If the No option is selected then the corresponding resource will never be run; the application will simply ignore its presence. This option can be used to temporarily and quickly disable an application component, for debugging purposes for example.
The If value of feature option permits running specific application components conditionally on document features. When selected, this option enables two text input fields that are used to enter the name of a feature and the value of that feature for which the corresponding processing resource will be run. When a conditional application is run over a document, for each component that has an associated condition, the value of the named feature is checked on the document and the component will only be used if the value entered by the user matches the one contained in the document features.
At first sight the conditional behaviour available with these controller may seem limited, but in fact it is very powerful when used in conjunction with JAPE grammars (see chapter 8). Complex conditions can be encoded in JAPE rules which set the appropriate feature values on the document for use by the conditional controllers. Alternatively, the Groovy plugin provides a scriptable controller (see section 7.16.3) in which the execution strategy is defined by a Groovy script, allowing much richer conditional behaviour to be encoded directly in the controller’s configuration.
This section describes how to load and run ANNIE (see Chapter 6) from GATE Developer. ANNIE is a good place to start because it provides a complete information extraction application, that you can run on any corpus. You can then view the effects.
From the File menu, select ‘Load ANNIE System’. To run it in its default state, choose ‘with Defaults’. This will automatically load all the ANNIE resources, and create a corpus pipeline called ANNIE with the correct resources selected in the right order, and the default input and output annotation sets.
If ‘without Defaults’ is selected, the same processing resources will be loaded, but a popup window will appear for each resource, which enables the user to specify a name, location and other parameters for the resource. This is exactly the same procedure as for loading a processing resource individually, the difference being that the system automatically selects those resources contained within ANNIE. When the resources have been loaded, a corpus pipeline called ANNIE will be created as before.
The next step is to add a corpus (see Section 3.3), and select this corpus from the drop-down corpus menu in the Serial Application editor. Finally click on ‘Run’ from the Serial Application editor, or by right clicking on the application name in the resources pane and selecting ‘Run’. (Many people prefer to switch to the messages tab, then run their application by right-clicking on it in the resources pane, because then it is possible to monitor any messages that appear whilst the application is running.)
To view the results, double click on one of the document contained in the corpus processed in the left hand tree view. No annotation sets nor annotations will be shown until annotations are selected in the annotation sets; the ‘Default’ set is indicated only with an unlabelled right-arrowhead which must be selected in order to make visible the available annotations. Open the default annotation set and select some of the annotations to see what the ANNIE application has done.
See also the movie for loading and running ANNIE.
You will find the ANNIE resources in gate/plugins/ANNIE/resources. Simply locate the existing resources you want to modify, make a copy with a new name, edit them, and load the new resources into GATE as new Processing Resources (see Section 3.6).
In this section, we will describe how applications and language resources can be saved for use outside of GATE and for use with GATE at a later time. Section 3.8.1 talks about saving documents to file. Section 3.8.2 outlines how to use datastores. Section 3.8.3 talks about saving application states (resource parameter states), and Section 3.8.4 talks about exporting applications together with referenced files and resources to a ZIP file.
There are three main ways to save annotated documents:
This section describes how to use the first two options.
Both types of data export are available in the popup menu triggered by right-clicking on a document in the resources tree (see Section 3.1): type 1 is called ‘Save Preserving Format’ and type 2 is called ‘Save as XML’. In addition, all documents in a corpus can be saved as individual XML files into a directory by right-clicking on the corpus in the resources tree and choosing the option ‘Save as XML‘.
Selecting the save as XML option leads to a file open dialogue; give the name of the file you want to create, and the whole document and all its data will be exported to that file. If you later create a document from that file, the state will be restored. (Note: because GATE’s annotation model is richer than that of XML, and because our XML dump implementation sometimes cuts corners2, the state may not be identical after restoration. If your intention is to store the state for later use, use a DataStore instead.)
The ‘Save Preserving Format’ option also leads to a file dialogue; give a name and the data you require will be dumped into the file. The action can be used for documents that were created from files using the XML or HTML format. It will save all the original tags as well as the document annotations that are currently displayed in the ‘Annotations List’ view. This option is useful for selectively saving only some annotation types.
The annotations are saved as normal document tags, using the annotation type as the tag name. If the advanced option ‘Include annotation features for “Save Preserving Format”’ (see Section 2.4) is set to true, then the annotation features will also be saved as tag attributes.
Using this operation for GATE documents that were not created from an HTML or XML file results in a plain text file, with in-line tags for the saved annotations.
Note that GATE’s model of annotation allows graph structures, which are difficult to represent in XML (XML is a tree-structured representation format). During the dump process, annotations that cross each other in ways that cannot be represented in legal XML will be discarded, and a warning message printed.
Where corpora are large, the memory available may not be sufficient to have all documents open simultaneously. The datastore functionality provides the option to save documents to disk and open them only one at a time for processing. This means that much larger corpora can be used. A datastore can also be useful for saving documents in an efficient and lossless way.
To save a text in a datastore, a new datastore must first be created if one does not already exist. Create a datastore by right clicking on Datastore in the left hand pane, and select the option ‘Create Datastore’. Select the data store type you wish to use. Create a directory to be used as the datastore (note that the datastore is a directory and not a file).
You can either save a whole corpus to the datastore (in which case the structure of the corpus will be preserved) or you can save individual documents. The recommended method is to save the whole corpus. To save a corpus, right click on the corpus name and select the ‘Save to...’ option (giving the name of the datastore created earlier). To save individual documents to the datastore, right clicking on each document name and follow the same procedure.
To load a document from a datastore, do not try to load it as a language resource. Instead, open the datastore by right clicking on Datastore in the left hand pane, select ‘Open Datastore’ and choose the datastore to open. The datastore tree will appear in the main window. Double click on a corpus or document in this tree to open it. To save a corpus and document back to the same datastore, simply select the ‘Save’ option.
See also the movie for creating a datastore and the movie for loading corpus and documents from a datastore.
Resources, and applications that are made up of them, are created based on the settings of their parameters (see Section 3.6). It is possible to save the data used to create an application to a file and re-load it later. To save the application to a file, right click on it in the resources tree and select ‘Save application state’, which will give you a file creation dialogue. Choose a file name that ends in gapp as this file dialog and the one for loading application states age displays all files which have a name ending in gapp. A common convention is to use .gapp as a file extension.
To restore the application later, select ‘Restore application from file’ from the ‘File’ menu.
Note that the data that is saved represents how to recreate an application – not the resources that make up the application itself. So, for example, if your application has a resource that initialises itself from some file (e.g. a grammar, a document) then that file must still exist when you restore the application.
In case you don’t want to save the corpus configuration associated with the application then you must select ‘<none>’ in the corpus list of the application before saving the application.
The file resulting from saving the application state contains the values of the initialisation and runtime parameters for all the processing resources contained by the stored application as well as the values of the initialisation parameters for all the language resources referenced by those processing resources. Note that if you reference a document that has been created with an empty URL and empty string content parameter and subsequently been manually edited to add content, that content will not be saved. In order for document content to be preserved, load the document from an URL, specify the content as for the string content parameter or use a document from a datastore.
For the parameters of type URL (which are typically used to select external resources such as grammars or rules files) a transformation is applied so that the paths are stored relative to the location of the file used to store the state. This means that the resource files which are not part of GATE and used by an application do not need to be in the same location as when the application was initially created but rather in the same location relative to the location of the application file. It also means that the resource files which are part of GATE should be found correctly no matter where GATE is installed. This allows the creation and deployment of portable applications by keeping the application file and the resource files used by the application together.
If you want to save your application along with all the resources it requires you can use the ‘Export for Teamware’ option (see Section 3.8.4).
See also the movie for saving and restoring applications.
When you save an application using the ‘Save application state’ option (see Section 3.8.3), the saved file contains references to the plugins that were loaded when the application was saved, and to any resource files required by the application. To be able to reload the file, these plugins and other dependencies must exist at the same locations (relative to the saved state file). While this is fine for saving and loading applications on a single machine it means that if you want to package your application to run it elsewhere (e.g. deploy it to a GATE Teamware installation) then you need to be careful to include all the resource files and plugins at the right locations in your package. The ‘Export for Teamware’ option on the right-click menu for an application helps to automate this process.
When you export an application in this way, GATE Developer produces a ZIP file containing the saved application state (in the same format as ‘Save application state’). Any plugins and resource files that the application refers to are also included in the zip file, and the relative paths in the saved state are rewritten to point to the correct locations within the package. The resulting package is therefore self-contained and can be copied to another machine and unpacked there, or passed to your Teamware Administrator for deployment.
As well as selecting the location where you want to save the package, the ‘Export for Teamware’ option will also prompt you to select the annotation sets that your application uses for input and output. For example, if your application makes use of the unpacked XML markup in source documents and creates annotations in the default set then you would select ‘Original markups’ as an input set and the ‘<Default annotation set>’ as an output set. GATE Developer will try to make an educated guess at the correct sets but you should check and amend the lists as necessary.
There are a few important points to note about the export process:
If you require more flexibility than this option provides you should read Section E.2, which describes the underlying Ant task that the exporter uses.
You can use various keyboard shortcuts for common tasks in GATE Developer. These are listed in this section.
General (Section 3.1):
Resources tree (Section 3.1):
Document editor (Section 3.2):
Annotation editor (Section 3.4):
Annic/Lucene datastore (Chapter 9):
Annic/Lucene query text field (Chapter 9):
GATE can remember Developer options and the state of the resource tree when it exits. The options are saved by default; the session state is not saved by default. This default behaviour can be changed from the ‘Advanced’ tab of the ‘Configuration’ choice on the ‘Options’ menu.
If a problem occurs and the saved data prevents GATE Developer from starting, you can fix this by deleting the configuration and session data files. These are stored in your home directory, and are called gate.xml and gate.sesssion or .gate.xml and .gate.sesssion depending on platform. On Windows your home is:
GATE provides various facilities for working with Unicode beyond those that come as default with Java3:
1 using the editor:
In GATE Developer, select ‘Unicode editor’ from the ‘Tools’ menu. This will display an editor
window, and, when a language with a custom input method is selected for input (see next section),
a virtual keyboard window with the characters of the language assigned to the keys on the
keyboard. You can enter data either by typing as normal, or with mouse clicks on the virtual
keyboard.
2 configuring input methods:
In the editor and in GATE Developer’s main window, the ‘Options’ menu has an ‘Input methods’
choice. All supported input languages (a superset of the JDK languages) are available here. Note
that you need to use a font capable of displaying the language you select. By default
GATE Developer will choose a Unicode font if it can find one on the platform you’re
running on. Otherwise, select a font manually from the ‘Options’ menu ‘Configuration’
choice.
3 using the development kit:
GUK, the GATE Unicode Kit, is documented at:
http://gate.ac.uk/gate/doc/javadoc/guk/package-summary.html.
4 reading different character encodings:
When you create a document from a URL pointing to textual data in GATE, you have to tell the
system what character encoding the text is stored in. By default, GATE will set this
parameter to be the empty string. This tells Java to use the default encoding for whatever
platform it is running on at the time – e.g. on Western versions of Windows this will be
ISO-8859-1, and Eastern ones ISO-8859-9. A popular way to store Unicode documents is
in UTF-8, which is a superset of ASCII (but can still store all Unicode data); if you
get an error message about document I/O during reading, try setting the encoding to
UTF-8, or some other locally popular encoding. (To see a list of available encodings, try
opening a document in GATE’s unicode editor – you will be prompted to select an
encoding.)
…Noam Chomsky’s answer in Secrets, Lies and Democracy (David Barsamian 1994; Odonian) to ‘What do you think about the Internet?’
‘I think that there are good things about it, but there are also aspects of it that concern and worry me. This is an intuitive response – I can’t prove it – but my feeling is that, since people aren’t Martians or robots, direct face-to-face contact is an extremely important part of human life. It helps develop self-understanding and the growth of a healthy personality.
‘You just have a different relationship to somebody when you’re looking at them than you do when you’re punching away at a keyboard and some symbols come back. I suspect that extending that form of abstract and remote relationship, instead of direct, personal contact, is going to have unpleasant effects on what people are like. It will diminish their humanity, I think.’
Chomsky, quoted at http://photo.net/wtr/dead-trees/53015.htm.
The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun, as described in section 4.2.
GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.
GATE components are one of three types:
The distinction between language resources and processing resources is explored more fully in section C.1.1. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.
In the rest of this chapter:
GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations and XML for configuration of resources (and GATE itself).
Resource implementations are grouped together as ‘plugins’, stored at a URL (when the resources are in the local file system this can be a file:/ URL). When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL and uses the contents of this file to determine what resources this plugin declares and where to find the classes that implement the resource types (typically these classes are stored in a JAR file in the plugin directory). Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.
Language resource data can be stored in binary serialised form in the local file system.
We can think of the GATE framework as a backplane into which users can plug CREOLE components. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system.
The backplane performs these functions:
A set of components plus the framework is a deployment unit which can be embedded in another application.
At their most basic, all GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions:
GATE uses Java Beans conventions to construct and configure resources at runtime, and defines interfaces that different component types must implement.
CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using GATE Developer, resources can be loaded and viewed via the resources tree (left pane) and the ‘create resource’ mechanism. When programming with GATE Embedded, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see Chapter 6) does; in this case you will use GATE Developer to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.
The various phases may be summarised as:
PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In GATE, applications are called ‘controllers’ accordingly.
Currently only sequential, or pipeline, execution is supported. There are two main types of pipeline:
Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.7.2 for how to use these. If more flexibility is required, the Groovy plugin provides a scriptable controller (see section 7.16.3) whose execution strategy is specified using the Groovy programming language.
Controllers are themselves PRs – in particular a simple pipeline is a standard PR and a corpus pipeline is a LanguageAnalyser – so one pipeline can be nested in another. This is particularly useful with conditional controllers to group together a set of PRs that can all be turned on or off as a group.
There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process, are simply ignored and the execution moves to the next document after the timeout interval has lapsed.
All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application’s execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.
Language Resources can be stored in Datastores. Datastores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:
GATE comes with various built-in components:
This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory containing an XML configuration file called creole.xml. Configuration data for the plugin’s resources can be given in the creole.xml file or directly in the Java source file using Java 5 annotations.
A creole.xml file has a root element <CREOLE-DIRECTORY>, but the further contents of this element depend on the configuration style. The following three sections discuss the different styles – all-XML, all-annotations and a mixture of the two.
To configure your resources in the creole.xml file, the <CREOLE-DIRECTORY> element should contain one <RESOURCE> element for each resource type in the plugin. The <RESOURCE> elements may optionally be contained within a <CREOLE> element (to allow a single creole.xml file to be built up by concatenating multiple separate files). For example:
Each resource must give a name, a Java class and the JAR file that it can be loaded from. The above example is taken from the Parser_Minipar plugin, and defines a single resource with a number of parameters.
The full list of valid elements under <RESOURCE> is as follows:
For visual resources, a <GUI> element should also be provided. This takes a TYPE attribute, which can have the value LARGE or SMALL. LARGE means that the visual resource is a large viewer and should appear in the main part of the GATE Developer window on the right hand side, SMALL means the VR is a small viewer which appears in the space below the resources tree in the bottom left. The <GUI> element supports the following sub-elements:
For annotation viewers, you should specify an <ANNOTATION_TYPE_DISPLAYED> element giving the annotation type that the viewer can display (e.g. Sentence).
Resources may also have parameters of various types. These resources, from the GATE distribution, illustrate the various types of parameters:
Parameters may be optional, and may have default values (and may have comments to describe their purpose, which is displayed by GATE Developer during interactive parameter setting).
Some PR parameters are execution time (RUNTIME), some are initialisation time. E.g. at execution time a doc is supplied to a language analyser; at initialisation time a grammar may be supplied to a language analyser.
The <PARAMETER> tag takes the following attributes:
It is possible for two or more parameters to be mutually exclusive (i.e. a user must specify one or the other but not both). In this case the <PARAMETER> elements should be grouped together under an <OR> element.
The type of the parameter is specified as the text of the <PARAMETER> element, and the type supplied must match the return type of the parameter’s get method. Any reference type (class, interface or enum) may be used as the parameter type, including other resource types – in this case GATE Developer will offer a list of the loaded instances of that resource as options for the parameter value. Primitive types (char, boolean, …) are not supported, instead you should use the corresponding wrapper type (java.lang.Character, java.lang.Boolean, …). If the getter returns a parameterized type (e.g. List<Integer>) you should just specify the raw type (java.util.List) here2.
The DEFAULT string is converted to the appropriate type for the parameter - java.lang.String parameters use the value directly, primitive wrapper types e.g. java.lang.Integer use their respective valueOf methods, and other built-in Java types can have defaults specified provided they have a constructor taking a String.
The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g. http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file. Thus a DEFAULT of ‘resources/main.jape’ in the file file:/opt/MyPlugin/creole.xml is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.
For Collection-valued parameters multiple values may be specified, separated by semicolons, e.g. ‘foo;bar;baz’; if the parameter’s type is an interface – Collection or one of its sub-interfaces (e.g. List) – a suitable concrete class (e.g. ArrayList, HashSet) will be chosen automatically for the default value.
For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g. ‘kind=word;orth=upperInitial’. For enum-valued parameters the default string is taken as the name of the enum constant to use. Finally, if no DEFAULT attribute is specified, the default value is null.
As an alternative to the XML configuration style, GATE provides Java 5 annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and mapped into the equivalent entries in creole.xml before parsing. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation.
For more detailed information, see the JavaDoc documentation for gate.creole.metadata.
To use annotation-driven configuration a creole.xml file is still required but it need only contain the following:
This tells GATE to load myPlugin.jar and scan its contents looking for resource classes annotated with @CreoleResource. Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".
To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:
The @CreoleResource annotation provides slots for all the values that can be specified under <RESOURCE> in creole.xml, except <CLASS> (inferred from the name of the annotated class) and <JAR> (taken to be the JAR containing the class):
For visual resources only, the following elements are also available:
For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).
Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:
GATE will infer the parameter’s name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter’s type is inferred from the type of the method parameter (java.net.URL in this case).
The annotation elements of @CreoleParameter correspond to the attributes of the <PARAMETER> tag in the XML configuration style:
Mutually-exclusive parameters (such as would be grouped in an <OR> in creole.xml) are handled by adding a disjunction="label" to the @CreoleParameter annotation – all parameters that share the same label are grouped in the same disjunction.
Optional and runtime parameters are marked using extra annotations, for example:
Unlike with pure XML configuration, when using annotations a resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class.
The resource name and the isPrivate and mainViewer flags are not inherited.
Parameter definitions are inherited in a similar way. This is one of the big advantages of annotation configuration over pure XML – if one resource class extends another then with pure XML configuration all the parent class’s parameter definitions must be duplicated in the subclass’s creole.xml definition. With annotations, parameters are inherited from the parent class (and its parent, etc.) as well as from any interfaces implemented. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.
Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:
The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.
Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:
Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource. If the subclass is not annotated then GATE assumes that all its configuration is contained in creole.xml in the usual way.
It is possible and often useful to mix and match the XML and annotation-driven configuration styles. The rule is always that anything specified in the XML takes priority over the annotations. The following examples show what this allows.
Suppose you have a plugin from some third party that uses annotation-driven configuration. You don’t have the source code but you would like to override the default value for one of the parameters of one of the plugin’s resources. You can do this in the creole.xml:
The default value for the listUrl parameter in the annotated class will be replaced by your value.
For resources like document formats, where there should always and only be one instance in GATE at any time, it makes sense to put the auto-instance definitions in the @CreoleResource annotation. But if the automatically created instances are a convenience rather than a necessity it may be better to define them in XML so other users can disable them without re-compiling the class:
If you would prefer to use XML configuration for your own resources, but would like to benefit from the parameter inheritance features of the annotation-driven approach, you can write a normal creole.xml file with all your configuration and just add a blank @CreoleResource annotation to your class. For example:
N.B. Without the @CreoleResource the parameters would not be inherited.
Visual Resources allow a developer to provide a GUI to interact with a particular resource type (PR or LR), but sometimes it is useful to provide general utilities for use in the GATE Developer GUI that are not tied to any specific resource type. Examples include the annotation diff tool and the Groovy console (provided by the Groovy plugin), both of which are self-contained tools that display in their own top-level window. To support this, the CREOLE model has the concept of a tool.
A resource type is marked as a tool by using the <TOOL/> element in its creole.xml definition, or by setting tool = true if using the @CreoleResource annotation configuration style. If a resource is declared to be a tool, and written to implement the gate.gui.ActionsPublisher interface, then whenever an instance of the resource is created its published actions will be added to the “Tools” menu in GATE Developer.
Since the published actions of every instance of the resource will be added to the tools menu, it is best not to use this mechanism on resource types that can be instantiated by the user. The “tool” marker is best used in combination with the “private” flag (to hide the resource from the list of available types in the GUI) and one or more hidden autoinstance definitions to create a limited number of instances of the resource when its defining plugin is loaded. See the GroovySupport resource in the Groovy plugin for an example of this.
If your plugin provides a number of tools (or a number of actions from the same tool) you may wish to organise your actions into one or more sub-menus, rather than placing them all on the single top-level tools menu. To do this, you need to put a special value into the actions returned by the tool’s getActions() method:
The key must be GateConstants.MENU_PATH_KEY and the value must be an array of strings. Each string in the array represents the name of one level of sub-menus. Thus in the example above the action would be placed under “Tools → Acme toolkit → Statistics”. If no MENU_PATH_KEY value is provided the action will be placed directly on the Tools menu.
Sometimes in life you’ve got to dance like nobody’s watching.
…
I think they should introduce ‘sleeping’ to the Olympics. It would be an excellent
field event, in which the ‘athletes’ (for want of a better word) all lay down in beds,
just beyond where the javelins land, and the first one to fall asleep and not wake
up for three hours would win gold. I, for one, would be interested in seeing what
kind of personality would be suited to sleeping in a competitive environment.
…
Life is a mystery to be lived, not a problem to be solved.
Round Ireland with a Fridge, Tony Hawks, 1998 (pp. 119, 147, 179).
This chapter documents GATE’s model of corpora, documents and annotations on documents. Section 5.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 5.2, Section 5.3 and Section 5.4 describe corpora, documents and annotations on documents respectively. Section 5.5 describes GATE’s support for diverse document formats, and Section 5.5.2 describes facilities for XML input/output.
GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java’s Map interface (part of the Collections API).
A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stored attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).
Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.
Documents are modelled as content plus annotations (see Section 5.4) plus features (see Section 5.1). The content of a document can be any subclass of DocumentContent.
Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the sources document, e.g. character offsets.
Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using GATE Developer to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)
Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.4.6 describes how to create new schemas.
This section shows some simple examples of annotated documents.
This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML community.
Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character (see TIPSTER Architecture Design Document).
Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.
The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.
| Text | ||||
| Cyndi savoredthe soup.
| ||||
| ^0...^5...^10..^15..^20 | ||||
| Annotations
| ||||
| Id | Type | SpanStart | Span End | Features |
| 1 | token | 0 | 5 | pos=NP |
| 2 | token | 6 | 13 | pos=VBD |
| 3 | token | 14 | 17 | pos=DT |
| 4 | token | 18 | 22 | pos=NN |
| 5 | token | 22 | 23 | |
| 6 | name | 0 | 5 | name_type=person |
| 7 | sentence | 0 | 23 | |
Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.
| Text | ||||
| Cyndi savoredthe soup.
| ||||
| ^0...^5...^10..^15..^20 | ||||
| Annotations
| ||||
| Id | Type | SpanStart | Span End | Features |
| 1 | token | 0 | 5 | pos=NP |
| 2 | token | 6 | 13 | pos=VBD |
| 3 | token | 14 | 17 | pos=DT |
| 4 | token | 18 | 22 | pos=NN |
| 5 | token | 22 | 23 | |
| 6 | name | 0 | 5 | name_type=person |
| 7 | sentence | 0 | 23 | constituents=[1],[2],[3].[4],[5] |
In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[ Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of an feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents. At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 5.3).
| Text | ||||
| To: All Barnyard Animals
| ||||
| ^0...^5...^10..^15..^20. | ||||
| From: Chicken Little
| ||||
| ^25..^30..^35..^40.. | ||||
| Date: November 10,1194
| ||||
| ...^50..^55..^60..^65. | ||||
| Subject: Descending Firmament
| ||||
| .^70..^75..^80..^85..^90..^95 | ||||
| Priority: Urgent
| ||||
| .^100.^105.^110. | ||||
| The sky is falling. The sky is falling.
| ||||
| ....^120.^125.^130.^135.^140.^145.^150.
| ||||
| Annotations
| ||||
| Id | Type | SpanStart | Span End | Features |
| 1 | Addressee | 4 | 24 | |
| 2 | Source | 31 | 45 | |
| 3 | Date | 53 | 69 | ddmmyy=101194 |
| 4 | Subject | 78 | 98 | |
| 5 | Priority | 109 | 115 | |
| 6 | Body | 116 | 155 | |
| 7 | Sentence | 116 | 135 | |
| 8 | Sentence | 136 | 155 | |
If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields. Our final example (Table 5.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:
| Text | ||||
| Topster tackles 2 terrorbytes.
| ||||
| ^0...^5...^10..^15..^20..^25.. | ||||
| Annotations
| ||||
| Id | Type | SpanStart | Span End | Features |
| 1 | token | 0 | 7 | pos=NP correction=TIPSTER |
| 2 | token | 8 | 15 | pos=VBZ |
| 3 | token | 16 | 17 | pos=CD |
| 4 | token | 18 | 29 | pos=NNS correction=terabytes |
| 5 | token | 29 | 30 | |
Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved as inline (‘Save Preserving Format’ in the context menu).
To view and edit annotation types, see Section 3.4. To add annotations of a new type, see Section 3.4.5. To add a new annotation schema, see Section 3.4.6.
The following document formats are supported by GATE:
By default GATE will try and identify the type of the document, then strip and convert any markup into GATE’s annotation format. To disable this process, set the markupAware parameter on the document to false.
When reading a document of one of these types, GATE extracts the text between tags (where such exist) and create a GATE annotation filled as follows:
The name of the tag will constitute the annotation’s type, all the tags attributes will materialize in the annotation’s features and the annotation will span over the text covered by the tag. A few exceptions of this rule apply for the RTF, Email and Plain Text formats, which will be described later in the input section of these formats.
The text between tags is extracted and appended to the GATE document’s content and all annotations created from tags will be placed into a GATE annotation set named ‘Original markups’.
Example:
If the markup is like this:
then the annotation created by GATE will look like:
The startNode and endNode are created from offsets referring the beginning and the end of ‘A piece of text’ in the document’s content.
The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is the ‘UTF-8’ encoding which is also the most storage efficient one for UNICODE. If, when loading a document in GATE the encoding parameter is set to ‘’(the empty string), then the default encoding of the platform will be used.
In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:
The first represents the extension of a file like (xml,htm,html,txt,sgm,rtf, etc), the second represents the HTTP information sent by a web server regarding the content type of the document being send by it (text/html; text/xml, etc), and the third one represents certain sequences of chars which are ultimately number sequences. GATE is capable of supporting multimedia documents, if the right reader is added to the framework. Sometimes, multimedia documents are identified by a signature consisting in a sequence of numbers. Inside GATE they are called magic numbers. For textual documents, certain char sequences form such magic numbers. Examples of magic numbers sequences will be provided in the Input section of each format supported by GATE.
All those tests are applied to each document read, and after that, a voting mechanism decides what is the best reader to associate with the document. There is a degree of priority for all those tests. The document’s extension test has the highest priority. If the system is in doubt which reader to choose, then the one associated with document’s extension will be selected. The next higher priority is given to the web server’s content type and the third one is given to the magic numbers detection. However, any two tests that identify the same mime type, will have the highest priority in deciding the reader that will be used. The web server test is not always successful as there might be documents that are loaded from a local file system, and the magic number detection test is not always applicable. In the next paragraphs we will se how those tests are performed and what is the general mechanism behind reader detection.
The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps filled by the init() method of each reader. This method comes with three signatures:
The first two methods try to detect the right MimeType for the GATE document, and after that, they call the third one to return the reader associate with a MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from ‘http://jigsaw.w3.org’ for mime types.
The magic numbers test is performed using the information form
magic2mimeTypeMap map. Each key from this map, is searched in the first bufferSize (the default
value is 2048) chars of text. The method that does this is called
runMagicNumbers(InputStreamReader aReader) and it belongs to DocumentFormat class. More
details about it can be found in the GATE API documentation.
In order to activate a reader to perform the unpacking, the creole definition of a GATE document defines a parameter called ‘markupAware’ initialized with a default value of true. This parameter, forces GATE to detect a proper reader for the document being read. If no reader is found, the document’s content is load and presented to the user, just like any other text editor (this for textual documents).
The next subsections investigates particularities for each format and will describe the file extensions registered with each document format.
GATE permits the processing of any XML document and offers support for XML namespaces. It benefits the power of Apache’s Xerces parser and also makes use of Sun’s JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property (‘javax.xml.parsers.SAXParserFactory’).
GATE will accept any well formed XML document as input. Although it has the possibility to validate XML documents against DTDs it does not do so because the validating procedure is time consuming and in many cases it issues messages that are annoying for the user.
There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we previously said, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases such as ‘end.</P><P>Start’ it might happen that the ending word of the previous annotation is concatenated with the beginning phrase of the annotation currently being created, resulting in a garbage input for GATE processing resources that operate at the text surface.
Let’s take another example in order to better understand the problem:
When the markup is transformed to annotations, it is likely that the text from the document’s content will be as follows:
This is a titleThis is a paragraphHere is an useful link
The annotations created will refer the right parts of the texts but for the GATE’s processing resources like (tokenizer, gazetter, etc) which work on this text, this will be a major disaster. Therefore, in order to prevent this problem from happening, GATE checks if it’s likely to join words and if this happens then it inserts a space between those words. So, the text will look like this after loaded in GATE Developer:
This is a title This is a paragraph Here is an useful link
There are cases when these words are meant to be joined, but they are rare. This is why it’s an open problem.
The extensions associate with the XML reader are:
The web server content type associate with xml documents is: text/xml.
The magic numbers test searches inside the document for the XML(<?xml version="1.0") signature. It is also able to detect if the XML document uses the semantics described in the GATE document format DTD (see 5.5.2 below) or uses other semantics.
GATE is capable of ensuring persistence for its resources. The types of persistent storage used for Language Resources are:
We describe the latter case here.
XML persistence doesn’t necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be of all kinds of objects, with various layers of nesting. For example, lists containing lists containing maps, etc. Serializing these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost(the types will be converted into Strings). GATE provides a full serialization of certain types of features such as collections, strings and numbers. It is possible to serialize only those collections containing strings or numbers. The rest of other features are serialized using their string representation and when read back, they will be all strings instead of being the original objects. Consequences of this might be observed when performing evaluations (see Chapter 10).
When GATE outputs an XML document it may do so in one of two ways:
In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-specific one which can be read back by the system to recreate all the information that GATE held internally for the document.
In order to understand why there are two types of XML serialization, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, sometimes it is impossible to save a document as XML using tags that surround the text referred to by the annotation, because tags crossover situations could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.
The problem of crossover tags appears with GATE’s second option (the preserve format one), which is implemented at the cost of losing certain annotations. The way it is applied in GATE is that it tries to restore the original markup and where it is possible, to add in the same manner annotations produced by GATE.
How to Access and Use the Two Forms of XML Serialization
This option is available in GATE Developer in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling ‘Save as XML’ on each document of the corpus. This option saves all the annotations of a document together their features(applying the restrictions previously discussed), using the GateDocument.dtd :
The document is saved under a name chosen by the user and it may have any extension. However, the recommended extension would be ‘xml’.
Using GATE Embedded, this option is available by calling gate.Document’s toXml() method. This method returns a string which is the XML representation of the document on which the method was called.
Note: It is recommended that the string representation to be saved on the file system using the UTF-8 encoding, as the first line of the string is : <?xml version="1.0" encoding="UTF-8"?>
Example of such a GATE format document:
Note: One must know that all features that are not collections containing numbers or strings or that are not numbers or strings are discarded. With this option, GATE does not preserve those features it cannot restore back.
The Preserve Format Option This option is available in GATE Developer from the popup menu of the annotations table. If no annotation in this table is selected, then the option will restore the document’s original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossed over condition, that annotation is discarded and a message is issued.
This option makes it possible to generate an XML document with tags surrounding the annotation’s referenced text and features saved as attributes. All features which are collections, strings or numbers are saved, and the others are discarded. However, when read back, only the attributes under the GATE namespace (see below) are reconstructed back differently to the others. That is because GATE does not store in the XML document the information about the features class and for collections the class of the items. So, when read back, all features will become strings, except those under the GATE namespace.
One will notice that all generated tags have an attribute called ‘gateId’ under the namespace ‘http://www.gate.ac.uk’. The attribute is used when the document is read back in GATE, in order to restore the annotation’s old ID. This feature is needed because it works in close cooperation with another attribute under the same namespace, called ‘matches’. This attribute indicates annotations/tags that refer the same entity1. They are under this namespace because GATE is sensitive to them and treats them differently to all other elements with their attributes which fall under the general reading algorithm described at the beginning of this section.
The ‘gateId’ under GATE namespace is used to create an annotation which has as ID the value indicated by this attribute. The ‘matches’ attribute is used to create an ArrayList in which the items will be Integers, representing the ID of annotations that the current one matches.
Example:
If the text being processed is as follows:
What GATE does when it parses this text is it creates two annotations:
Under GATE Embedded, this option is available by calling gate.Document’s toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction, and for those found to violate it, a warning will be issued and they will be discarded.
In the next subsections we will show how this option applies to the other formats supported by GATE.
HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE the same way as the XML documents.
The extensions associate with the HTML reader are:
The web server content type associate with html documents is: text/html.
The magic numbers test searches inside the document for the HTML(<html) signature.There are certain HTML documents that do not contain the HTML tag, so the magical numbers test might not hold.
There is a certain degree of customization for HTML documents in that GATE introduces new lines into the document’s text content in order to obtain a readable form. The annotations will refer the pieces of text as described in the original document but there will be a few extra new line characters inserted.
After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new line