The GATE User Guide
Hamish Cunningham
Diana Maynard
This version of the document is for GATE version 2 alpha 3, of May 2001. It is incomplete.
For installation and build instructions see the installation guide.
Introduction
GATE, a General Architecture for Text Engineering [Cun96b, Cun97a, Cun98, Cun99a, Cun00a, Cun01b], is a software architecture for Language Engineering [Cun99b]. More specifically, it is three things: an architecture; a framework; a development environment.
By architecture we mean an abstract description of how a language processing system may usefully be constructed, the types of component typically used and so on. By framework we mean an object-oriented class library that implements the architecture and provides a range of services that are usable in a variety of application contexts. One such application is a development environment built on top of the framework. The development environment is analogous to systems like Mathematica for mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.
Version 1 of GATE was released in 1996. It was written in C++ and Tcl, has been licenced by several hundred organisations, and used in a wide range of language analysis contexts including Information Extraction (IE - [Gai98a, Cun99c]) in English, Greek, Spanish, Swedish, German, Italian and French.
Version 2 of GATE was released in Spring 2001. It is written in Java, and is available as open source free software under the GNU licence at http://gate.ac.uk/.
For more details about human language processing in general see Sheffield NLP group or this paper on Language Engineering. For more details about Information Extraction see this User Guide to IE or the Sheffield IE pages.
The rest of this section gives a general introduction to the system. The rest of the document then covers:
- how to use the development environment
- how to use the framework
- the design principles of the architecture and framework.
Architectural principles
A central idea behind the GATE architecture is that there should be no requirement for users to commit to any particular theory of language processing: the architecture strives to be non-prescriptive and theory-neutral. Therefore there is a very general model of components and the data structures they share. This is, of course, both a strength and a weakness.
(Almost) everything in GATE is a component. Components are reusable software chunks with well-defined interfaces that are conceptually separate from GATE itself. All component sets are user-extensible and together are called CREOLE - a Collection of REusable Objects for Language Engineering.
GATE-based development
The framework is a backplane into which CREOLE components plug. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system. (To be precise, only their configuration data is loaded to begin with; the actual classes are loaded when the user requests the instantiation of a resource.)
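The deferred loading described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: LazyComponentRegistry and its methods are hypothetical names, not part of the GATE API, and java.util.ArrayList stands in for a real component class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the GATE API: a registry that records only
// configuration data (a name and a class name) at start-up and defers
// class loading until an instance is first requested.
class LazyComponentRegistry {
    // Configuration data: component name -> fully qualified class name.
    private final Map<String, String> classNames = new HashMap<String, String>();
    // Classes that have actually been loaded so far.
    private final Map<String, Class<?>> loaded = new HashMap<String, Class<?>>();

    // Registering a component stores its configuration only.
    void register(String name, String className) {
        classNames.put(name, className);
    }

    boolean isLoaded(String name) {
        return loaded.containsKey(name);
    }

    // The class is loaded on the first instantiation request.
    Object instantiate(String name) {
        String className = classNames.get(name);
        if (className == null) {
            throw new IllegalArgumentException("Unknown component: " + name);
        }
        try {
            Class<?> cls = loaded.get(name);
            if (cls == null) {
                cls = Class.forName(className);
                loaded.put(name, cls);
            }
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        LazyComponentRegistry registry = new LazyComponentRegistry();
        // java.util.ArrayList stands in for a real component class.
        registry.register("demo", "java.util.ArrayList");
        System.out.println("loaded before request: " + registry.isLoaded("demo"));
        Object component = registry.instantiate("demo");
        System.out.println("loaded after request: " + registry.isLoaded("demo"));
        System.out.println("instance of: " + component.getClass().getName());
    }
}
```

Keeping the class-name table separate from the loaded-class table means that registering many components costs little until one of them is actually used.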
The backplane performs these functions:
- component discovery, bootstrapping, loading and reloading;
- native data structures for common information types;
- generalised data storage and process execution.
A set of components plus the framework is a deployment unit which can be embedded in another application.
The key task of the development environment is to facilitate constructing components.
Component types
GATE components are specialised Java Beans of one of three types (ProcessingResource, LanguageResource and VisualResource), all of which share the top-level Resource interface:
Resource:
The top-level interface, which describes all components. What all components have in common is that they can be loaded at runtime, and that the set of components is extendable by clients. They have Features, which are represented externally to the system as "meta-data" in a format such as RDF, plain XML, or Java properties. Resources should probably all be Java beans.
ProcessingResource:
A resource that is runnable, may be invoked remotely (via RMI), and lives in class files. In order to load a PR the system just needs to know where to find the class or jar files (which will also include the metadata).
LanguageResource:
A resource that consists of data, accessed via a Java abstraction layer. LRs live in relational databases.
VisualResource:
A visual Java bean that is a component of GUIs, including the main GATE GUI. Like PRs, VRs live in .class or .jar files.
Bits and pieces
There are built-in components for common processing and data visualisation tasks. There is a finite state transduction language operating over annotations on text, called JAPE, a Jolly Advanced Pattern Eater. JAPE is based on Doug Appelt's TextPro language. There is automated measurement: precision, recall, diff over annotations on text. There is support for documents in XML, SGML, HTML, RTF and email formats. There is full Unicode support, including editing in a number of languages (not supported by the native JDK; thanks to Mark Leisher for help with this).
Development Environment
The GATE development environment is designed to facilitate the creation, development and testing of components for language processing R&D. We describe here how to perform these tasks, and how to use the tools for named entity recognition and results evaluation.
There are 6 main steps to using GATE.
- Bootstrap the basic software for new resources
- Instantiate the desired language resource(s)
- Instantiate appropriate processing resource(s)
- Create and run an application (a set of components)
- View the results of the application
- Apply further tools, e.g. evaluation of the results.
Bootstrapping New Resources
GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.
The development environment will dump out the basic form of a new resource Java class to disk for you: select "Bootstrap" from the "Tools" menu.
Loading Language Resources
Load a language resource by right clicking on "Language Resources" and selecting "Create Language Resource". Select "GATE document" and a pop-up window will appear. Choose a name for the resource, and select a file or URL as the value of "sourceUrl". Note that double clicking in the "values" box brings up a tree structure to enable selection of documents. Make any changes to default settings as required (e.g. encoding type used) and click OK. The document name and icon should appear in the left hand pane, and can be viewed in the main window by double clicking on the icon. The right hand pane enables annotations to be selected and viewed. At this stage, the only annotations displayed will be those which are produced as a result of the text structure analysis which transforms a text into a GATE document, e.g. XML or HTML tags. Additional language resources can be loaded by repeating the procedure.
Loading Processing Resources
Right click on "Processing Resources" and select "Create processing resource". Select the type of resource (e.g. tokeniser, gazetteer, etc.) from the list of options. In the pop-up box, choose a name for the resource, and either select the default value for the resource, or select a new one. Select any other values as appropriate (e.g. encoding). Click "OK". An icon should appear under "Processing Resources" in the left hand pane. Note that it may take a few seconds for the resource to be loaded. Repeat this procedure until all necessary resources have been loaded.
Running an Application
Once all the resources have been loaded, an application can be created and run. Right click on "Applications" and create a new one. Then double click on it and the "Design" tab will appear. Here you can select the resources needed to run the application (these may not necessarily be all those which have been loaded). Transfer the necessary components from the set of "available components" displayed on the right hand side of the main window to the set of "used components" on the left, by selecting each component and clicking on the left and right arrows. Ensure that the components are listed on the left in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the bottom of the pane. Once this is complete, move to the left hand pane, select the language resource to be used (using a left click), and finally right click on the application and select "Run".
Viewing the Results
Once the system has run, open the document to be viewed with a double click. Note that it may take a few seconds for the text to be displayed if it is long. The annotation types are displayed to the right of the text. Click on Default (the default annotation set) to display the annotation types. Then select the annotation types to be viewed. A checkbox will indicate which types are currently being displayed. The text segments corresponding to these annotations will be highlighted in the main text window. Fonts and colours of the annotations can be manually altered by double-clicking on the relevant annotation. Default colours and font settings can be altered in the same way, by double-clicking on the default button.
Descriptions of the annotations are simultaneously displayed in the bottom pane. These lists can be sorted in ascending and descending order by any column, by clicking on the corresponding column heading. An arrow will appear indicating the direction of the sorting. Clicking on an entry in the table will also highlight the respective matching text portion.
Right clicking on some part of the text in the main window will bring up a box containing a list of the annotations associated with it. Selecting one of these annotation types will highlight the relevant annotation description in the lower pane, if present. If not present (because the corresponding annotation on the right hand pane has not been selected), this annotation on the right will then be automatically selected and all relevant text in the main window will be appropriately highlighted.
Although there is no cursor displayed in the various windows, they can all be scrolled using the keyboard arrows, as well as by using the scrollbars.
At any time, the main viewer can also be used to display other information, such as Messages, by clicking on the header at the top of the main window.
Adding Annotations
In order to be able to add/edit annotations in GATE, the relevant Annotation Schemata must first be loaded. This is done by selecting an Annotation Schema (which is an XML file) from the Language Resources, for each annotation type.
Once the Annotation Schemata have been loaded, the annotation types that have a Schema present inside GATE can be added or edited. To add a new annotation, select the text, right click, and select an annotation set (either the default set, which contains the annotations already found, or create a new one). Then select the name of the annotation to be created. If the annotation can have features, another window will automatically open. Select a feature from the list of possible features, and click the arrow to transfer it to the list of current features. The feature values can be edited by clicking on them. The new annotation will be added to the annotation set, and will appear in the annotation description table.
An existing annotation can be modified by selecting it from the table and double clicking on it to bring up the features window. If, however, the schema has no features defined, then the selected annotation cannot be edited (since there are no features to edit). All that can be done is to add or delete the annotation. An annotation can be deleted by selecting it from the table, right clicking on it, and selecting Delete.
An annotated text can be saved in a data store. Create a data store by right clicking on Data store and selecting the option "Create Data Store". Select "Serial DataStore" as the data store type. Create a directory to be used as the data store (note that the data store is a directory and not a file). Save the text to the data store by right clicking on the document name and selecting the "Save to" option (giving the name of the datastore created earlier).
To load a document from a data store, do not try to load it as a language resource. Instead, open the data store, and double click on it to view its contents. Double click on the relevant file to display the text. Once the text has appeared in the main window, it can be treated in the same way as any other document.
The Evaluation Tool (Annotation Diff)
The evaluation tool is activated by selecting it from the Tools menu at the top of the window. It will appear in a new window. Select the key and response documents to be used (note that both must have been previously loaded into the system), the annotation type to be evaluated, and the annotation type to be used as the denominator for evaluating false positives (normally, Token). The user should ensure the key and response documents are explicitly selected, even if there are no alternative choices of document presented. Click on "do diff", and the results will be displayed.
In the main window, the key and response annotations will be displayed. They can be sorted by any category by clicking on the relevant column header. The key and response annotations will be aligned if their indices are identical, and are colour coded according to the legend displayed.
Evaluation metrics
Precision, recall and false positives are also displayed below the annotation tables, each according to 3 criteria - strict, lenient and average. The reason for these 3 criteria is to deal with partially correct responses in different ways.
- The Strict measure considers all partially correct responses as incorrect (spurious).
- The Lenient measure considers all partially correct responses as correct.
- The Average measure allocates a half weight to partially correct responses (i.e. it takes the average of strict and lenient).
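The three criteria can be expressed as a single weight applied to partially correct responses: 0 for strict, 1 for lenient and 0.5 for average. The sketch below is illustrative only. The class DiffMetrics, its method names and the example counts are hypothetical, and the denominators (all responses for precision, all key annotations for recall) are an assumption rather than a description of the Annotation Diff implementation.

```java
// Hypothetical sketch: names and counts are illustrative, not taken from
// the Annotation Diff implementation.
class DiffMetrics {
    static final double STRICT = 0.0;   // partial responses count as incorrect
    static final double LENIENT = 1.0;  // partial responses count as correct
    static final double AVERAGE = 0.5;  // partial responses get half weight

    // Assumed denominator: every response annotation.
    static double precision(int correct, int partial, int spurious, double weight) {
        int responses = correct + partial + spurious;
        return responses == 0 ? 0.0 : (correct + weight * partial) / responses;
    }

    // Assumed denominator: every key annotation.
    static double recall(int correct, int partial, int missing, double weight) {
        int keys = correct + partial + missing;
        return keys == 0 ? 0.0 : (correct + weight * partial) / keys;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        int correct = 6, partial = 2, spurious = 2, missing = 2;
        System.out.println("strict precision:  " + precision(correct, partial, spurious, STRICT));
        System.out.println("lenient precision: " + precision(correct, partial, spurious, LENIENT));
        System.out.println("average precision: " + precision(correct, partial, spurious, AVERAGE));
        System.out.println("average recall:    " + recall(correct, partial, missing, AVERAGE));
    }
}
```

With a weight of 0.5 the result is the arithmetic mean of the strict and lenient values, since both share the same denominator.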
The Framework
This section gives documentation for the framework; see also the JavaDoc pages. The CookBook class gives example code for using the GATE API.
The GATE framework models language processing components and the language data they operate on as Resources. The set of all resources is known as CREOLE, a Collection of REusable Objects for Language Engineering.
The terms component, resource and CREOLE object are largely synonymous.
The Processing Model
Any resource whose primary characteristics are algorithmic, such as parsers, generators and so on, is modelled as a ProcessingResource (PR). A PR is a Resource that implements the Java Runnable interface.
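As a rough illustration of a PR that is both a Resource and a Runnable, consider the sketch below. The interfaces SketchResource and SketchProcessingResource are hypothetical stand-ins whose method signatures are assumptions; they are not the framework's own interfaces. The toy tokeniser is likewise invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the framework interfaces; the names mirror
// the text but the method signatures are assumptions.
interface SketchResource {
    Map<String, Object> getFeatures();
}

interface SketchProcessingResource extends SketchResource, Runnable {
}

// A toy PR that splits its input text on whitespace when it is run.
class WhitespaceTokeniser implements SketchProcessingResource {
    private final Map<String, Object> features = new HashMap<String, Object>();
    private final List<String> tokens = new ArrayList<String>();
    private final String text;

    WhitespaceTokeniser(String text) {
        this.text = text;
    }

    public Map<String, Object> getFeatures() {
        return features;
    }

    public void run() {
        tokens.clear();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
    }

    List<String> getTokens() {
        return tokens;
    }

    public static void main(String[] args) throws InterruptedException {
        WhitespaceTokeniser pr = new WhitespaceTokeniser("Cyndi savored the soup.");
        // Because the PR is a Runnable it can be handed directly to a thread.
        Thread worker = new Thread(pr);
        worker.start();
        worker.join();
        System.out.println(pr.getTokens());
    }
}
```

Implementing Runnable lets the same object be called directly with run() or scheduled on a thread without any adapter code.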
The Visualisation Model
Resources whose task is to display and edit other resources are modelled as VisualResources (VRs).
The Corpus Model
A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).
Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.
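A minimal sketch of this model, assuming nothing beyond the standard Java collections, might look as follows. CorpusSketch and CorpusDocument are hypothetical names, plain Maps stand in for FeatureMaps, and the annotation sets are left untyped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical document: textual content, a feature map, and named
// annotation sets (element type left open in this sketch).
class CorpusDocument {
    final String content;
    final Map<String, Object> features = new HashMap<String, Object>();
    final Map<String, Set<Object>> annotationSets = new HashMap<String, Set<Object>>();

    CorpusDocument(String content) {
        this.content = content;
    }
}

// Hypothetical corpus: a Java Set of documents with its own feature map.
class CorpusSketch extends HashSet<CorpusDocument> {
    private static final long serialVersionUID = 1L;
    final Map<String, Object> features = new HashMap<String, Object>();

    public static void main(String[] args) {
        CorpusSketch corpus = new CorpusSketch();
        corpus.features.put("name", "demo corpus");

        CorpusDocument doc = new CorpusDocument("Cyndi savored the soup.");
        doc.features.put("language", "en");
        doc.annotationSets.put("Default", new HashSet<Object>());

        corpus.add(doc);
        corpus.add(doc);  // a Set holds each document at most once
        System.out.println("documents in corpus: " + corpus.size());
    }
}
```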
The Annotation Model
Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g. character offsets.
The rest of this section shows some simple examples of annotated documents.
This material is adapted from [Gri96b], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bir99], and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML community.
Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character. (NOTE: the ruler doesn't scale very well in HTML; for a better picture see the original TIPSTER Architecture Design Document.) Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.
The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.
Table 1. Result of annotation on a single sentence
Text | ||||
---|---|---|---|---|
Cyndi savored the soup. | ||||
|0...|5...|10..|15..|20 | ||||
Annotations | ||||
Id | Type | Span Start | Span End | Features |
1 | token | 0 | 5 | pos=NP |
2 | token | 6 | 13 | pos=VBD |
3 | token | 14 | 17 | pos=DT |
4 | token | 18 | 22 | pos=NN |
5 | token | 22 | 23 | |
6 | name | 0 | 5 | name_type=person |
7 | sentence | 0 | 23 |
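The annotations in Table 1 can be written down as data and checked against the text. The sketch below uses hypothetical classes (SketchNode, SketchAnnotation, TableOneSketch) rather than the GATE API, and holds the annotations in a List instead of a Set purely so that they print in table order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical classes: names and fields are illustrative, not the GATE API.
class SketchNode {
    final int offset;  // character offset into the source document

    SketchNode(int offset) {
        this.offset = offset;
    }
}

class SketchAnnotation {
    final int id;
    final String type;
    final SketchNode start;
    final SketchNode end;
    final Map<String, Object> features = new LinkedHashMap<String, Object>();

    SketchAnnotation(int id, String type, int startOffset, int endOffset) {
        this.id = id;
        this.type = type;
        this.start = new SketchNode(startOffset);
        this.end = new SketchNode(endOffset);
    }

    // Attaches one feature and returns this annotation for chaining.
    SketchAnnotation with(String name, Object value) {
        features.put(name, value);
        return this;
    }

    // The text covered by the span between the start and end nodes.
    String coveredText(String documentText) {
        return documentText.substring(start.offset, end.offset);
    }
}

class TableOneSketch {
    static final String TEXT = "Cyndi savored the soup.";

    // The seven annotations of Table 1.
    static List<SketchAnnotation> annotations() {
        List<SketchAnnotation> all = new ArrayList<SketchAnnotation>();
        all.add(new SketchAnnotation(1, "token", 0, 5).with("pos", "NP"));
        all.add(new SketchAnnotation(2, "token", 6, 13).with("pos", "VBD"));
        all.add(new SketchAnnotation(3, "token", 14, 17).with("pos", "DT"));
        all.add(new SketchAnnotation(4, "token", 18, 22).with("pos", "NN"));
        all.add(new SketchAnnotation(5, "token", 22, 23));
        all.add(new SketchAnnotation(6, "name", 0, 5).with("name_type", "person"));
        all.add(new SketchAnnotation(7, "sentence", 0, 23));
        return all;
    }

    public static void main(String[] args) {
        for (SketchAnnotation a : annotations()) {
            System.out.println(a.id + " " + a.type + " [" + a.coveredText(TEXT) + "] " + a.features);
        }
    }
}
```

Because the annotations hold only offsets, the document text itself is never modified or duplicated.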
Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.
Table 2. Result of annotations including parse information
Text | ||||
---|---|---|---|---|
Cyndi savored the soup. | ||||
|0...|5...|10..|15..|20 | ||||
Annotations | ||||
Id | Type | Span Start | Span End | Features |
1 | token | 0 | 5 | pos=NP |
2 | token | 6 | 13 | pos=VBD |
3 | token | 14 | 17 | pos=DT |
4 | token | 18 | 22 | pos=NN |
5 | token | 22 | 23 | |
6 | name | 0 | 5 | name_type=person |
7 | sentence | 0 | 23 | constituents=[1],[2],[3],[4],[5] |
8 | parse | 0 | 5 | symbol="NP",constituents=[1] |
9 | parse | 14 | 22 | symbol="NP",constituents=[3],[4] |
10 | parse | 6 | 22 | symbol="VP",constituents=[2],[9] |
11 | parse | 0 | 22 | symbol="S",constituents=[8],[10] |
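The constituents references in Table 2 form a tree over annotation Ids. As a sketch only, the hypothetical class below records the spans and constituents from the table (the name annotation, Id 6, is omitted because it takes no part in the hierarchy), checks that every constituent lies inside its parent's span, and flattens an annotation to the tokens it dominates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the data in Table 2; not part of the GATE API.
class ConstituentSketch {
    // Annotation Id -> {span start, span end}.
    static final Map<Integer, int[]> SPANS = new LinkedHashMap<Integer, int[]>();
    // Annotation Id -> Ids of its immediate constituents (tokens have none).
    static final Map<Integer, int[]> CONSTITUENTS = new LinkedHashMap<Integer, int[]>();

    static {
        SPANS.put(1, new int[] {0, 5});
        SPANS.put(2, new int[] {6, 13});
        SPANS.put(3, new int[] {14, 17});
        SPANS.put(4, new int[] {18, 22});
        SPANS.put(5, new int[] {22, 23});
        SPANS.put(7, new int[] {0, 23});
        SPANS.put(8, new int[] {0, 5});
        SPANS.put(9, new int[] {14, 22});
        SPANS.put(10, new int[] {6, 22});
        SPANS.put(11, new int[] {0, 22});

        CONSTITUENTS.put(7, new int[] {1, 2, 3, 4, 5});
        CONSTITUENTS.put(8, new int[] {1});
        CONSTITUENTS.put(9, new int[] {3, 4});
        CONSTITUENTS.put(10, new int[] {2, 9});
        CONSTITUENTS.put(11, new int[] {8, 10});
    }

    // True when every immediate constituent lies within the parent's span.
    static boolean constituentsNested(int id) {
        int[] parent = SPANS.get(id);
        int[] children = CONSTITUENTS.get(id);
        if (children == null) {
            return true;
        }
        for (int child : children) {
            int[] span = SPANS.get(child);
            if (span[0] < parent[0] || span[1] > parent[1]) {
                return false;
            }
        }
        return true;
    }

    // Follows constituents references down to the token Ids.
    static List<Integer> tokens(int id) {
        List<Integer> result = new ArrayList<Integer>();
        int[] children = CONSTITUENTS.get(id);
        if (children == null) {
            result.add(id);
            return result;
        }
        for (int child : children) {
            result.addAll(tokens(child));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Integer id : CONSTITUENTS.keySet()) {
            System.out.println(id + " nested=" + constituentsNested(id) + " tokens=" + tokens(id));
        }
    }
}
```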
In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of a feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents.
At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 3).
Table 3. Annotation showing overall document structure
Text | ||||
---|---|---|---|---|
To: All Barnyard Animals | ||||
|0...|5...|10..|15..|20.. | ||||
From: Chicken Little | ||||
|25..|30..|35..|40..|45.. | ||||
Date: November 10,1194 | ||||
....|50..|55..|60..|65.. | ||||
Subject: Descending Firmament | ||||
|70..|75..|80..|85..|90..|95.. | ||||
Priority : Urgent. | ||||
|100.|105.|110.|115. | ||||
The sky is falling. The sky is falling. | ||||
....|120.|125.|130.|135.|140.|145.|150. | ||||
Annotations | ||||
Id | Type | Span Start | Span End | Features |
1 | Addressee | 4 | 24 | |
2 | Source | 31 | 45 | |
3 | Date | 53 | 69 | ddmmyy=101194 |
4 | Subject | 78 | 98 | |
5 | Priority | 109 | 115 | |
6 | Body | 116 | 155 | |
7 | Sentence | 116 | 135 | |
8 | Sentence | 136 | 155 |
If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields. Our final example (Table 4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:
Table 4. Annotation modifying the document
Text | ||||
---|---|---|---|---|
Topster tackles 2 terrorbytes. | ||||
|0...|5...|10..|15..|20..|25.. | ||||
Annotations | ||||
Id | Type | Span Start | Span End | Features |
1 | token | 0 | 7 | pos=NP correction=TIPSTER |
2 | token | 8 | 15 | pos=VBZ |
3 | token | 16 | 17 | pos=CD |
4 | token | 18 | 29 | pos=NNS correction=terabytes |
5 | token | 29 | 30 |
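One way to use the correction features in Table 4 is to build a corrected view of the text while leaving the original untouched. The sketch below is illustrative only: CorrectionSketch and its Token class are hypothetical, and the method assumes that tokens are supplied in document order and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the data in Table 4; not part of the GATE API.
class CorrectionSketch {
    static final String TEXT = "Topster tackles 2 terrorbytes.";

    static final class Token {
        final int start;
        final int end;
        final String correction;  // null when the token has no correction feature

        Token(int start, int end, String correction) {
            this.start = start;
            this.end = end;
            this.correction = correction;
        }
    }

    // The five token annotations of Table 4 (pos features omitted).
    static List<Token> tableFourTokens() {
        List<Token> tokens = new ArrayList<Token>();
        tokens.add(new Token(0, 7, "TIPSTER"));
        tokens.add(new Token(8, 15, null));
        tokens.add(new Token(16, 17, null));
        tokens.add(new Token(18, 29, "terabytes"));
        tokens.add(new Token(29, 30, null));
        return tokens;
    }

    // Builds a new string in which corrected tokens are substituted.
    // Assumes tokens are sorted by start offset and do not overlap.
    static String correctedView(String text, List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        int position = 0;
        for (Token t : tokens) {
            out.append(text, position, t.start);  // text between tokens
            out.append(t.correction != null ? t.correction : text.substring(t.start, t.end));
            position = t.end;
        }
        out.append(text, position, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(TEXT);
        System.out.println(correctedView(TEXT, tableFourTokens()));
    }
}
```

The original string and the offsets of every annotation stay valid, because the corrected text is produced as a separate value.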
Design
GATE is a backplane into which specialised Java Beans plug. These beans are loosely coupled with respect to each other - they communicate entirely by means of the GATE framework. Inter-component communication is handled by model components (LanguageResources) and by events.
Components are defined by conformance to various interfaces (e.g. LanguageResource), ensuring separation of interface and implementation.
Distribution and parallelism (NOT fully working as yet) are handled by controller components (and by distributing data over HTTP and JDBC).
The reason for adding to the normal bean initialisation mechanism is that LRs, PRs and VRs all have characteristic parameterisation phases; the GATE resources/components model makes these phases explicit.
Patterns
GATE is structured around a number of what we might call principles, or patterns, or alternatively, clever ideas stolen from better minds than mine. These patterns are:
- modelling most things as extensible sets of components;
- separating components into model, view, or controller types;
- hiding implementation behind interfaces.
Four interfaces in the top-level package describe the GATE view of components: Resource, ProcessingResource, LanguageResource and VisualResource.
Components
Architectural Principle
Wherever users of the architecture may wish to extend the set of a particular type of entity, those types should be expressed as components.
Another way to express this is to say that the architecture is based on agents. I've avoided this in the past because of an association between this term and the idea of bits of code moving around between machines of their own volition. I take this to be somewhat pointless, and probably the result of an anthropomorphic obsession with mobility as a correlate of intelligence. If we drop this connotation, however, we can say that GATE is an agent-based architecture. If we want to, that is.
Framework Expression
Many of the classes in the framework are components, by which we mean classes that conform to an interface with certain standard properties. In our case these properties are based on the Java Beans component architecture, with the addition of component metadata, automated loading and standardised storage, threading and distribution.
All components inherit from Resource, via one of:
- LanguageResource (LR) represents entities such as lexicons, corpora or ontologies;
- VisualResource (VR) represents visualisation and editing components that participate in GUIs;
- ProcessingResource (PR) represents entities that are primarily algorithmic, such as parsers, generators or ngram modellers.
Model, view, controller
According to Buschmann et al (Pattern-Oriented Software Architecture, 1996), the Model-View-Controller (MVC) pattern
...divides an interactive application into three components. The model contains the core functionality and data. Views display information to the user. Controllers handle user input. Views and controllers together comprise the user interface. A change-propagation mechanism ensures consistency between the user interface and the model. [p.125]
A variant of MVC, the Document-View pattern,
...relaxes the separation of view and controller... The View component of Document-View combines the responsibilities of controller and view in MVC, and implements the user interface of the system.
A benefit of both arrangements is that
...loose coupling of the document and view components enables multiple simultaneous synchronized but different views of the same document.
Geary (Graphic Java 2, 3rd Edtn., 1999) gives a slightly different view:
MVC separates applications into three types of objects:
- Models: Maintain data and provide data accessor methods
- Views: Paint a visual representation of some or all of a model's data
- Controllers: Handle events
... By encapsulating what other architectures intertwine, MVC applications are much more flexible and reusable than their traditional counterparts. [pp. 71, 75]
Swing, the Java user interface framework, uses
a specialised version of the classic MVC meant to support pluggable look and feel instead of applications in general. [p. 75]
GATE may be regarded as an MVC architecture in two ways:
- directly, because we use the Swing toolkit for the GUIs;
- by analogy, where LRs are models, VRs are views and PRs are controllers. Of these, the last sits least easily with the MVC scheme, as PRs may indeed be controllers but may also not be.
Interfaces
Architectural Principle
The implementation of types should generally be hidden from the clients of the architecture.
Framework Expression
With a few exceptions (such as for utility classes), clients of the framework work with the gate.* package. This package is mostly composed of interface definitions. Instantiations of these interfaces are obtained via the Factory class.
The subsidiary packages of GATE provide the implementations of the gate.* interfaces that are accessed via the factory. They themselves avoid directly constructing classes from other packages (with a few exceptions, such as JAPE's need for unattached annotation sets). Instead they use the factory.
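The principle can be illustrated with a small, self-contained sketch in which client code sees only an interface and obtains instances from a factory. SketchFeatureMap, SketchFactory and the hidden HashFeatureMap class are hypothetical names chosen for the example; they are not the framework's own classes or signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface that client code is written against.
interface SketchFeatureMap {
    void put(String name, Object value);
    Object get(String name);
    int size();
}

// Hypothetical factory: the only place that names a concrete class.
final class SketchFactory {
    private SketchFactory() {
    }

    static SketchFeatureMap newFeatureMap() {
        return new HashFeatureMap();
    }

    // Hidden implementation; clients cannot refer to it by name.
    private static final class HashFeatureMap implements SketchFeatureMap {
        private final Map<String, Object> values = new HashMap<String, Object>();

        public void put(String name, Object value) {
            values.put(name, value);
        }

        public Object get(String name) {
            return values.get(name);
        }

        public int size() {
            return values.size();
        }
    }
}

class FactoryDemo {
    public static void main(String[] args) {
        // The client depends on the interface and the factory only.
        SketchFeatureMap features = SketchFactory.newFeatureMap();
        features.put("pos", "NN");
        System.out.println(features.get("pos") + " (" + features.size() + " feature)");
    }
}
```

Since the implementation class is private to the factory, it can be replaced without any change to client code.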
Notes
Development Notes
Integrating Sicstus Prolog programs
Sicstus provide a nice interface for Java, called Jasper, based on a native code library that is available for different platforms as part of the Sicstus distribution. Linking native code with Java is a slightly risky business - unlike the gentle degradation often available via Java exceptions, native code problems are likely to crash the whole application. It is also difficult to get a bug-free implementation of this type of language mixing to work identically on different platforms. For example:
In Sicstus 3.8.6 on NT, when calling Sicstus from Java, if the memory allocation directive -Xmx200m is given to the JVM, the sicstus runtime throws this error:
{ERROR: Memory allocation failed (upper 4 bits do not match MallocBase)} Signal 127
Using a slightly different version of jasper.jar than the one in the Sicstus distribution you're getting the native code from (e.g. NT version 3.8.4 vs. Linux version 3.8.6) can crash the system, unsurprisingly. But if we include jasper.jar in the GATE libraries then this is likely to happen quite often.
For these reasons we don't include Sicstus support in GATE itself, but will be happy to supply example code from our own modules that integrate Sicstus code with GATE on request.
References
Bir99 S. Bird and M. Liberman. A Formal Framework for Linguistic Annotation. Department of Computer and Information Science, University of Pennsylvania, 1999. http://xxx.lanl.gov/abs/cs.CL/9903003.
Cun98 H. Cunningham, M. Stevenson and Y. Wilks. Implementing a Sense Tagger within a General Architecture for Language Engineering. In Proceedings of the Third Conference on New Methods in Language Engineering (NeMLaP-3), pages 59-72, Sydney, Australia, 1998.
Cun00a H. Cunningham. Software Architecture for Language Engineering. University of Sheffield, 2000. http://gate.ac.uk/sale/thesis/.
Cun01b H. Cunningham. GATE, a General Architecture for Text Engineering. [in press], pages ??, vol ??, 2001. Accepted for publication by Computing and the Humanities, May 2001.
Cun96b H. Cunningham, Y. Wilks and R. Gaizauskas. GATE - a General Architecture for Text Engineering. In Proceedings of the 16th Conference on Computational Linguistics (COLING-96), Copenhagen, August 1996.
Cun97a H. Cunningham, K. Humphreys, R. Gaizauskas and Y. Wilks. Software Infrastructure for Natural Language Processing. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97), March 1997. http://xxx.lanl.gov/abs/cs.CL/9702005.
Cun99a H. Cunningham, R.G. Gaizauskas, K. Humphreys and Y. Wilks. Experience with a Language Engineering Architecture: Three Years of GATE. In Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, The Society for the Study of Artificial Intelligence and Simulation of Behaviour, Edinburgh, April 1999.
Cun99b H. Cunningham. A Definition and Short History of Language Engineering. Journal of Natural Language Engineering, pages 1-16, vol 5, 1999.
Cun99c H. Cunningham. Information Extraction: a User Guide (revised version). Department of Computer Science, University of Sheffield, May 1999.
Gai98a R. Gaizauskas and Y. Wilks. Information Extraction: Beyond Document Retrieval. Journal of Documentation, pages 70-105, vol 54, 1998.
Gri96b R. Grishman. TIPSTER Architecture Design Document Version 2.3. DARPA, Lawrence Erlbaum, 1997. http://www.itl.nist.gov/div894/894.02/related_projects/tipster/.