Chapter 4
CREOLE: the GATE Component Model [#]
The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun, as described in section 4.2.
GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.
GATE components are one of three types:
- LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;
- ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;
- VisualResources (VRs) represent visualisation and editing components that participate in GUIs.
The distinction between language resources and processing resources is explored more fully in section D.1.1. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.
In the rest of this chapter:
- Section 4.3 describes the lifecycle of GATE components;
- Section 4.4 describes how Processing Resources can be grouped into applications;
- Section 4.5 describes the relationship between Language Resources and their datastores;
- Section 4.6 summarises GATE’s set of built-in components;
- Section 4.7 describes how configuration data for Resource types is supplied to GATE.
4.1 The Web and CREOLE [#]
GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations for configuration of resources (and GATE itself).
Resource implementations are grouped together as ‘plugins’, stored either in a single JAR file published via the standard Maven repository mechanism, or at a URL (when the resources are in the local file system this would be a file:/ URL). When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL or inside the plugin JAR file and uses the contents of this file in combination with Java annotations on the source code to determine what resources this plugin declares and, in the case of directory-style plugins, where to find the classes that implement the resource types (typically a JAR file in the plugin directory). GATE retrieves the configuration information from the plugin’s resource classes and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.
Language resource data can be stored in binary serialised form in the local file system.
4.2 The GATE Framework [#]
We can think of the GATE framework as a backplane into which users can plug CREOLE components. The user gives the system a list of plugins to search when it starts up, and components in those plugins are loaded by the system.
The backplane performs these functions:
- component discovery, bootstrapping, loading and reloading;
- management and visualisation of native data structures for common information types;
- generalised data storage and process execution.
A set of components plus the framework is a deployment unit which can be embedded in another application.
At their most basic, all GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions:
- beans must have no-argument constructors.
- beans have properties, defined by pairs of methods named by the convention setProp and getProp .
GATE uses Java Beans conventions to construct and configure resources at runtime, and defines interfaces that different component types must implement.
4.3 The Lifecycle of a CREOLE Resource [#]
CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using GATE Developer, resources can be loaded and viewed via the resources tree (left pane) and the ‘create resource’ mechanism. When programming with GATE Embedded, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see Chapter 6) does; in this case you will use GATE Developer to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.
The various phases may be summarised as:
- Creating a new resource from scratch (bootstrapping).
- To create the binary image of a resource (a Java class in a JAR file), and the XML file that describes the resource to GATE, you need to create the appropriate .java file(s), compile them and package them as a .jar. GATE provides a Maven archetype to start this process – see Section 7.12. Alternatively you can simply copy code from an existing resource.
- Instantiating a resource in GATE Embedded.
- To create a resource in your own Java code, use GATE’s Factory class (this takes care of parameterising the resource, restoring it from a database where appropriate, etc. etc.). Section 7.2 describes how to do this.
- Loading a resource into GATE Developer.
- To load a resource into GATE Developer, use the various ‘New ... resource’ options from the File menu and elsewhere. See Section 3.1.
- Resource configuration and implementation.
- GATE’s Maven archetype will create an empty resource that does nothing. In order to achieve the behaviour you require, you’ll need to change the Java code and its configuration annotations. See section 4.7 for more details.
4.4 Processing Resources and Applications [#]
PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In GATE, applications are called ‘controllers’ accordingly.
Currently the main application types provided by GATE implement sequential or “pipeline” control flow. There are two main types of pipeline:
- Simple pipelines
- simply group a set of PRs together in order and execute them in turn. The implementing class is called SerialController.
- Corpus pipelines
- are specific for LanguageAnalysers – PRs that are applied to documents and corpora. A corpus pipeline opens each document in the corpus in turn, sets that document as a runtime parameter on each PR, runs all the PRs on the corpus, then closes the document. The implementing class is called SerialAnalyserController.
Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.8.2 for how to use these. If more flexibility is required, the Groovy plugin provides a scriptable controller (see section 7.16.3) whose execution strategy is specified using the Groovy programming language.
Controllers are themselves PRs – in particular a simple pipeline is a standard PR and a corpus pipeline is a LanguageAnalyser – so one pipeline can be nested in another. This is particularly useful with conditional controllers to group together a set of PRs that can all be turned on or off as a group.
There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process, are simply ignored and the execution moves to the next document after the timeout interval has lapsed.
All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application’s execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.
4.5 Language Resources and Datastores [#]
Language Resources can be stored in Datastores. Datastores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:
- Serial Datastores
- are based on Java’s serialisation system, and store data directly into files and directories.
- Lucene Datastores
- is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Datastores. See Section 9 for more details.
4.6 Built-in CREOLE Resources [#]
GATE comes with various built-in components:
- Language Resources modelling Documents and Corpora, and various types of Annotation Schema – see Chapter 5.
- Processing Resources that are part of the ANNIE system – see Chapter 6.
- Gazetteers – see Chapter 13.
- Ontologies – see Chapter 14.
- Machine Learning resources – see Chapter 19.
- Alignment tools – see Chapter 20.
- Parsers and taggers – see Chapter 18.
- Other miscellaneous resources – see Chapter 23.
4.7 CREOLE Resource Configuration [#]
This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory or JAR file containing an XML configuration file called creole.xml at its root. The creole.xml file provides metadata about the plugin as a whole, the configuration for individual resource classes is given directly in the Java source file using Java annotations.
A creole.xml file has a root element <CREOLE-DIRECTORY> which supports several optional attribute:
- NAME:
- The name of the plugin. Used in the GUI to help identify the plugin in a nicer way than the direcory or artifact name.
- VERSION:
- The version number of the plugin. For example, 3, 3.1, 3.11, 3.12-SNAPSHOT etc.
- DESCRIPTION:
- A short description of the resources provided by the plugin. Note that there is really only space for a single sentence in the GUI.
- GATE-MIN:
- The earliest version of GATE that this plugin is compatible with. This should be in the same format as the version shown in the GATE titlebar, i.e. 8.5 or 8.6-beta1. Do not include the build number information.
Currently all these attributes are optional, as in most cases the information can be pulled from other elements of the plugin metadata; for example, plugins distributued via Maven will use information from the pom.xml if not specified.
For many simple single-JAR plugins the creole.xml file need have no other content – just an empty <CREOLE-DIRECTORY /> element – but there are certain child elements that are used in some types of plugin.
Directory-style plugins need at least one <JAR> child element to tell GATE where to find the classes that implement the plugin’s resources. Each <JAR> element contains a path to a JAR file, which is resolved relative to the location of the creole.xml, for example:
<JAR SCAN="true">myPlugin.jar</JAR>
<JAR>lib/thirdPartyLib.jar</JAR>
</CREOLE-DIRECTORY>
JAR files that contain resource classes must be specified with SCAN="true", which tells GATE to scan the JAR contents to discover resource classes annotated with @CreoleResource (see below). Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".
Plugins can depend on other plugins, for example if a plugin defines a PR which internally makes use of a JAPE transducer then that plugin would declare that it depends on ANNIE (the standard plugin that defines the JAPE transducer PR). This is done with a <REQUIRES> element. To depend on a single-JAR plugin from a Maven repository, use an empty element with attributes GROUP, ARTIFACT and VERSION, for example
<REQUIRES GROUP="uk.ac.gate.plugins"
ARTIFACT="annie"
VERSION="8.5" />
</CREOLE-DIRECTORY>
Directory-style plugins can also depend on other directory-style plugins using a relative path (e.g. <REQUIRES>../other-plugin</REQUIRES>, but this is generally discouraged – if your plugin is likely to be required as a dependency of other plugins then it is better converted to the single JAR Maven style so the dependency can be handled via group/artifact/version co-ordinates.
You may see old plugins with other elements such as <RESOURCE>, this is the older style of configuration in XML, which is now deprecated in favour of the annotations described below.
4.7.1 Configuring Resources using Annotations [#]
The configuration of the resources within a plugin is handled using Java annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and used to construct the resource data that is registered with the CREOLE register. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation.
For more detailed information, see the JavaDoc documentation for gate.creole.metadata.
Basic Resource-Level Data
To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:
2import gate.creole.metadata.*;
3
4@CreoleResource(name = "GATE Tokeniser",
5 comment = "Splits text into tokens and spaces")
6public class Tokeniser extends AbstractLanguageAnalyser {
7 ...
8}
The @CreoleResource annotation provides slots for various configuration values:
- name
- (String) the name of the resource, as it will appear in the ‘New’ menu in GATE Developer. If omitted, defaults to the bare name of the resource class (without a package name).
- comment
- (String) a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in GATE Developer. If omitted, no comment is used.
- helpURL
- (String) a URL to a help document on the web for this resource. It is used in the help browser inside GATE Developer.
- isPrivate
- (boolean) should this resource type be hidden from the GATE Developer GUI, so it does not appear in the ‘New’ menus? If omitted, defaults to false (i.e. not hidden).
- icon
- (String) the icon to use to represent the resource in GATE Developer. If omitted, a generic
language resource or processing resource icon is used. The value of this element can
be:
- a plain name such as “Application”, which is prepended with the package name gate.resources.img.svg. and the suffix “Icon”, which is assumed to be a Java class implementing javax.swing.Icon. GATE provides a collection of these icon classes which are generated from SVG files and are fully scalable for high-DPI monitors.
- a path to an image file inside the plugin’s JAR, starting with a forward slash, e.g. /myplugin/images/icon.png
- interfaceName
- (String) the interface type implemented by this resource, for example a new type of document would specify "gate.Document" here.
- tool
- (boolean) is this resource type a tool? The “tool” flag identifies things like resource helpers and resources that contribute items to the tools menu in GATe Developer.
- autoInstances
- (array of @AutoInstance annotations) definitions for any instances of this resource that should be created automatically when the plugin is loaded. If omitted, no auto-instances are created by default. Auto-instances are useful for things like document formats and tools which contribute behaviour to other GATE resources, and which should be available by default whenever the plugin is loaded.
For visual resources only, the following elements are also available:
- guiType
- (GuiType enum) the type of GUI this resource defines. The options are LARGE (the VR should appear in the main right-hand panel of the GUI) or SMALL (the VR should appear in the bottom left hand corner below the resources tree).
- resourceDisplayed
- (String) the class name of the resource type that this VR displays, e.g. "gate.Corpus". Any resource whose type is assignable to this type will be displayed with this viewer, so for example a VR that can display all types of document would specify gate.Document, whereas a VR that can only display the default GATE document implementation would specify gate.corpora.DocumentImpl.
- mainViewer
- (boolean) is this VR the ‘most important’ viewer for its displayed resource type? If there are several different viewers that are all applicable to a particular resource type, the mainViewer hint helps GATE Developer decide which one should be initially visible as the selected tab.
For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).
Resource Parameters
Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:
public void setAbbrListUrl(URL listUrl) {
...
GATE will infer the parameter’s name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter’s type is inferred from the type of the method parameter (java.net.URL in this case).
The annotation elements of @CreoleParameter are as follows:
- comment
- (String) an optional descriptive comment about the parameter.
- defaultValue
- (String) the optional default value for this parameter. The value is specified as a string but is converted to the relevant type by GATE according to the conversions described below.
- suffixes
- (String) for parameters of type URL or ResourceReference, a semicolon-separated list of default file suffixes that this parameter accepts.
- collectionElementType
- (Class) for Collection-valued parameters, the type of the elements in the collection. This can usually be inferred from the generic type information, for example public void setIndices(List<Integer> indices), but must be specified if the set method’s parameter has a raw (non-parameterized) type.
Parameter default values must be specified as strings, but parameters can be of any type and GATE applies the following rules to convert the default string into an appropriate value for the parameter type:
- String
- if the parameter is of type String the default value is used directly
- Primitive wrapper types e.g. Integer
- the string is passed to the relevant valueOf method
- enum types
- the value is passed to Enum.valueOf
- java.net.URL or gate.creole.ResourceReference
- the string is parsed as a URI, and if the URI is relative then it is resolved against the plugin (for directory-style plugins this means against the location of creole.xml and for Maven plugins it is the root of the plugin JAR file)
- collection types (Set, List, etc.)
- the string is treated as a semicolon-separated list of values, and each value is converted to the collection’s element type following these same rules.
- gate.FeatureMap
- the string is parsed as “feature1=value1;feature2=value2” etc. (a semicolon-separated list of “name=value” pairs)
- any other java.* type
- if the type has a constructor taking a String then that constructor is called with the default string as its parameter.
If there is no default specified, the default value is null.
Mutually-exclusive parameters are handled by adding a disjunction="label" and priority=n to the @CreoleParameter annotation – all parameters that share the same label are grouped in the same disjunction, and will be offered in order of priority. The parameter with the smallest priority value will be the one listed first, and thus the one that is offered initially when creating a resource of this type in GATE Developer. For example, the following is a simplified extract from gate.corpora.DocumentImpl:
2public void setSourceUrl(URL src) { /∗ ∗/ }
3
4@CreoleParameter(disjunction="src", priority=2)
5public void setStringContent(String content) { /∗ ∗/ }
This declares the parameters “stringContent” and “sourceUrl” as mutually-exclusive, and when creating an instance of this resource in GATE Developer the parameter that will be shown initially is sourceUrl. To set stringContent instead the user must select it from the drop-down list. Parameters with the same declared priority value will appear next to each other in the list, but their relative ordering is not specified. Parameters with no explicit priority are always listed after those that do specify a priority.
Optional and runtime parameters are marked using extra annotations, for example:
Runtime parameters apply only to Processing Resources, and are parameters that are not used when the resource is initialised but instead only when it is executed. An “optional” parameter is one that does not have to be set before creating or executing the resource.
Inheritance
A resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class.
The resource name and the isPrivate and mainViewer flags are not inherited.
Parameter definitions are inherited in a similar way. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.
Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:
2 public void setSomeParam(String someParam) {
3 super.setSomeParam(someParam);
4 }
The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.
Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:
2 @CreoleParameter(comment = "Location of the grammar file",
3 suffixes = "jape")
4 public void setGrammarUrl(URL grammarLocation) {
5 ...
6 }
7
8 @Optional
9 @RunTime
10 @CreoleParameter(comment = "Feature to set on success")
11 public void setSuccessFeature(String name) {
12 ...
13 }
2 // subclass
3
4 // override the default value, inherit everything else
5 @CreoleParameter(defaultValue = "resources/defaultGrammar.jape")
6 public void setGrammarUrl(URL url) {
7 super.setGrammarUrl(url);
8 }
9
10 // we want the parameter to be required in the subclass
11 @Optional(false)
12 @CreoleParameter
13 public void setSuccessFeature(String name) {
14 super.setSuccessFeature(name);
15 }
Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource.
4.7.2 Loading Third-Party Libraries in a Maven plugin [#]
A Maven plugin is distributed as a single JAR file, but if the plugin depends on any third-party libraries these can be specified as dependencies in the corresponding POM file in the usual Maven way as compile or runtime scoped dependencies.
If one plugin has a compile-time dependency on another (as opposed to simply a runtime dependency when one plugin creates resources defined in another) then you should specify the dependency in your POM as <scope>provided</scope> as well as declaring it in creole.xml with group/artifact/version.
4.8 Tools: How to Add Utilities to GATE Developer [#]
Visual Resources allow a developer to provide a GUI to interact with a particular resource type (PR or LR), but sometimes it is useful to provide general utilities for use in the GATE Developer GUI that are not tied to any specific resource type. Examples include the annotation diff tool and the Groovy console (provided by the Groovy plugin), both of which are self-contained tools that display in their own top-level window. To support this, the CREOLE model has the concept of a tool.
A resource type is marked as a tool by setting tool = true in the @CreoleResource annotation. If a resource is declared to be a tool, and written to implement the gate.gui.ActionsPublisher interface, then whenever an instance of the resource is created its published actions will be added to the “Tools” menu in GATE Developer.
Since the published actions of every instance of the resource will be added to the tools menu, it is best not to use this mechanism on resource types that can be instantiated by the user. The “tool” marker is best used in combination with the “private” flag (to hide the resource from the list of available types in the GUI) and one or more hidden autoinstance definitions to create a limited number of instances of the resource when its defining plugin is loaded. See the GroovySupport resource in the Groovy plugin for an example of this.
4.8.1 Putting Your Tools in a Sub-Menu [#]
If your plugin provides a number of tools (or a number of actions from the same tool) you may wish to organise your actions into one or more sub-menus, rather than placing them all on the single top-level tools menu. To do this, you need to put a special value into the actions returned by the tool’s getActions() method:
The key must be GateConstants.MENU_PATH_KEY and the value must be an array of strings. Each string in the array represents the name of one level of sub-menus. Thus in the example above the action would be placed under “Tools → Acme toolkit → Statistics”. If no MENU_PATH_KEY value is provided the action will be placed directly on the Tools menu.
4.8.2 Adding Tools To Existing Resource Types [#]
While Visual Resources (VR) allow you to add new features to a particular resource they have a number of shortcomings. Firstly not every new feature will require a full VR; often a new entry on the resources right-click menu will suffice. More importantly new feautres added via a VR are only available while the VR is open. A Resource Helper is a form of Tool, as above, which can add new menu options to any existing resource type without requiring a VR.
A Resource Helper is defined in the same way as a Tool (by setting the tool = true feature of the @CreoleResource annotation and loaded via an autoinstance definition) but must also extend the gate.gui.ResourceHelper class. A Resource Helper can then return a set of actions for a given resource which will be added to its right-click menu. See the FastInfosetExporter resource in the “Format: FastInfoset” plugin for an example of how this works.
A Resource Helper may also make new API calls accessable to allow similar functionality to be made available to GATE Embedded, see Section 7.19 for more details on how this works.