…Noam Chomsky’s answer in Secrets, Lies and Democracy (David Barsamian 1994; Odonian) to ‘What do you think about the Internet?’
‘I think that there are good things about it, but there are also aspects of it that concern and worry me. This is an intuitive response – I can’t prove it – but my feeling is that, since people aren’t Martians or robots, direct face-to-face contact is an extremely important part of human life. It helps develop self-understanding and the growth of a healthy personality.
‘You just have a different relationship to somebody when you’re looking at them than you do when you’re punching away at a keyboard and some symbols come back. I suspect that extending that form of abstract and remote relationship, instead of direct, personal contact, is going to have unpleasant effects on what people are like. It will diminish their humanity, I think.’
Chomsky, quoted at http://photo.net/wtr/dead-trees/53015.htm.
The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun, as described in section 4.2.
GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.
GATE components are one of three types:
The distinction between language resources and processing resources is explored more fully in section C.1.1. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.
In the rest of this chapter:
GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations and XML for configuration of resources (and GATE itself).
Resource implementations are grouped together as ‘plugins’, stored at a URL (when the resources are in the local file system this can be a file:/ URL). When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL and uses the contents of this file to determine what resources this plugin declares and where to find the classes that implement the resource types (typically these classes are stored in a JAR file in the plugin directory). Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.
Language resource data can be stored in binary serialised form in the local file system.
We can think of the GATE framework as a backplane into which users can plug CREOLE components. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system.
The backplane performs these functions:
A set of components plus the framework is a deployment unit which can be embedded in another application.
At their most basic, all GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions:
GATE uses Java Beans conventions to construct and configure resources at runtime, and defines interfaces that different component types must implement.
CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using GATE Developer, resources can be loaded and viewed via the resources tree (left pane) and the ‘create resource’ mechanism. When programming with GATE Embedded, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see Chapter 6) does; in this case you will use GATE Developer to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.
The various phases may be summarised as:
PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In GATE, applications are called ‘controllers’ accordingly.
Currently only sequential, or pipeline, execution is supported. There are two main types of pipeline:
Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.7.2 for how to use these. If more flexibility is required, the Groovy plugin provides a scriptable controller (see section 7.16.3) whose execution strategy is specified using the Groovy programming language.
Controllers are themselves PRs – in particular a simple pipeline is a standard PR and a corpus pipeline is a LanguageAnalyser – so one pipeline can be nested in another. This is particularly useful with conditional controllers to group together a set of PRs that can all be turned on or off as a group.
There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process, are simply ignored and the execution moves to the next document after the timeout interval has lapsed.
All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application’s execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.
Language Resources can be stored in Datastores. Datastores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:
GATE comes with various built-in components:
This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory containing an XML configuration file called creole.xml. Configuration data for the plugin’s resources can be given in the creole.xml file or directly in the Java source file using Java 5 annotations.
A creole.xml file has a root element <CREOLE-DIRECTORY>, but the further contents of this element depend on the configuration style. The following three sections discuss the different styles – all-XML, all-annotations and a mixture of the two.
To configure your resources in the creole.xml file, the <CREOLE-DIRECTORY> element should contain one <RESOURCE> element for each resource type in the plugin. The <RESOURCE> elements may optionally be contained within a <CREOLE> element (to allow a single creole.xml file to be built up by concatenating multiple separate files). For example:
Each resource must give a name, a Java class and the JAR file that it can be loaded from. The above example is taken from the Parser_Minipar plugin, and defines a single resource with a number of parameters.
The full list of valid elements under <RESOURCE> is as follows:
For visual resources, a <GUI> element should also be provided. This takes a TYPE attribute, which can have the value LARGE or SMALL. LARGE means that the visual resource is a large viewer and should appear in the main part of the GATE Developer window on the right hand side, SMALL means the VR is a small viewer which appears in the space below the resources tree in the bottom left. The <GUI> element supports the following sub-elements:
For annotation viewers, you should specify an <ANNOTATION_TYPE_DISPLAYED> element giving the annotation type that the viewer can display (e.g. Sentence).
Resources may also have parameters of various types. These resources, from the GATE distribution, illustrate the various types of parameters:
Parameters may be optional, and may have default values (and may have comments to describe their purpose, which is displayed by GATE Developer during interactive parameter setting).
Some PR parameters are execution time (RUNTIME), some are initialisation time. E.g. at execution time a doc is supplied to a language analyser; at initialisation time a grammar may be supplied to a language analyser.
The <PARAMETER> tag takes the following attributes:
It is possible for two or more parameters to be mutually exclusive (i.e. a user must specify one or the other but not both). In this case the <PARAMETER> elements should be grouped together under an <OR> element.
The type of the parameter is specified as the text of the <PARAMETER> element, and the type supplied must match the return type of the parameter’s get method. Any reference type (class, interface or enum) may be used as the parameter type, including other resource types – in this case GATE Developer will offer a list of the loaded instances of that resource as options for the parameter value. Primitive types (char, boolean, …) are not supported, instead you should use the corresponding wrapper type (java.lang.Character, java.lang.Boolean, …). If the getter returns a parameterized type (e.g. List<Integer>) you should just specify the raw type (java.util.List) here2.
The DEFAULT string is converted to the appropriate type for the parameter - java.lang.String parameters use the value directly, primitive wrapper types e.g. java.lang.Integer use their respective valueOf methods, and other built-in Java types can have defaults specified provided they have a constructor taking a String.
The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g. http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file. Thus a DEFAULT of ‘resources/main.jape’ in the file file:/opt/MyPlugin/creole.xml is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.
For Collection-valued parameters multiple values may be specified, separated by semicolons, e.g. ‘foo;bar;baz’; if the parameter’s type is an interface – Collection or one of its sub-interfaces (e.g. List) – a suitable concrete class (e.g. ArrayList, HashSet) will be chosen automatically for the default value.
For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g. ‘kind=word;orth=upperInitial’. For enum-valued parameters the default string is taken as the name of the enum constant to use. Finally, if no DEFAULT attribute is specified, the default value is null.
As an alternative to the XML configuration style, GATE provides Java 5 annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and mapped into the equivalent entries in creole.xml before parsing. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation.
For more detailed information, see the JavaDoc documentation for gate.creole.metadata.
To use annotation-driven configuration a creole.xml file is still required but it need only contain the following:
This tells GATE to load myPlugin.jar and scan its contents looking for resource classes annotated with @CreoleResource. Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".
To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:
The @CreoleResource annotation provides slots for all the values that can be specified under <RESOURCE> in creole.xml, except <CLASS> (inferred from the name of the annotated class) and <JAR> (taken to be the JAR containing the class):
For visual resources only, the following elements are also available:
For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).
Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:
GATE will infer the parameter’s name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter’s type is inferred from the type of the method parameter (java.net.URL in this case).
The annotation elements of @CreoleParameter correspond to the attributes of the <PARAMETER> tag in the XML configuration style:
Mutually-exclusive parameters (such as would be grouped in an <OR> in creole.xml) are handled by adding a disjunction="label" to the @CreoleParameter annotation – all parameters that share the same label are grouped in the same disjunction.
Optional and runtime parameters are marked using extra annotations, for example:
Unlike with pure XML configuration, when using annotations a resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class.
The resource name and the isPrivate and mainViewer flags are not inherited.
Parameter definitions are inherited in a similar way. This is one of the big advantages of annotation configuration over pure XML – if one resource class extends another then with pure XML configuration all the parent class’s parameter definitions must be duplicated in the subclass’s creole.xml definition. With annotations, parameters are inherited from the parent class (and its parent, etc.) as well as from any interfaces implemented. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.
Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:
The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.
Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:
Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource. If the subclass is not annotated then GATE assumes that all its configuration is contained in creole.xml in the usual way.
It is possible and often useful to mix and match the XML and annotation-driven configuration styles. The rule is always that anything specified in the XML takes priority over the annotations. The following examples show what this allows.
Suppose you have a plugin from some third party that uses annotation-driven configuration. You don’t have the source code but you would like to override the default value for one of the parameters of one of the plugin’s resources. You can do this in the creole.xml:
The default value for the listUrl parameter in the annotated class will be replaced by your value.
For resources like document formats, where there should always and only be one instance in GATE at any time, it makes sense to put the auto-instance definitions in the @CreoleResource annotation. But if the automatically created instances are a convenience rather than a necessity it may be better to define them in XML so other users can disable them without re-compiling the class:
If you would prefer to use XML configuration for your own resources, but would like to benefit from the parameter inheritance features of the annotation-driven approach, you can write a normal creole.xml file with all your configuration and just add a blank @CreoleResource annotation to your class. For example:
N.B. Without the @CreoleResource the parameters would not be inherited.
Visual Resources allow a developer to provide a GUI to interact with a particular resource type (PR or LR), but sometimes it is useful to provide general utilities for use in the GATE Developer GUI that are not tied to any specific resource type. Examples include the annotation diff tool and the Groovy console (provided by the Groovy plugin), both of which are self-contained tools that display in their own top-level window. To support this, the CREOLE model has the concept of a tool.
A resource type is marked as a tool by using the <TOOL/> element in its creole.xml definition, or by setting tool = true if using the @CreoleResource annotation configuration style. If a resource is declared to be a tool, and written to implement the gate.gui.ActionsPublisher interface, then whenever an instance of the resource is created its published actions will be added to the “Tools” menu in GATE Developer.
Since the published actions of every instance of the resource will be added to the tools menu, it is best not to use this mechanism on resource types that can be instantiated by the user. The “tool” marker is best used in combination with the “private” flag (to hide the resource from the list of available types in the GUI) and one or more hidden autoinstance definitions to create a limited number of instances of the resource when its defining plugin is loaded. See the GroovySupport resource in the Groovy plugin for an example of this.
If your plugin provides a number of tools (or a number of actions from the same tool) you may wish to organise your actions into one or more sub-menus, rather than placing them all on the single top-level tools menu. To do this, you need to put a special value into the actions returned by the tool’s getActions() method:
The key must be GateConstants.MENU_PATH_KEY and the value must be an array of strings. Each string in the array represents the name of one level of sub-menus. Thus in the example above the action would be placed under “Tools → Acme toolkit → Statistics”. If no MENU_PATH_KEY value is provided the action will be placed directly on the Tools menu.
1The JavaBeans spec allows is instead of get for properties of the primitive type boolean, but GATE does not support parameters with primitive types. Parameters of type java.lang.Boolean (the wrapper class) are permitted, but these have get accessors anyway.
2In this particular case, as the type is a collection, you would specify java.lang.Integer as the ITEM_CLASS_NAME.