Chapter 7
GATE Embedded [#]
7.1 Quick Start with GATE Embedded [#]
Embedding GATE-based language processing in other applications using GATE Embedded (the GATE API) is straightforward:
- add $GATE_HOME/bin/gate.jar and the JAR files in $GATE_HOME/lib to the Java CLASSPATH ($GATE_HOME is the GATE root directory)
- tell Java that the GATE Unicode Kit is an extension: -Djava.ext.dirs=$GATE_HOME/lib/ext N.B. This is only necessary for GUI applications that need to support Unicode text input; other applications such as command line or web applications don’t generally need GUK.
- initialise GATE with gate.Gate.init();
- program to the framework API.
For example, this code will create the ANNIE extraction system:
1 // initialise the GATE library
2 Gate.init();
3
4 // load ANNIE as an application from a gapp file
5 SerialAnalyserController controller = (SerialAnalyserController)
6 PersistenceManager.loadObjectFromFile(new File(new File(
7 Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),
8 ANNIEConstants.DEFAULT_FILE));
If you want to use resources from any plugins, you need to load the plugins before calling createResource:
1 Gate.init();
2
3 // need Tools plugin for the Morphological analyser
4 Gate.getCreoleRegister().registerDirectories(
5 new File(Gate.getPluginsHome(), "Tools").toURL()
6 );
7
8 ...
9
10 ProcessingResource morpher = (ProcessingResource)
11 Factory.createResource("gate.creole.morph.Morph");
Instead of creating your processing resources individually using the Factory, you can create your application in GATE Developer, save it using the ‘save application state’ option (see Section 3.8.3), and then load the saved state from your code. This will automatically reload any plugins that were loaded when the state was saved, you do not need to load them manually.
1 Gate.init();
2
3 CorpusController controller = (CorpusController)
4 PersistenceManager.loadObjectFromFile(new File("savedState.xgapp"));
5
6 // loadObjectFromUrl is also available
There are many examples of using GATE Embedded available at http://gate.ac.uk/wiki/code-repository/.
7.2 Resource Management in GATE Embedded [#]
As outlined earlier, GATE defines three different types of resources:
- Language Resources
- : (LRs) entities that hold linguistic data.
- Processing Resources
- : (PRs) entities that process data.
- Visual Resources
- : (VRs) components used for building graphical interfaces.
These resources are collectively named CREOLE1 resources.
All CREOLE resources have some associated meta-data in the form of an entry in a special XML file named creole.xml. The most important role of that meta-data is to specify the set of parameters that a resource understands, which of them are required and which not, if they have default values and what those are. The valid parameters for a resource are described in the resource’s section of its creole.xml file or in Java annotations on the resource class – see Section 4.7.
All resource types have creation-time parameters that are used during the initialisation phase. Processing Resources also have run-time parameters that get used during execution (see Section 7.5 for more details).
Controllers are used to define GATE applications and have the role of controlling the execution flow (see Section 7.6 for more details).
This section describes how to create and delete CREOLE resources as objects in a running Java virtual machine. This process involves using GATE’s Factory class2, and, in the case of LRs, may also involve using a DataStore.
CREOLE resources are Java Beans; creation of a resource object involves using a default constructor, then setting parameters on the bean, then calling an init() method. The Factory takes care of all this, makes sure that the GATE Developer GUI is told about what is happening (when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. A programmer using GATE Embedded should never call the constructor of a resource: always use the Factory!
Creating a resource involves providing the following information:
- fully qualified class name for the resource. This is the only required value. For all the rest, defaults will be used if actual values are not provided.
- values for the creation time parameters.†
- initial values for resource features.† For an explanation on features see Section 7.4.2.
- a name for the new resource;
† Parameters and features need to be provided in the form of a GATE Feature Map which is essentially a java Map (java.util.Map) implementation, see Section 7.4.2 for more details on Feature Maps.
Creating a resource via the Factory involves passing values for any create-time parameters that require setting to the Factory’s createResource method. If no parameters are passed, the defaults are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:
1Gate.getCreoleRegister().registerDirectories(new File(
2 Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());
3FeatureMap params = Factory.newFeatureMap(); //empty map:default params
4ProcessingResource tagger = (ProcessingResource)
5 Factory.createResource("gate.creole.POSTagger", params);
Note that if the resource created here had any parameters that were both mandatory and had no default value, the createResource call would throw an exception. In this case, all the information needed to create a tagger is available in default values given in the tagger’s XML definition (in plugins/ANNIE/creole.xml):
<RESOURCE>
<NAME>ANNIE POS Tagger</NAME> <COMMENT>Mark Hepple’s Brill-style POS tagger</COMMENT> <CLASS>gate.creole.POSTagger</CLASS> <PARAMETER NAME="document" COMMENT="The document to be processed" RUNTIME="true">gate.Document</PARAMETER> .... <PARAMETER NAME="rulesURL" DEFAULT="resources/heptag/ruleset" COMMENT="The URL for the ruleset file" OPTIONAL="true">java.net.URL</PARAMETER> </RESOURCE> |
Here the two parameters shown are either ‘runtime’ parameters, which are set before a PR is executed, or have a default value (in this case the default rules file is distributed with GATE itself).
When creating a Document, however, the URL of the source for the document must be provided3. For example:
1URL u = new URL("http://gate.ac.uk/hamish/");
2FeatureMap params = Factory.newFeatureMap();
3params.put("sourceUrl", u);
4Document doc = (Document)
5 Factory.createResource("gate.corpora.DocumentImpl", params);
Note that the document created here is transient: when you quit the JVM the document will no longer exist. If you want the document to be persistent, you need to store it in a DataStore (see Section 7.4.5).
Apart from createResource() methods with different signatures, Factory also provides some shortcuts for common operations, listed in table 7.1.
|
GATE maintains various data structures that allow the retrieval of loaded resources. When a resource is no longer required, it needs to be removed from those structures in order to remove all references to it, thus making it a candidate for garbage collection. This is achieved using the deleteResource(Resource res) method on Factory.
Simply removing all references to a resource from the user code will NOT be enough to make the resource collect-able. Not calling Factory.deleteResource() will lead to memory leaks!
7.3 Using CREOLE Plugins [#]
As shown in the examples above, in order to use a CREOLE resource the relevant CREOLE plugin must be loaded. Processing Resources, Visual Resources and Language Resources other than Document, Corpus and DataStore all require that the appropriate plugin is first loaded. When using Document, Corpus or DataStore, you do not need to first load a plugin. The following API calls listed in table 7.2 are relevant to working with CREOLE plugins.
|
7.4 Language Resources [#]
This section describes the implementation of documents and corpora in GATE.
7.4.1 GATE Documents
Documents are modelled as content plus annotations (see Section 7.4.4) plus features (see Section 7.4.2).
The content of a document can be any implementation of the gate.DocumentContent interface; the features are <attribute, value> pairs stored a Feature Map. Attributes are String values while the values can be any Java object.
The annotations are grouped in sets (see section 7.4.3). A document has a default (anonymous) annotations set and any number of named annotations sets.
Documents are defined by the gate.Document interface and there is also a provided implementation:
- gate.corpora.DocumentImpl
- : transient document. Can be stored persistently through Java serialisation.
Main Document functions are presented in table 7.3.
|
7.4.2 Feature Maps [#]
All CREOLE resources as well as the Controllers and the annotations can have attached meta-data in the form of Feature Maps.
A Feature Map is a Java Map (i.e. it implements the java.util.Map interface) and holds <attribute-name, attribute-value> pairs. The attribute names are Strings while the values can be any Java Objects.
The use of non-Serialisable objects as values is strongly discouraged.
Feature Maps are created using the gate.Factory.newFeatureMap() method.
The actual implementation for FeatureMaps is provided by the gate.util.SimpleFeatureMapImpl class.
Objects that have features in GATE implement the gate.util.FeatureBearer interface which has only the two accessor methods for the object features: FeatureMap getFeatures() and void setFeatures(FeatureMap features).
etting a particular feature from an object |
7.4.3 Annotation Sets [#]
A GATE document can have one or more annotation layers — an anonymous one, (also called default), and as many named ones as necessary.
An annotation layer is organised as a Directed Acyclic Graph (DAG) on which the nodes are particular locations —anchors— in the document content and the arcs are made out of annotations reaching from the location indicated by the start node to the one pointed by the end node (see Figure 7.1 for an illustration). Because of the graph metaphor, the annotation layers are also called annotation graphs. In terms of Java objects, the annotation layers are represented using the Set paradigm as defined by the collections library and they are hence named annotation sets. The terms of annotation layer, graph and set are interchangeable and refer to the same concept when used in this book.
An annotation set holds a number of annotations and maintains a series of indices in order to provide fast access to the contained annotations.
The GATE Annotation Sets are defined by the gate.AnnotationSet interface and there is a default implementation provided:
- gate.annotation.AnnotationSetImpl
- annotation set implementation used by transient documents.
The annotation sets are created by the document as required. The first time a particular annotation set is requested from a document it will be transparently created if it doesn’t exist.
Tables 7.4 and 7.5 list the most used Annotation Set functions.
|
|
7.4.4 Annotations [#]
An annotation, is a form of meta-data attached to a particular section of document content. The connection between the annotation and the content it refers to is made by means of two pointers that represent the start and end locations of the covered content. An annotation must also have a type (or a name) which is used to create classes of similar annotations, usually linked together by their semantics.
An Annotation is defined by:
- start node
- a location in the document content defined by an offset.
- end node
- a location in the document content defined by an offset.
- type
- a String value.
- features
- (see Section 7.4.2).
- ID
- an Integer value. All annotations IDs are unique inside an annotation set.
In GATE Embedded, annotations are defined by the gate.Annotation interface and implemented by the gate.annotation.AnnotationImpl class. Annotations exist only as members of annotation sets (see Section 7.4.3) and they should not be directly created by means of a constructor. Their creation should always be delegated to the containing annotation set.
7.4.5 GATE Corpora [#]
A corpus in GATE is a Java List (i.e. an implementation of java.util.List) of documents. GATE corpora are defined by the gate.Corpus interface and the following implementations are available:
- gate.corpora.CorpusImpl
- used for transient corpora.
- gate.corpora.SerialCorpusImpl
- used for persistent corpora that are stored in a serial datastore (i.e. as a directory in a file system).
Apart from implementation for the standard List methods, a Corpus also implements the methods in table 7.6.
|
Creating a corpus from all XML files in a directory |
Using a DataStore
Assuming that you have a DataStore already open called myDataStore, this code will ask the datastore to take over persistence of your document, and to synchronise the memory representation of the document with the disk storage:
Document persistentDoc = myDataStore.adopt(doc, mySecurity);
myDataStore.sync(persistentDoc); |
When you want to restore a document (or other LR) from a datastore, you make the same createResource call to the Factory as for the creation of a transient resource, but this time you tell it the datastore the resource came from, and the ID of the resource in that datastore:
1 URL u = ....; // URL of a serial datastore directory
2 SerialDataStore sds = new SerialDataStore(u.toString());
3 sds.open();
4
5 // getLrIds returns a list of LR Ids, so we get the first one
6 Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0);
7
8 // we need to tell the factory about the LR’s ID in the data
9 // store, and about which datastore it is in - we do this
10 // via a feature map:
11 FeatureMap features = Factory.newFeatureMap();
12 features.put(DataStore.LR_ID_FEATURE_NAME, lrId);
13 features.put(DataStore.DATASTORE_FEATURE_NAME, sds);
14
15 // read the document back
16 Document doc = (Document)
17 Factory.createResource("gate.corpora.DocumentImpl", features);
7.5 Processing Resources [#]
Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers.
They are created using the GATE Factory in manner similar the Language Resources. Besides the creation-time parameters they also have a set of run-time parameters that are set by the system just before executing them.
Analysers are a particular type of processing resources in the sense that they always have a document and a corpus among their run-time parameters.
The most used methods for Processing Resources are presented in table 7.7
|
7.6 Controllers [#]
Controllers are used to create GATE applications. A Controller handles a set of Processing Resources and can execute them following a particular strategy. GATE provides a series of serial controllers (i.e. controllers that run their PRs in sequence):
- gate.creole.SerialController:
- a serial controller that takes any kind of PRs.
- gate.creole.SerialAnalyserController:
- a serial controller that only accepts Language Analysers as member PRs.
- gate.creole.ConditionalSerialController:
- a serial controller that accepts all types of PRs and that allows the inclusion or exclusion of member PRs from the execution chain according to certain run-time conditions (currently features on the document being processed are used).
- gate.creole.ConditionalSerialAnalyserController:
- a serial controller that only accepts Language Analysers and that allows the conditional run of member PRs.
7.7 Duplicating a Resource [#]
Sometimes, particularly in a multi-threaded application, it is useful to be able to create an independent copy of an existing PR, controller or LR. The obvious way to do this is to call createResource again, passing the same class name, parameters, features and name, and for many resources this will do the right thing. However there are some resources for which this may be insufficient (e.g. controllers, which also need to duplicate their PRs), unsafe (if a PR uses temporary files, for instance), or simply inefficient. For example for a large gazetteer this would involve loading a second copy of the lists into memory and compiling them into a second identical state machine representation, but a much more efficient way to achieve the same behaviour would be to use a SharedDefaultGazetteer (see section 13.11), which can re-use the existing state machine.
The GATE Factory provides a duplicate method which takes an existing resource instance and creates and returns an independent copy of the resource. By default it uses the algorithm described above, extracting the parameter values from the template resource and calling createResource to create a duplicate. However, if a particular resource type knows of a better way to duplicate itself it can implement the CustomDuplication interface, and provide its own duplicate method which the factory will use instead of performing the default duplication algorithm. A caller who needs to duplicate an existing resource can simply call Factory.duplicate to obtain a copy, which will be constructed in the appropriate way depending on the resource type.
Note that the duplicate object returned by Factory.duplicate will not necessarily be of the same class as the original object. However the contract of Factory.duplicate specifies that where the original object implements any of a list of core GATE interfaces, the duplicate can be assumed to implement the same ones – if you duplicate a DefaultGazetteer the result may not be an instance of DefaultGazetteer but it is guaranteed to implement the Gazetteer interface.
Full details of how to implement a custom duplicate method in your own resource type can be found in the JavaDoc documentation for the CustomDuplication interface and the Factory.duplicate method.
7.8 Persistent Applications [#]
GATE Embedded allows the persistent storage of applications in a format based on XML serialisation. This is particularly useful for applications management and distribution. A developer can save the state of an application when he/she stops working on its design and continue developing it in a next session. When the application reaches maturity it can be deployed to the client site using the same method.
When an application (i.e. a Controller) is saved, GATE will actually only save the values for the parameters used to create the Processing Resources that are contained in the application. When the application is reloaded, all the PRs will be re-created using the saved parameters.
Many PRs use external resources (files) to define their behaviour and, in most cases, these files are identified using URLs. During the saving process, all the URLs are converted relative URLs based on the location of the application file. This way, if the resources are packaged together with the application file, the entire application can be reliably moved to a different location.
API access to application saving and loading is provided by means of two static methods on the gate.util.persistence.PersistenceManager class, listed in table 7.8.
|
7.9 Ontologies
Starting from GATE version 3.1, support for ontologies has been added. Ontologies are nominally Language Resources but are quite different from documents and corpora and are detailed in chapter 14.
Classes related to ontologies are to be found in the gate.creole.ontology package and its sub-packages. The top level package defines an abstract API for working with ontologies while the sub-packages contain concrete implementations. A client program should only use the classes and methods defined in the API and never any of the classes or methods from the implementation packages.
The entry point to the ontology API is the gate.creole.ontology.Ontology interface which is the base interface for all concrete implementations. It provides methods for accessing the class hierarchy, listing the instances and the properties.
Ontology implementations are available through plugins. Before an ontology language resource can be created using the gate.Factory and before any of the classes and methods in the API can be used, one of the implementing ontology plugins must be loaded. For details see chapter 14.
7.10 Creating a New Annotation Schema [#]
An annotation schema (see Section 3.4.6) can be brought inside GATE through the creole.xml file. By using the AUTOINSTANCE element, one can create instances of resources defined in creole.xml. The gate.creole.AnnotationSchema (which is the Java representation of an annotation schema file) initializes with some predefined annotation definitions (annotation schemas) as specified by the GATE team.
Example from GATE’s internal creole.xml (in src/gate/resources/creole):
<!-- Annotation schema -->
<RESOURCE> <NAME>Annotation schema</NAME> <CLASS>gate.creole.AnnotationSchema</CLASS> <COMMENT>An annotation type and its features</COMMENT> <PARAMETER NAME="xmlFileUrl" COMMENT="The url to the definition file" SUFFIXES="xml;xsd">java.net.URL</PARAMETER> <AUTOINSTANCE> <PARAM NAME ="xmlFileUrl" VALUE="schema/AddressSchema.xml" /> </AUTOINSTANCE> <AUTOINSTANCE> <PARAM NAME ="xmlFileUrl" VALUE="schema/DateSchema.xml" /> </AUTOINSTANCE> <AUTOINSTANCE> <PARAM NAME ="xmlFileUrl" VALUE="schema/FacilitySchema.xml" /> </AUTOINSTANCE> <!-- etc. --> </RESOURCE> |
In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one must use the gate.Factory class;
1FeatureMap params = new FeatureMap();\\
2param.put("xmlFileUrl",annotSchemaFile.toURL());\\
3AnnotationSchema annotSchema = \\
4Factory.createResurce("gate.creole.AnnotationSchema", params);
Note: All the elements and their values must be written in lower case, as XML is defined as case sensitive and the parser used for XML Schema inside GATE searches is case sensitive.
In order to be able to write XML Schema definitions, the ones defined in GATE (resources/creole/schema) can be used as a model, or the user can have a look at http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements used.
Some examples of annotation schemas are given in Section 5.4.1.
7.11 Creating a New CREOLE Resource [#]
To create a new resource you need to:
- write a Java class that implements GATE’s beans model;
- compile the class, and any others that it uses, into a Java Archive (JAR) file;
- write some XML configuration data for the new resource;
- tell GATE the URL of the new JAR and XML files.
GATE Developer helps you with this process by creating a set of directories and files that implement a basic resource, including a Java code file and a Makefile. This process is called ‘bootstrapping’.
For example, let’s create a new component called GoldFish, which will be a Processing Resource that looks for all instances of the word ‘fish’ in a document and adds an annotation of type ‘GoldFish’.
First start GATE Developer (see Section 2.2). From the ‘Tools’
menu select ‘BootStrap Wizard’, which will pop up the dialogue in figure 7.2. The meaning of the data entry fields:
- The ‘resource name’ will be displayed when GATE Developer loads the resource, and will be the name of the directory the resource lives in. For our example: GoldFish.
- ‘Resource package’ is the Java package that the class representing the resource will be created in. For our example: sheffield.creole.example.
- ‘Resource type’ must be one of Language, Processing or Visual Resource. In this case we’re going to process documents (and add annotations to them), so we select ProcessingResource.
- ‘Implementing class name’ is the name of the Java class that represents the resource. For our example: GoldFish.
- The ‘interfaces implemented’ field allows you to add other interfaces (e.g. gate.creole.ControllerAwarePR4) that you would like your new resource to implement. In this case we just leave the default (which is to implement the gate.ProcessingResource interface).
- The last field selects the directory that you want the new resource created in. For our example: z:/tmp.
Now we need to compile the class and package it into a JAR file. The bootstrap wizard creates an Ant build file that makes this very easy – so long as you have Ant set up properly, you can simply run
ant jar
|
This will compile the Java source code and package the resulting classes into GoldFish.jar. If you don’t have your own copy of Ant, you can use the one bundled with GATE - suppose your GATE is installed at /opt/gate-5.0-snapshot, then you can use /opt/gate-5.0-snapshot/bin/ant jar to build.
You can now load this resource into GATE; see Section 3.6. The default Java code that was created for our GoldFish resource looks like this:
1/*
2 * GoldFish.java
3 *
4 * You should probably put a copyright notice here. Why not use the
5 * GNU licence? (See http://www.gnu.org/.)
6 *
7 * hamish, 26/9/2001
8 *
9 * $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $
10 */
11
12package sheffield.creole.example;
13
14import java.util.*;
15import gate.*;
16import gate.creole.*;
17import gate.util.*;
18
19/**
20 * This class is the implementation of the resource GOLDFISH.
21 */
22@CreoleResource(name = "GoldFish",
23 comment = "Add a descriptive comment about this resource")
24public class GoldFish extends AbstractProcessingResource
25 implements ProcessingResource {
26
27
28} // class GoldFish
The default XML configuration for GoldFish looks like this:
<!-- creole.xml GoldFish -->
<!-- hamish, 26/9/2001 --> <!-- $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $ --> <CREOLE-DIRECTORY> <JAR SCAN="true">GoldFish.jar</JAR> </CREOLE-DIRECTORY> |
The directory structure containing these files
is shown in figure 7.3. GoldFish.java lives in the src/sheffield/creole/example directory. creole.xml and build.xml are in the top GoldFish directory. The lib directory is for libraries; the classes directory is where Java class files are placed; the doc directory is for documentation. These last two, plus GoldFish.jar are created by Ant.
This process has the advantage that it creates a complete source tree and build structure for the component, and the disadvantage that it creates a complete source tree and build structure for the component. If you already have a source tree, you will need to chop out the bits you need from the new tree (in this case GoldFish.java and creole.xml) and copy it into your existing one.
See the example code at http://gate.ac.uk/wiki/code-repository/.
7.12 Adding Support for a New Document Format [#]
In order to add a new document format, one needs to extend the gate.DocumentFormat class and to implement an abstract method called:
This method is supposed to implement the functionality of each format reader and to create annotations on the document. Finally the document’s old content will be replaced with a new one containing only the text between markups.
If one needs to add a new textual reader will extend the gate.corpora.TextualDocumentFormat and override the unpackMarkup(doc) method.
This class needs to be implemented under the Java bean specifications because it will be instantiated by GATE using Factory.createResource() method.
The init() method that one needs to add and implement is very important because in here the reader defines its means to be selected successfully by GATE. What one needs to do is to add some specific information into certain static maps defined in DocumentFormat class, that will be used at reader detection time.
After that, a definition of the reader will be placed into the one’s creole.xml file and the reader will be available to GATE.
We present for the rest of the section a complete three step example of adding such a reader. The reader we describe in here is an XML reader.
Step 1
Create a new class called XmlDocumentFormat that extends
gate.corpora.TextualDocumentFormat.
Step 2
Implement the unpackMarkup(Document doc) which performs the required functionality for the reader. Add XML detection means in init() method:
1public Resource init() throws ResourceInstantiationException{
2 // Register XML mime type
3 MimeType mime = new MimeType("text","xml");
4 // Register the class handler for this mime type
5 mimeString2ClassHandlerMap.put(mime.getType()+ "/" + mime.getSubtype(),
6 this);
7 // Register the mime type with mine string
8 mimeString2mimeTypeMap.put(mime.getType() + "/" + mime.getSubtype(),
9 mime);
10 // Register file suffixes for this mime type
11 suffixes2mimeTypeMap.put("xml",mime);
12 suffixes2mimeTypeMap.put("xhtm",mime);
13 suffixes2mimeTypeMap.put("xhtml",mime);
14 // Register magic numbers for this mime type
15 magic2mimeTypeMap.put("<?xml",mime);
16 // Set the mimeType for this language resource
17 setMimeType(mime);
18 return this;
19}// init()
More details about the information from those maps can be found in Section 5.5.1
Step 3
Add the following creole definition in the creole.xml document.
<RESOURCE>
<NAME>My XML Document Format</NAME> <CLASS>mypackage.XmlDocumentFormat</CLASS> <AUTOINSTANCE/> <PRIVATE/> </RESOURCE> |
More information on the operation of GATE’s document format analysers may be found in Section 5.5.
7.13 Using GATE Embedded in a Multithreaded Environment [#]
GATE Embedded can be used in multithreaded applications, so long as you observe a few restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your application, typically in the application startup phase before any concurrent processing threads are started.
Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading plugins) in more than one thread at a time. Again, you would typically load all the plugins your application requires at initialisation time. It is safe to create instances of resources in multiple threads concurrently.
Thirdly, it is important to note that individual GATE processing resources, language resources and controllers are by design not thread safe – it is not possible to use a single instance of a controller/PR/LR in multiple threads at the same time – but for a well written resource it should be possible to use several different instances of the same resource at once, each in a different thread. When writing your own resource classes you should bear the following in mind, to ensure that your resource will be useable in this way.
- Avoid static data. Where possible, you should avoid using static fields in your class, and you should try and take all configuration data via the CREOLE parameters you declare in your creole.xml file. System properties may be appropriate for truly static configuration, such as the location of an external executable, but even then it is generally better to stick to CREOLE parameters – a user may wish to use two different instances of your PR, each talking to a different executable.
- Read parameters at the correct time. Init-time parameters should be read in the init() (and reInit()) method, and for processing resources runtime parameters should be read at each execute().
- Use temporary files correctly. If your resource makes use of external temporary files you should create them using File.createTempFile() at init or execute time, as appropriate. Do not use hardcoded file names for temporary files.
- If there are objects that can be shared between different instances of your resource, make sure these objects are accessed either read-only, or in a thread-safe way. In particular you must be very careful if your resource can take other resource instances as init or runtime parameters (e.g. the Flexible Gazetteer, Section 13.7).
Of course, if you are writing a PR that is simply a wrapper around an external library that imposes these kinds of limitations there is only so much you can do. If your resource cannot be made safe you should document this fact clearly.
All the standard ANNIE PRs are safe when independent instances are used in different threads concurrently, as are the standard transient document, transient corpus and controller classes. A typical pattern of development for a multithreaded GATE-based application is:
- Develop your GATE processing pipeline in GATE Developer.
- Save your pipeline as a .gapp file.
- In your application’s initialisation phase, load n copies of the pipeline using PersistenceManager.loadObjectFromFile() (see the Javadoc documentation for details), or load the pipeline once and then make copies of it using Factory.duplicate as described in section 7.7, and either give one copy to each thread or store them in a pool (e.g. a LinkedList).
- When you need to process a text, get one copy of the pipeline from the pool, and return it to the pool when you have finished processing.
Alternatively you can use the Spring Framework as described in the next section to handle the pooling for you.
7.14 Using GATE Embedded within a Spring Application [#]
GATE Embedded provides helper classes to allow GATE resources to be created and managed by the Spring framework. For Spring 2.0 or later, GATE Embedded provides a custom namespace handler that makes them extremely easy to use. To use this namespace, put the following declarations in your bean definition file:
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:gate="http://gate.ac.uk/ns/spring" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://gate.ac.uk/ns/spring http://gate.ac.uk/ns/spring.xsd"> |
You can have Spring initialise GATE:
<gate:init gate-home="WEB-INF" user-config-file="WEB-INF/user.xml">
<gate:preload-plugins> <value>WEB-INF/ANNIE</value> <value>http://example.org/gate-plugin</value> </gate:preload-plugins> </gate:init> |
To create a GATE resource, use the <gate:resource> element.
<gate:resource id="sharedOntology" scope="singleton"
resource-class="gate.creole.ontology.owlim.OWLIMOntologyLR"> <gate:parameters> <entry key="rdfXmlURL"> <value type="org.springframework.core.io.Resource" >WEB-INF/ontology.rdf</value> </entry> </gate:parameters> </gate:resource> |
If you are familiar with Spring you will see that <gate:parameters> uses the same format as the standard <map> element, but values whose type is a Spring Resource will be converted to URLs before being passed to the GATE resource.
You can load a GATE saved application with
<gate:saved-application location="WEB-INF/application.gapp" scope="prototype">
<gate:customisers> <gate:set-parameter pr-name="custom transducer" name="ontology" ref="sharedOntology" /> </gate:customisers> </gate:saved-application> |
‘Customisers’ are used to customise the application after it is loaded. In the example above, we load a singleton copy of an ontology which is then shared between all the separate instances of the (prototype) application. The <gate:set-parameter> customiser accepts all the same ways to provide a value as the standard Spring <property> element (a ”value” or ”ref” attribute, or a sub-element - <value>, <list>, <bean>, <gate:resource> …).
The <gate:add-pr> customiser provides support for the case where most of the application is in a saved state, but we want to create one or two extra PRs with Spring (maybe to inject other Spring beans as init parameters) and add them to the pipeline.
<gate:saved-application ...>
<gate:customisers> <gate:add-pr add-before="OrthoMatcher" ref="myPr" /> </gate:customisers> </gate:saved-application> |
By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline, but an add-before or add-after attribute can be used to specify the name of a PR before (or after) which this PR should be placed. Alternatively, an index attribute places the PR at a specific (0-based) index into the pipeline. The PR to add can be specified either as a ‘ref’ attribute, or with a nested <bean> or <gate:resource> element.
7.14.1 Duplication in Spring [#]
The above example defines the <gate:application> as a prototype-scoped bean, which means the saved application state will be loaded afresh each time the bean is fetched from the bean factory (either explicitly using getBean or implicitly when it is injected as a dependency of another bean). However in many cases it is better to load the application once and then duplicate it as required (as described in section 7.7), as this allows resources to optimise their memory usage, for example by sharing a single in-memory representation of a large gazetteer list between several instances of the gazetteer PR. This approach is supported by the <gate:duplicate> tag.
<gate:duplicate id="theApp">
<gate:saved-application location="/WEB-INF/application.xgapp" /> </gate:duplicate> |
The <gate:duplicate> tag acts like a prototype bean definition, in that each time it is fetched or injected it will call Factory.duplicate to create a new duplicate of its template resource (declared as a nested element or referenced by the template-ref attribute). However the tag also keeps track of all the duplicate instances it has returned over its lifetime, and will ensure they are released (using Factory.deleteResource) when the Spring context is shut down.
The <gate:duplicate> tag also supports customisers, which will be applied to the newly-created duplicate resource before it is returned. This is subtly different from applying the customisers to the template resource itself, which would cause them to be applied once to the original resource before it is first duplicated.
Finally, <gate:duplicate> takes an optional boolean attribute return-template. If set to false (or omitted, as this is the default behaviour), the tag always returns a duplicate — the original template resource is used only as a template and is not made available for use. If set to true, the first time the bean defined by the tag is injected or fetched, the original template resource is returned. Subsequent uses of the tag will return duplicates. Generally speaking, it is only safe to set return-template="true" when there are no customisers, and when the duplicates will all be created up-front before any of them are used. If the duplicates will be created asynchronously (e.g. with a dynamically expanding pool, see below) then it is possible that, for example, a template application may be duplicated in one thread whilst it is being executed by another thread, which may lead to unpredictable behaviour.
7.14.2 Spring pooling [#]
In a multithreaded application it is vital that individual GATE resources are not used in more than one thread at the same time. Because of this, multithreaded applications that use GATE Embedded often need to use some form of pooling to provided thread-safe access to GATE components. This can be managed by hand, but the Spring framework has built-in tools to support transparent pooling of Spring-managed beans. Spring can create a pool of identical objects, then expose a single “proxy” object (offering the same interface) for use by clients. Each method call on the proxy object will be routed to an available member of the pool in such a way as to guarantee that each member of the pool is accessed by no more than one thread at a time.
Since the pooling is handled at the level of method calls, this approach is not used to create a pool of GATE resources directly — making use of a GATE PR typically involves a sequence of method calls (at least setDocument(doc), execute() and setDocument(null)), and creating a pooling proxy for the resource may result in these calls going to different members of the pool. Instead the typical use of this technique is to define a helper object with a single method that internally calls the GATE API methods in the correct sequence, and then create a pool of these helpers. The interface gate.util.DocumentProcessor and its associated implementation gate.util.LanguageAnalyserDocumentProcessor are useful for this. The DocumentProcessor interface defines a processDocument method that takes a GATE document and performs some processing on it. LanguageAnalyserDocumentProcessor implements this interface using a GATE LanguageAnalyser (such as a saved “corpus pipeline” application) to do the processing. A pool of LanguageAnalyserDocumentProcessor instances can be exposed through a proxy which can then be called from several threads.
The machinery to implement this is all built into Spring, but the configuration typically required to enable it is quite fiddly, involving at least three co-operating bean definitions. Since the technique is so useful with GATE Embedded, GATE provides a special syntax to configure pooling in a simple way. Given the <gate:duplicate id="theApp"> definition from the previous section we can create a DocumentProcessor proxy that can handle up to five concurrent requests as follows:
<bean id="processor"
class="gate.util.LanguageAnalyserDocumentProcessor"> <property name="analyser" ref="theApp" /> <gate:pooled-proxy max-size="5" /> </bean> |
The <gate:pooled-proxy> element decorates a singleton bean definition. It converts the original definition to prototype scope and replaces it with a singleton proxy delegating to a pool of instances of the prototype bean. The pool parameters are controlled by attributes of the <gate:pooled-proxy> element, the most important ones being:
- max-size
- The maximum size of the pool. If more than this number of threads try to call methods on the proxy at the same time, the others will (by default) block until an object is returned to the pool.
- initial-size
- The default behaviour of Spring’s pooling tools is to create instances in the pool on demand (up to the max-size). This attribute instead causes initial-size instances to be created up-front and added to the pool when it is first created.
- when-exhausted-action-name
- What to do when the pool is exhausted (i.e. there are already max-size concurrent calls in progress and another one arrives). Should be set to one of WHEN_EXHAUSTED_BLOCK (the default, meaning block the excess requests until an object becomes free), WHEN_EXHAUSTED_GROW (create a new object anyway, even though this pushes the pool beyond max-size) or WHEN_EXHAUSTED_FAIL (cause the excess calls to fail with an exception).
Many more options are available, corresponding to the properties of the Spring CommonsPoolTargetSource class. These allow you, for example, to configure a pool that dynamically grows and shrinks as necessary, releasing objects that have been idle for a set amount of time. See the JavaDoc documentation of CommonsPoolTargetSource (and the documentation for Apache commons-pool) for full details.
Note that the <gate:pooled-proxy> technique is not tied to GATE in any way, it is simply an easy way to configure standard Spring beans and can be used with any bean that needs to be pooled, not just objects that make use of GATE.
7.14.3 Further reading [#]
These custom elements all define various factory beans. For full details, see the JavaDocs for gate.util.spring (the factory beans) and gate.util.spring.xml (the gate: namespace handler). The main Spring framework API documentation is the best place to look for more detail on the pooling facilities provided by Spring AOP.
Note: the former approach using factory methods of the gate.util.spring.SpringFactory class will still work, but should be considered deprecated in favour of the new factory beans.
7.15 Using GATE Embedded within a Tomcat Web Application [#]
Embedding GATE in a Tomcat web application involves several steps.
- Put the necessary JAR files (gate.jar and all or most of the jars in gate/lib) in your webapp/WEB-INF/lib.
- Put the plugins that your application depends on in a suitable location (e.g. webapp/WEB-INF/plugins).
- Create suitable gate.xml configuration files for your environment.
- Set the appropriate paths in your application before calling Gate.init().
This process is detailed in the following sections.
7.15.1 Recommended Directory Structure
You will need to create a number of other files in your web application to allow GATE to work:
- Site and user gate.xml config files - we highly recommend defining these specifically for the web application, rather than relying on the default files on your application server.
- The plugins your application requires.
In this guide, we assume the following layout:
webapp/
WEB-INF/ gate.xml user-gate.xml plugins/ ANNIE/ etc. |
7.15.2 Configuration Files
Your gate.xml (the ‘site-wide configuration file’) should be as simple as possible:
<?xml version="1.0" encoding="UTF-8" ?>
<GATE> <GATECONFIG Save_options_on_exit="false" Save_session_on_exit="false" /> </GATE> |
Similarly, keep the user-gate.xml (the ‘user config file’) simple:
<?xml version="1.0" encoding="UTF-8" ?>
<GATE> <GATECONFIG Known_plugin_path=";" Load_plugin_path=";" /> </GATE> |
This way, you can control exactly which plugins are loaded in your webapp code.
7.15.3 Initialization Code
Given the directory structure shown above, you can initialize GATE in your web application like this:
1// imports
2...
3public class MyServlet extends HttpServlet {
4 private static boolean gateInited = false;
5
6 public void init() throws ServletException {
7 if(!gateInited) {
8 try {
9 ServletContext ctx = getServletContext();
10
11 // use /path/to/your/webapp/WEB-INF as gate.home
12 File gateHome = new File(ctx.getRealPath("/WEB-INF"));
13
14 Gate.setGateHome(gateHome);
15 // thus webapp/WEB-INF/plugins is the plugins directory, and
16 // webapp/WEB-INF/gate.xml is the site config file.
17
18 // Use webapp/WEB-INF/user-gate.xml as the user config file,
19 // to avoid confusion with your own user config.
20 Gate.setUserConfigFile(new File(gateHome, "user-gate.xml"));
21
22 Gate.init();
23 // load plugins, for example...
24 Gate.getCreoleRegister().registerDirectories(
25 ctx.getResource("/WEB-INF/plugins/ANNIE"));
26
27 gateInited = true;
28 }
29 catch(Exception ex) {
30 throw new ServletException("Exception initialising GATE",
31 ex);
32 }
33 }
34 }
35}
Once initialized, you can create GATE resources using the Factory in the usual way (for example, see Section 7.1 for an example of how to create an ANNIE application). You should also read Section 7.13 for important notes on using GATE Embedded in a multithreaded application.
Instead of an initialization servlet you could also consider doing your initialization in a ServletContextListener, or using Spring (see Section 7.14).
7.16 Groovy for GATE [#]
Groovy is a dynamic programming language based on Java. Groovy is not used in the core GATE distribution, so to enable the Groovy features in GATE you must first load the Groovy plugin. Loading this plugin:
- provides access to the Groovy scripting console (configured with some extensions for GATE) from the GATE Developer “Tools” menu.
- provides a PR to run a Groovy script over documents.
- enhances a number of core GATE classes with additional convenience methods that can be used from any Groovy code including the console, the script PR, and any Groovy class that uses the GATE Embedded API.
This section describes these features in detail, but assumes that the reader already has some knowledge of the Groovy language. If you are not already familiar with Groovy you should read this section in conjunction with Groovy’s own documentation at http://groovy.codehaus.org/.
7.16.1 Groovy Scripting Console for GATE [#]
Loading the Groovy plugin in GATE Developer will provide a “Groovy Console” item in the Tools/Groovy Tools menu. This menu item opens the standard Groovy console window (http://groovy.codehaus.org/Groovy+Console).
To help scripting GATE in Groovy, the console is pre-configured to import all classes from the gate, gate.annotation, gate.util, gate.jape and gate.creole.ontology packages of the core GATE API5. This means you can refer to classes and interfaces such as Factory, AnnotationSet, Gate, etc. without needing to prefix them with a package name. In addition, the following (read-only) variable bindings are pre-defined in the Groovy Console.
- corpora: a list of loaded corpora LRs (Corpus)
- docs: a list of all loaded document LRs (DocumentImpl)
- prs: a list of all loaded PRs
- apps: a list of all loaded Applications (AbstractController)
These variables are automatically updated as resources are created and deleted in GATE.
Here’s an example script. It finds all documents with a feature “annotator” set to “fred”, and puts them in a new corpus called “fredsDocs”.
You can find other examples (and add your own) in the Groovy script repository on the GATE Wiki: http://gate.ac.uk/wiki/groovy-recipes/.
Why won’t the ‘Groovy executing’ dialog go away? Sometimes, when you execute a Groovy script through the console, a dialog will appear, saying “Groovy is executing. Please wait”. The dialog fails to go away even when the script has ended, and cannot be closed by clicking the “Interrupt” button. You can, however, continue to use the Groovy Console, and the dialog will usually go away next time you run a script. This is not a GATE problem: it is a Groovy problem.
7.16.2 Groovy scripting PR [#]
The Groovy scripting PR enables you to load and execute Groovy scripts as part of a GATE application pipeline. The Groovy scripting PR is made available when you load the Groovy plugin via the plugin manager.
Parameters [#]
The Groovy scripting PR has a single initialisation parameter
- scriptURL: the path to a valid Groovy script
It has three runtime parameters
- inputASName: an optional annotation set intended to be used as input by the PR (but note that the PR has access to all annotation sets)
- outputASName: an optional annotation set intended to be used as output by the PR (but note that the PR has access to all annotation sets)
- scriptParams: optional parameters for the script. In a creole.xml file, these should be specified as key=value pairs, each pair separated by a comma. For example: ’name=fred,type=person’ . In the GATE GUI, these are specified via a dialog.
Script bindings [#]
As with the Groovy console described above, and with JAPE right-hand-side Java code, Groovy scripts run by the scripting PR implicitly import all classes from the gate, gate.annotation, gate.util, gate.jape and gate.creole.ontology packages of the core GATE API. The Groovy scripting PR also makes available the following bindings, which you can use in your scripts:
- doc: the current document (DocumentImpl)
- content: the string content of the current document
- inputAS: the annotation set specified by inputASName in the PRs runtime parameters
- outputAS: the annotation set specified by outputASName in the PRs runtime parameters
Note that inputAS and outputAS are intended to be used as input and output AnnotationSets. This is, however, a convention: there is nothing to stop a script writing to or reading from any AnnotationSet.
Passing parameters to the script [#]
In addition to the above bindings, one further binding is available to the script:
- scriptParams: a FeatureMap with keys and values as specified by the scriptParams runtime parameter
For example, if you were to create a scriptParams runtime parameter for your PR, with the keys and values: ’name=fred,type=person’, then the values could be retrieved in your script via scriptParams.name and scriptParams.type
Examples [#]
The plugin directory Groovy/resources/scripts contains some example scripts. Below is the code for a naive regular expression PR.
1
2matcher = content =~ scriptParams.regex
3while(matcher.find())
4 outputAS.add(matcher.start(),
5 matcher.end(),
6 scriptParams.type,
7 Factory.newFeatureMap())
The script needs to have the runtime parameter scriptParams set with keys and values as follows:
- regex: the Groovy regular expression that you want to match e.g. [^\s]*ing
- type: the type of the annotation to create for each regex match, e.g. regexMatch
When the PR is run over a document, the script will first make a matcher over the document content for the regular expression given by the regex parameter. It will iterate over all matches for this regular expression, adding a new annotation for each, with a type as given by the type parameter.
7.16.3 Utility methods [#]
Loading the Groovy plugin adds some additional methods to several of the core GATE API classes and interfaces using the Groovy “mixin” mechanism. Any Groovy code that runs after the plugin has been loaded can make use of these additional methods, including snippets run in the Groovy console, scripts run using the Script PR, and any other Groovy code that uses the GATE Embedded API.
The methods that are injected come from two classes. The gate.Utils class (part of the core GATE API in gate.jar) defines a number of static methods that can be used to simplify common tasks such as getting the string covered by an annotation or annotation set, finding the start or end offset of an annotation (or set), etc. These methods do not use any Groovy-specific types, so they are usable from pure Java code in the usual way as well as being mixed in for use in Groovy. Additionally, the class gate.groovy.GateGroovyMethods (part of the Groovy plugin) provides methods that use Groovy types such as closures and ranges.
The added methods include:
- Unified access to the start and end offsets of an Annotation, AnnotationSet or Document: e.g. someAnnotation.start() or anAnnotationSet.end()
- Simple access to the DocumentContent or string covered by an annotation or annotation set: document.stringFor(anAnnotation), document.contentFor(annotationSet)
- Simple access to the length of an annotation or document, either as an int (annotation.length()) or a long (annotation.lengthLong()).
- A method to construct a FeatureMap from any map, to support constructions like def params = [sourceUrl:’http://gate.ac.uk’, encoding:’UTF-8’].toFeatureMap()
- A method to convert an annotation set into a List of annotations in the order they appear in the document, for iteration in a predictable order: annSet.inDocumentOrder().collect { it.type }
- The each, eachWithIndex and collect methods for a corpus have been redefined to properly load and unload documents if the corpus is stored in a datastore.
- Various getAt methods to support constructions like annotationSet["Token"] (get all Token annotations from the set), annotationSet[15..20] (get all annotations between offsets 15 and 20), documentContent[0..10] (get the document content between offsets 0 and 10).
- A withResource method for any resource, which calls a closure with the resource passed as a parameter, and ensures that the resource is properly deleted when the closure completes (analagous to the default Groovy method InputStream.withStream).
For full details, see the source code or javadoc documentation for these two classes.
7.17 Saving Config Data to gate.xml
Arbitrary feature/value data items can be saved to the user’s gate.xml file via the following API calls:
To get the config data: Map configData = Gate.getUserConfig().
To add config data simply put pairs into the map: configData.put("my new config key", "value");.
To write the config data back to the XML file: Gate.writeUserConfig();.
Note that new config data will simply override old values, where the keys are the same. In this way defaults can be set up by putting their values in the main gate.xml file, or the site gate.xml file; they can then be overridden by the user’s gate.xml file.
7.18 Annotation merging through the API [#]
If we have annotations about the same subject on the same document from different annotators, we may need to merge those annotations to form a unified annotation. Two approaches for merging annotations are implemented in the API, via static methods in the class gate.util.AnnotationMerging.
The two methods have very similar input and output parameters. Each of the methods takes an array of annotation sets, which should be the same annotation type on the same document from different annotators, as input. A single feature can also be specified as a parameter (or given asnull if no feature is to be specified).
The output is a map, the key of which is one merged annotation and the value of which represents the annotators (in terms of the indices of the array of annotation sets) who support the annotation. The methods also have a boolean input parameter to indicate whether or not the annotations from different annotators are based on the same set of instances, which can be determined by the static method public boolean isSameInstancesForAnnotators(AnnotationSet[] annsA) in the class gate.util.IaaCalculation. One instance corresponds to all the annotations with the same span. If the annotation sets are based on the same set of instances, the merging methods will ensure that the merged annotations are on the same set of instances.
The two methods corresponding to those described for the Annotation Merging plugin described in Section 19.11. They are:
- The Method public static void mergeAnnogation(AnnotationSet[] annsArr, String
nameFeat,
HashMap<Annotation,String>mergeAnns, int numMinK, boolean isTheSameInstances) merges the annotations stored in the array annsArr. The merged annotation is put into the map mergeAnns, with a key of the merged annotation and value of a string containing the indices of elements in the annotation set array annsArr which contain that annotation. NumMinK specifies the minimal number of the annotators supporting one merged annotation. The boolean parameter isTheSameInstances indicate if or not those annotation sets for merging are based on the same instances. - Method public static void mergeAnnogationMajority(AnnotationSet[] annsArr, String nameFeat, HashMap<Annotation, String>mergeAnns, boolean isTheSameInstances) selects the annotations which the majority of the annotators agree on. The meanings of parameters are the same as those in the above method.
1CREOLE stands for Collection of REusable Objects for Language Engineering
2Fully qualified name: gate.Factory
3Alternatively a string giving the document source may be provided.
5These are the same classes that are imported by default for use in Java code on the right hand side of JAPE rules.