Combining GATE and UIMA
UIMA (Unstructured Information Management Architecture) is a platform for natural language processing developed by IBM. It has many similarities to the GATE architecture – it represents documents as text plus annotations, and allows users to define pipelines of analysis engines that manipulate the document (or Common Analysis Structure in UIMA terminology) in much the same way as processing resources do in GATE. IBM has released an implementation of the UIMA architecture, called the UIMA SDK, that provides support for building analysis components in Java and C++ and running them either locally on one machine, or deploying them as services that can be accessed remotely. The SDK is available for download from http://alphaworks.ibm.com/tech/uima/.
Clearly, it would be useful to be able to include UIMA components in GATE applications and vice-versa, letting GATE users take advantage of UIMA’s flexible deployment options and UIMA users access JAPE and the many useful plugins already available in GATE. This chapter describes the interoperability layer provided as part of GATE to support this. The UIMA-GATE interoperability layer is based on the UIMA SDK version 1.2.3. Later versions of UIMA may also work but have not been tested.
The rest of this chapter assumes that you have at least a basic understanding of core UIMA concepts, such as type systems, primitive and aggregate text analysis engines (TAEs), feature structures, the format of AE XML descriptors, etc. It will probably be helpful to refer to the relevant sections of the UIMA SDK User’s Guide and Reference (supplied with the SDK) alongside this document.
There are two main parts to the interoperability layer:
- A wrapper to allow a UIMA Text Analysis Engine (TAE), whether primitive or aggregate, to be used within GATE as a Processing Resource (PR).
- A wrapper to allow a GATE processing pipeline (specifically a CorpusController) to be used within UIMA as a TAE.
The two components operate in very similar ways. Given a document in the source form (either a GATE Document or a UIMA CAS), a document in the target form is created with a copy of the source document’s text. Some of the annotations from the source are transferred to the target, according to a mapping defined by the user, and the target component is then run. Finally, some of the annotations on the updated target document are transferred back to the source, again according to the user-defined mapping.
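The copy, run and copy-back cycle described above can be pictured with a small self-contained Java sketch. Span, Doc, transfer and process are hypothetical names invented for illustration; they are not GATE or UIMA classes, and the sketch only models annotations that the wrapped component adds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy model only: Span and Doc are hypothetical stand-ins, not GATE or UIMA classes. */
final class RoundTripSketch {
    /** A typed annotation covering the offsets [start, end) of the text. */
    static final class Span {
        final String type;
        final int start;
        final int end;
        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document: text plus annotations. */
    static final class Doc {
        final String text;
        final List<Span> spans;
        Doc(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Copies the spans whose type is a key of typeMap, renaming each to the mapped type. */
    static List<Span> transfer(List<Span> source, Map<String, String> typeMap) {
        List<Span> out = new ArrayList<>();
        for (Span s : source) {
            String mapped = typeMap.get(s.type);
            if (mapped != null) {
                out.add(new Span(mapped, s.start, s.end));
            }
        }
        return out;
    }

    /** Runs component on a target copy of source and merges the mapped outputs back. */
    static Doc process(Doc source, Map<String, String> inputs,
                       Map<String, String> outputs, UnaryOperator<Doc> component) {
        // 1. The target gets a copy of the text plus the mapped input annotations.
        Doc target = new Doc(source.text, transfer(source.spans, inputs));
        // 2. The wrapped component is run on the target.
        Doc processed = component.apply(target);
        // 3. Mapped output annotations are copied back alongside the originals.
        List<Span> merged = new ArrayList<>(source.spans);
        merged.addAll(transfer(processed.spans, outputs));
        return new Doc(source.text, merged);
    }
}
```

Annotation types that appear in neither map never cross the boundary, which is why the mapping has to list every type the target component needs to see.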
16.1 Embedding a UIMA TAE in GATE
Embedding a UIMA text analysis engine in a GATE application is a two-step process. First, you must construct a mapping descriptor XML file to define how to map annotations between the UIMA CAS and the GATE Document. This mapping file, along with the analysis engine descriptor, is then used to instantiate an AnalysisEnginePR, which calls the analysis engine on an appropriately initialized CAS. Examples of all the XML files discussed in this section are available in examples/conf under the uima plugin directory.
16.1.1 Mapping File Format
Figure 16.1 shows the structure of a mapping descriptor. The inputs section defines how annotations on the GATE document are transferred to the UIMA CAS. The outputs section defines how annotations which have been added, updated and removed by the TAE are transferred back to the GATE document.
Each input definition takes the following form:
<uimaAnnotation type="uima.Type" gateType="GATEType" indexed="true|false">
  <feature name="..." kind="string|int|float|fs">
    <!-- element defining the feature value goes here -->
  </feature>
</uimaAnnotation>
When a document is processed, this will create one UIMA annotation of type uima.Type in the CAS for each GATE annotation of type GATEType in the input annotation set, covering the same offsets in the text. If indexed is true, GATE will keep a record of which GATE annotation gave rise to which UIMA annotation. If you wish to be able to track updates to this annotation’s features and transfer the updated values back into GATE, you must specify indexed="true". The indexed attribute defaults to false if omitted.
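As a rough illustration of how the three attributes of an input definition could be read, including the default for indexed, here is a minimal sketch using only the JDK's DOM parser. The class InputDefinitionSketch is hypothetical and is not how the GATE plugin itself reads the mapping file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical holder for the attributes of one uimaAnnotation input definition. */
final class InputDefinitionSketch {
    final String uimaType;
    final String gateType;
    final boolean indexed;

    InputDefinitionSketch(String uimaType, String gateType, boolean indexed) {
        this.uimaType = uimaType;
        this.gateType = gateType;
        this.indexed = indexed;
    }

    /** Parses a single uimaAnnotation element given as an XML string. */
    static InputDefinitionSketch parse(String xml) {
        try {
            Element e = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // getAttribute returns "" for a missing attribute, and
            // Boolean.parseBoolean("") is false, so indexed defaults to false.
            return new InputDefinitionSketch(
                    e.getAttribute("type"),
                    e.getAttribute("gateType"),
                    Boolean.parseBoolean(e.getAttribute("indexed")));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Cannot parse input definition", ex);
        }
    }
}
```

For example, parsing an element that has only type and gateType attributes yields a definition whose indexed field is false, so feature updates on those annotations would not be tracked.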
Each contained feature element will cause the corresponding feature to be set on the generated annotation. UIMA features can be string, integer or float valued, or can be a reference to another feature structure, and this must be specified in the kind attribute. The feature’s value is specified using a nested element, but exactly how this value is handled is determined by the kind.
There are various options for setting feature values:
- <string value="fixed string" /> The simplest case - a fixed Java String.
- <docFeatureValue name="featureName" /> The value of the given named feature of the current GATE document.
- <gateAnnotFeatureValue name="featureName" /> The value of a given feature on the current GATE annotation (i.e. the one on which the offsets of the UIMA annotation are based).
- <featureStructure type="uima.fs.Type">...</featureStructure> A feature structure of the given type. The featureStructure element can itself contain feature elements recursively.
The value is assigned to the feature according to the feature’s kind:
- string: The value object’s toString() method is called, and the resulting String is set as the string value of the feature.
- int: If the value object is a subclass of java.lang.Number, its intValue() method is called, and the result is set as the integer value of the feature. If the value object is not a Number, it is toString()ed, and the resulting String is parsed using Integer.parseInt(). If this succeeds, the integer result is used; if it fails, the feature is set to zero.
- float: As for int, except that Numbers are converted by calling floatValue(), and non-Numbers are parsed using Float.parseFloat().
- fs: The value object is assumed to be a FeatureStructure, and is used as-is. A ClassCastException will result if the value object is not a FeatureStructure.
In particular, <featureStructure> value elements should only be used with features of kind fs. While nothing will stop you using them with string features, the result will probably not be what you expected.
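The conversion rules for the string, int and float kinds can be written down as a short Java sketch. FeatureValueSketch and its method names are hypothetical; the fs kind is left out because it needs the UIMA FeatureStructure class, and null values are not handled.

```java
/** Hypothetical helper mirroring the conversion rules for string, int and float kinds. */
final class FeatureValueSketch {
    /** string kind: the value's toString() result is used. */
    static String asString(Object value) {
        return value.toString();
    }

    /** int kind: Numbers use intValue(); anything else is parsed, falling back to zero. */
    static int asInt(Object value) {
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            return 0; // unparseable values become zero
        }
    }

    /** float kind: as for int, but with floatValue() and Float.parseFloat(). */
    static float asFloat(Object value) {
        if (value instanceof Number) {
            return ((Number) value).floatValue();
        }
        try {
            return Float.parseFloat(value.toString());
        } catch (NumberFormatException e) {
            return 0f; // unparseable values become zero
        }
    }
}
```

Note that a Double such as 3.9 is truncated to 3 by intValue(), and that a string which is not a valid integer silently becomes zero, so a bad mapping may show up as unexpected zeros and not as an error.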
The output definitions take a similar form. There are three groups:
- Annotations which have been added by the TAE, and for which corresponding new annotations are to be created in the GATE document.
- Annotations that were created by an input definition (with indexed="true") whose feature values have been modified by the TAE, and these values are to be transferred back to the original GATE annotations.
- Annotations that were created by an input definition (with indexed="true") which have been removed from the CAS [1] and whose source annotations are to be removed from the GATE document.
The definition elements for these three types all take the same form:
<gateAnnotation type="GATEType" uimaType="uima.Type">
  <feature name="...">
    <!-- element defining the feature value goes here -->
  </feature>
</gateAnnotation>
For added annotations, this has the mirror-image effect to the input definition – for each UIMA annotation of the given type, create a GATE annotation at the same offsets and set its feature values as specified by feature elements. For a gateAnnotation the feature elements do not have a kind, as features in GATE can have arbitrary Objects as values. The possible feature value elements for a gateAnnotation are:
- <string value="fixed string" /> A fixed string, as before.
- <uimaFSFeatureValue name="uima.Type:FeatureName" kind="string|int|float|fs" /> The value of the given feature of the current UIMA annotation. The feature name must be specified in fully-qualified form, including the type on which it is defined. The kind is used in a similar way as in input definitions:
  - string: The Java String object returned as the string value of the feature is used.
  - int: An Integer object is created from the integer value of the feature.
  - float: A Float object is created from the float value of the feature.
  - fs: The UIMA FeatureStructure object is returned. Since FeatureStructure objects are not guaranteed to be valid once the CAS has been cleared, a downstream GATE component must extract the relevant information from the feature structure before the next document is processed. You have been warned.
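Feature names in uimaFSFeatureValue must be qualified with their type name, as the feature may have been defined on a supertype of the annotation’s own type, rather than on the type itself. For example, consider the following:
<gateAnnotation type="Entity" uimaType="com.example.Entity">
  <feature name="...">
    <uimaFSFeatureValue name="com.example.Entity:Type" kind="string" />
  </feature>
  <feature name="...">
    <uimaFSFeatureValue name="uima.tcas.Annotation:begin" kind="int" />
  </feature>
</gateAnnotation>
Here the Type feature is defined on com.example.Entity itself, whereas begin is defined on the supertype uima.tcas.Annotation.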
For updated annotations, there must have been an input definition with indexed="true" with the same GATE and UIMA types. In this case, for each GATE annotation of the appropriate type, the UIMA annotation that was created from it is found in the CAS. The feature definitions are then used as in the added case, but here, the feature values are set on the original GATE annotation, rather than on a newly created annotation.
For removed annotations, the feature definitions are ignored, and the annotation is removed from GATE if the UIMA annotation which it gave rise to has been removed from the UIMA annotation index.
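The bookkeeping behind indexed="true" can be pictured as a map from each GATE annotation to the UIMA annotation created from it. The sketch below is a toy model under that assumption; IndexedAnnotationSketch, removedGateIds and the use of integer IDs and a generic handle type U are all hypothetical and not taken from the plugin's source.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the record kept for indexed input definitions. */
final class IndexedAnnotationSketch {
    /**
     * Returns the IDs of the GATE annotations whose UIMA counterparts are
     * no longer present in the UIMA annotation index after the TAE has run.
     *
     * @param gateIdToUima record built while copying input annotations into the CAS
     * @param stillIndexed UIMA annotations still in the annotation index
     */
    static <U> Set<Integer> removedGateIds(Map<Integer, U> gateIdToUima,
                                           Set<U> stillIndexed) {
        Set<Integer> removed = new TreeSet<>();
        for (Map.Entry<Integer, U> entry : gateIdToUima.entrySet()) {
            if (!stillIndexed.contains(entry.getValue())) {
                removed.add(entry.getKey());
            }
        }
        return removed;
    }
}
```

The same record serves the updated case: for each entry whose UIMA annotation is still indexed, the feature definitions are evaluated against that UIMA annotation and the results are set on the GATE annotation with the matching ID.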
Figure 16.2 shows a complete example mapping descriptor for a simple UIMA TAE that takes tokens as input and adds a feature to each token giving the number of lower case letters in the token’s string. [2] In this case the UIMA feature that holds the number of lower case letters is called LowerCaseLetters, but the GATE feature is called numLower. This demonstrates that the feature names do not need to agree, so long as a mapping between them can be defined.
16.1.2 The UIMA component descriptor
As well as the mapping file, you must provide the UIMA component descriptor that defines how to access the TAE that is to be called. This could be a primitive or aggregate analysis engine descriptor, or a URI specifier giving the location of a remote Vinci or SOAP service. It is up to the developer to ensure that the types and features used in the mapping descriptor are compatible with the type system and capabilities of the TAE, or a runtime error is likely to occur.
16.1.3 Using the AnalysisEnginePR
To use a UIMA TAE in GATE, load the uima plugin and create a “UIMA Analysis Engine” processing resource. If using the GATE framework rather than the GUI, the class name is gate.uima.AnalysisEnginePR. The processing resource expects two parameters:
- The URL of the UIMA analysis engine descriptor (or URI specifier, for a remote TAE service). This must be a file: URL, as UIMA needs a file path against which to resolve imports.
- The URL of the mapping descriptor file. This may be any kind of URL (file:, http:, or one returned by Class.getResource() or ServletContext.getResource(), etc.).
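Since the analysis engine descriptor has to be given as a file: URL, it can help to build that URL from a file system path and check its protocol before handing it to the PR. The following is a minimal sketch using only the JDK; DescriptorUrlSketch and the path used in the usage note are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Paths;

/** Hypothetical helper for preparing descriptor URLs. */
final class DescriptorUrlSketch {
    /** Converts a (possibly relative) file system path into an absolute file: URL. */
    static URL toFileUrl(String path) {
        try {
            return Paths.get(path).toAbsolutePath().toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable file path: " + path, e);
        }
    }

    /** True if the URL uses the file: protocol, as required for the AE descriptor. */
    static boolean isFileUrl(URL url) {
        return "file".equals(url.getProtocol());
    }
}
```

For instance, DescriptorUrlSketch.toFileUrl("conf/MyEngine.xml") gives an absolute file: URL that UIMA can resolve imports against, whereas a descriptor URL obtained from Class.getResource() may be a jar: URL and would fail the isFileUrl check.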
Any errors processing either of the descriptor files will cause an exception to be thrown. Once instantiated, you can add the PR to a pipeline in the usual way. AnalysisEnginePR implements LanguageAnalyser, so it can be used in any of the standard GATE pipeline types.
The PR takes the following runtime parameter (in addition to the document parameter which is set automatically by a CorpusController):
- The annotation set to process. Any input mappings take annotations from this set, and any output mappings place their new annotations in this set (added outputs) or update the input annotations in this set (updated or removed). If not specified, the default (unnamed) annotation set is used.
The Annotator implementation must be available for GATE to load. For an annotator written in Java, this means that the JAR file containing the annotator class (and any other classes it depends on) must be present in the GATE classloader. The easiest way to achieve this is to put the JAR file or files in a new directory, and create a creole.xml file in the same directory that references each JAR in a <JAR> element inside the top-level <CREOLE-DIRECTORY> element.
This directory should then be loaded in GATE as a CREOLE plugin. Note that, due to the complex mechanics of classloaders in Java, putting your JARs in GATE’s lib directory will not work.
For annotators written in C++ you need to ensure that the C++ enabler libraries (available separately from http://alphaworks.ibm.com/tech/uima/) and the shared library containing your annotator are in a directory which is on the PATH (Windows) or LD_LIBRARY_PATH (Linux) when GATE is run.
16.1.4 Current limitations
If you are using Java 5.0 you may get a NullPointerException or NoClassDefFoundError from the UIMA XML parser when parsing the analysis engine descriptor. This can be fixed by copying xml.jar from the lib directory into GATE_HOME/lib.
16.2 Embedding a GATE CorpusController in UIMA
The process of embedding a GATE controller in a UIMA application is more or less the mirror image of the process detailed in the previous section. Again, the developer must supply a mapping descriptor defining how to map between UIMA and GATE annotations, and pass this, plus the GATE controller definition, to a TAE which performs the translation and calls the GATE controller.
16.2.1 Mapping file format
The mapping descriptor format is virtually identical to that described in section 16.1.1, except that the input definitions are <gateAnnotation> elements and the output definitions are <uimaAnnotation> elements. The input and output definition elements support an extra attribute, annotationSetName, which allows inputs to be taken from, and outputs to be placed in, different annotation sets. For instance, the following hypothetical definitions map com.example.Person annotations into the default set and com.example.html.Anchor annotations to “a” tags in the “Original markups” set.
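<gateAnnotation type="Person" uimaType="com.example.Person">
  <feature name="...">
    <uimaFSFeatureValue name="com.example.Person:Kind" kind="string"/>
  </feature>
</gateAnnotation>
<gateAnnotation type="a" annotationSetName="Original markups"
                uimaType="com.example.html.Anchor">
  <feature name="...">
    <uimaFSFeatureValue name="com.example.html.Anchor:hRef" kind="string" />
  </feature>
</gateAnnotation>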
Figure 16.3 shows a mapping descriptor for an application that takes tokens and sentences produced by some UIMA component and runs the GATE part of speech tagger to tag them with Penn TreeBank POS tags. [3] In the example, no features are copied from the UIMA tokens, but they are still indexed="true" as the POS feature must be copied back from GATE.
16.2.2 The GATE application definition
The GATE application to embed is given as a standard “.gapp file”, as produced by saving the state of an application in the GATE GUI. The .gapp file encodes the information necessary to load the correct plugins and create the various CREOLE components that make up the application. The .gapp file must be fully specified and able to be executed with no user intervention other than pressing the Go button. In particular, all runtime parameters must be set to their correct values before saving the application state. Also, since paths to things like CREOLE plugin directories, resource files, etc. are stored relative to the .gapp file’s location, you must not move the .gapp file to a different directory unless you can keep all the CREOLE plugins it depends on at the same relative locations.
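The effect of relative paths can be illustrated with plain URI resolution in Java. This sketch does not model the .gapp file format or GATE's own loading code; GappRelativePathSketch and the locations used below are hypothetical, and it only shows why the same relative reference points somewhere else once the base file has moved.

```java
import java.net.URI;

/** Illustrates relative resolution only; this is not GATE's .gapp loading code. */
final class GappRelativePathSketch {
    /** Resolves a relative reference against the location of the saved application file. */
    static URI resolveAgainstGapp(URI gappLocation, String relativePath) {
        return gappLocation.resolve(relativePath);
    }
}
```

With the application saved at file:/apps/myapp/app.gapp, the reference plugins/Tagger/ resolves to file:/apps/myapp/plugins/Tagger/. If the file alone is moved to file:/elsewhere/app.gapp, the same reference resolves to file:/elsewhere/plugins/Tagger/, where the plugin will not be found unless it was moved as well.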
GATEApplicationAnnotator is the UIMA annotator that handles mapping the CAS into a GATE document and back again and calling the GATE controller. There is a template TAE descriptor XML file for the annotator provided in the conf directory. Most of the template file can be used unchanged, but you will need to modify the type system definition and input/output capabilities to match the types and features used in your mapping descriptor. If the mapping descriptor references a type or feature that is not defined in the type system, a runtime error will occur.
The annotator requires two external resources:
- The .gapp file containing the saved application state.
- The mapping descriptor XML file.
These must be bound to suitable URLs, either by editing the resourceManagerConfiguration section of the primitive descriptor, or by supplying the binding in an aggregate descriptor that includes the GATEApplicationAnnotator as one of its delegates.
In addition, you may need to set the following Java system properties:
- The path to the GATE config directory. This defaults to gate-config in the same directory as uima-gate.jar.
- The location of the sitewide gate.xml configuration file. This defaults to gate.uima.configdir/site-gate.xml.
- The location of the user-specific gate.xml configuration file. This defaults to gate.uima.configdir/user-gate.xml.
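The defaulting rules in the list above can be summarised in a few lines of Java. This is only a sketch of the stated defaults: GateConfigDefaultsSketch and its method names are hypothetical, the configured arguments stand for whatever value the corresponding system property holds (null or empty if unset), and the annotator's real lookup code may differ.

```java
import java.io.File;

/** Sketch of the default-resolution rules; class and method names are hypothetical. */
final class GateConfigDefaultsSketch {
    private static boolean isSet(String value) {
        return value != null && !value.isEmpty();
    }

    /** Config directory: the configured path, else gate-config next to uima-gate.jar. */
    static File configDir(String configured, File uimaGateJarDir) {
        return isSet(configured) ? new File(configured)
                                 : new File(uimaGateJarDir, "gate-config");
    }

    /** Site-wide configuration: the configured path, else site-gate.xml in the config directory. */
    static File siteConfig(String configured, File configDir) {
        return isSet(configured) ? new File(configured)
                                 : new File(configDir, "site-gate.xml");
    }

    /** User configuration: the configured path, else user-gate.xml in the config directory. */
    static File userConfig(String configured, File configDir) {
        return isSet(configured) ? new File(configured)
                                 : new File(configDir, "user-gate.xml");
    }
}
```

Because both file defaults are derived from the config directory, overriding only the directory is enough to relocate all three settings together.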
The default config files are deliberately simplified from the standard versions supplied with GATE, in particular they do not load any plugins automatically (not even ANNIE). All the plugins used by your application are specified in the .gapp file, and will be loaded when the application is loaded, so it is best to avoid loading any others from gate.xml, to avoid problems such as two different versions of the same plugin being loaded from different locations.
In addition to the usual UIMA library JAR files, GATEApplicationAnnotator requires a number of JAR files from the GATE distribution in order to function. In the first instance, you should include gate.jar from GATE’s bin directory, and also all the JAR files from GATE’s lib directory on the classpath. If you use the supplied Ant build file, ant documentanalyser will run the document analyser with this classpath. Depending on exactly which GATE plugins your application uses, you may be able to exclude some of the lib JAR files (for example, you will not need Weka if you do not use the machine learning plugin), but it is safest to start with them all. GATE will load plugin JAR files through its own classloader, so these do not need to be on the classpath.
Note that the GATE lib directory includes a version of the Apache Xerces XML parser. UIMA also includes an XML parser in its xml.jar. If your program generates unexplained XML parsing exceptions, try removing one or other of the XML parsers from the classpath to see if this solves the problem.
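When chasing such problems it can be useful to find out which XML parser implementation the JVM is actually picking up. The sketch below uses only standard JAXP and reflection calls; XmlParserCheckSketch is a hypothetical diagnostic, and it inspects only the DOM parser factory, so a component that uses a different parser API would need a similar check.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

/** Hypothetical diagnostic: reports which JAXP DOM parser implementation is in use. */
final class XmlParserCheckSketch {
    /** Returns the factory implementation class and where it was loaded from. */
    static String describeParser() {
        Class<?> impl = DocumentBuilderFactory.newInstance().getClass();
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        // Classes supplied by the JDK itself may have no code source location.
        String location = (source == null || source.getLocation() == null)
                ? "(no code source; probably the JDK's built-in parser)"
                : source.getLocation().toString();
        return impl.getName() + " loaded from " + location;
    }
}
```

Printing XmlParserCheckSketch.describeParser() with each candidate classpath shows whether the parser comes from a JAR on the classpath or from the JDK, which narrows down which JAR to remove.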
[1] Strictly speaking, removed from the annotation index, as feature structures cannot be removed from the CAS entirely.
[2] The Java code implementing this AE is in the examples directory of the uima plugin. The AE descriptor and mapping file are in examples/conf.
[3] The .gapp file implementing this example is in the test/conf directory under the uima plugin, along with the mapping file and the TAE descriptor that will run it.