
Chapter 22
Combining GATE and UIMA

UIMA (Unstructured Information Management Architecture) is a platform for natural language processing, originally developed by IBM but now maintained by the Apache Software Foundation. It has many similarities to the GATE architecture – it represents documents as text plus annotations, and allows users to define pipelines of analysis engines that manipulate the document (or Common Analysis Structure in UIMA terminology) in much the same way as processing resources do in GATE. The Apache UIMA SDK provides support for building analysis components in Java and C++, and for either running them locally on one machine or deploying them as services that can be accessed remotely. The SDK is available for download from http://incubator.apache.org/uima/.

Clearly, it would be useful to be able to include UIMA components in GATE applications and vice versa, letting GATE users take advantage of UIMA’s flexible deployment options and UIMA users access JAPE and the many useful plugins already available in GATE. This chapter describes the interoperability layer provided as part of GATE to support this. The UIMA-GATE interoperability layer is based on Apache UIMA 2.2.2. GATE 5.0 and earlier included an implementation based on version 1.2.3 of the pre-Apache IBM UIMA SDK.

The rest of this chapter assumes that you have at least a basic understanding of core UIMA concepts, such as type systems, primitive and aggregate analysis engines (AEs), feature structures, the format of AE XML descriptors, etc. It will probably be helpful to refer to the relevant sections of the UIMA SDK User’s Guide and Reference (supplied with the SDK) alongside this document.

There are two main parts to the interoperability layer:

A wrapper to allow a UIMA Analysis Engine (AE), whether primitive or aggregate, to be used within GATE as a Processing Resource (PR).
A wrapper to allow a GATE processing pipeline (specifically a CorpusController) to be used within UIMA as an AE.

The two components operate in very similar ways. Given a document in the source form (either a GATE Document or a UIMA CAS), a document in the target form is created with a copy of the source document’s text. Some of the annotations from the source are transferred to the target, according to a mapping defined by the user, and the target component is then run. Finally, some of the annotations on the updated target document are transferred back to the source, according to the user-defined mapping.
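The round trip described above can be sketched in plain Java. This is only an illustration of the control flow, not the wrappers' actual code: `Span`, `Doc`, `transfer` and `process` are hypothetical names, the mapping is reduced to a simple type-name table, and the sketch assumes the target component only appends new annotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative only: these classes are hypothetical stand-ins, not GATE or UIMA types. */
final class RoundTripSketch {

    /** A typed span over the document text. */
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Text plus annotations: the shape shared by a GATE Document and a UIMA CAS. */
    static final class Doc {
        final String text;
        final List<Span> spans = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** Copies every span whose type is a key of the mapping, renaming it to the mapped type. */
    static void transfer(List<Span> from, List<Span> to, Map<String, String> typeMapping) {
        for (Span s : from) {
            String mapped = typeMapping.get(s.type);
            if (mapped != null) {
                to.add(new Span(mapped, s.start, s.end));
            }
        }
    }

    static void process(Doc source, Map<String, String> inputs,
                        Map<String, String> outputs, Consumer<Doc> targetComponent) {
        Doc target = new Doc(source.text);              // 1. copy the text
        transfer(source.spans, target.spans, inputs);   // 2. map selected annotations across
        int before = target.spans.size();
        targetComponent.accept(target);                 // 3. run the target component
        // 4. map selected new annotations back (assumes the component only appends)
        List<Span> added = new ArrayList<>(target.spans.subList(before, target.spans.size()));
        transfer(added, source.spans, outputs);
    }
}
```

Only annotation types named in the mapping cross the boundary in either direction, which is why the mapping descriptor has separate input and output sections.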

The rest of this chapter describes this process in more detail. Section 22.1 describes the wrapper that embeds a UIMA AE in GATE, and Section 22.2 describes the wrapper that embeds a GATE CorpusController in UIMA.

22.1 Embedding a UIMA AE in GATE

Embedding a UIMA analysis engine in a GATE application is a two-step process. First, you must construct a mapping descriptor XML file to define how to map annotations between the UIMA CAS and the GATE Document. Second, this mapping file, along with the analysis engine descriptor, is used to instantiate an AnalysisEnginePR, which calls the analysis engine on an appropriately initialized CAS. Examples of all the XML files discussed in this section are available in examples/conf under the UIMA plugin directory.

22.1.1 Mapping File Format

Figure 22.1 shows the structure of a mapping descriptor. The inputs section defines how annotations on the GATE document are transferred to the UIMA CAS. The outputs section defines how annotations which have been added, updated and removed by the AE are transferred back to the GATE document.

<uimaGateMapping>
  <inputs>
    <uimaAnnotation type="..." gateType="..." indexed="true|false">
      <feature name="..." kind="string|int|float|fs">
        <!-- element defining the feature value goes here -->
      </feature>
      ...
    </uimaAnnotation>
  </inputs>

  <outputs>
    <added>
      <gateAnnotation type="..." uimaType="...">
        <feature name="...">
          <!-- element defining the feature value goes here -->
        </feature>
        ...
      </gateAnnotation>
    </added>

    <updated>
      ...
    </updated>

    <removed>
      ...
    </removed>
  </outputs>
</uimaGateMapping>

Figure 22.1: Structure of a mapping descriptor for an AE in GATE

Input Definitions

Each input definition takes the following form:

  <uimaAnnotation type="uima.Type" gateType="GATEType" indexed="true|false">
    <feature name="..." kind="string|int|float|fs">
      <!-- element defining the feature value goes here -->
    </feature>
    ...
  </uimaAnnotation>

When a document is processed, this will create one UIMA annotation of type uima.Type in the CAS for each GATE annotation of type GATEType in the input annotation set, covering the same offsets in the text. If indexed is true, GATE will keep a record of which GATE annotation gave rise to which UIMA annotation. If you wish to be able to track updates to this annotation’s features and transfer the updated values back into GATE, you must specify indexed="true". The indexed attribute defaults to false if omitted.

Each contained feature element will cause the corresponding feature to be set on the generated annotation. UIMA features can be string, integer or float valued, or can be a reference to another feature structure, and this must be specified in the kind attribute. The feature’s value is specified using a nested element, but exactly how this value is handled is determined by the kind.

There are various options for setting feature values:

<string value="fixed string" /> The simplest case - a fixed Java String.
<docFeatureValue name="featureName" /> The value of the given named feature of the current GATE document.
<gateAnnotFeatureValue name="featureName" /> The value of a given feature on the current GATE annotation (i.e. the one on which the offsets of the UIMA annotation are based).
<featureStructure type="uima.fs.Type">...</featureStructure> A feature structure of the given type. The featureStructure element can itself contain feature elements recursively.

The value is assigned to the feature according to the feature’s kind:

string: The value object’s toString() method is called, and the resulting String is set as the string value of the feature.
int: If the value object is a subclass of java.lang.Number, its intValue() method is called, and the result is set as the integer value of the feature. If the value object is not a Number, it is toString()ed, and the resulting String is parsed using Integer.parseInt(). If this succeeds, the integer result is used; if it fails, the feature is set to zero.
float: As for int, except that Numbers are converted by calling floatValue(), and non-Numbers are parsed using Float.parseFloat().
fs: The value object is assumed to be a FeatureStructure, and is used as-is. A ClassCastException will result if the value object is not a FeatureStructure.

In particular, <featureStructure> value elements should only be used with features of kind fs. While nothing will stop you using them with string features, the result will probably not be what you expected.
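The conversion rules for the string, int and float kinds can be sketched as follows. This is a minimal illustration of the rules as stated above, not the plugin's own code: the class and method names are hypothetical, and the sketch assumes a non-null value object (null handling is not described in this chapter).

```java
/** Hypothetical helper mirroring the documented conversion rules for feature kinds. */
final class KindConversion {
    private KindConversion() {}

    /** kind="string": the value object's toString() result. */
    static String toStringValue(Object value) {
        return value.toString();
    }

    /** kind="int": Number.intValue(), otherwise parse the string form, otherwise zero. */
    static int toIntValue(Object value) {
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            return 0; // documented fallback when parsing fails
        }
    }

    /** kind="float": as for int, using floatValue() and Float.parseFloat(). */
    static float toFloatValue(Object value) {
        if (value instanceof Number) {
            return ((Number) value).floatValue();
        }
        try {
            return Float.parseFloat(value.toString());
        } catch (NumberFormatException e) {
            return 0.0f;
        }
    }

    public static void main(String[] args) {
        System.out.println(toIntValue("42"));      // parsed from the string form
        System.out.println(toIntValue("forty"));   // unparseable, falls back to zero
        System.out.println(toFloatValue(7));       // converted via floatValue()
    }
}
```

Note that a non-numeric value silently becomes zero rather than raising an error, so a zero in the CAS does not by itself prove that the GATE feature held a zero.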

Output Definitions

The output definitions take a similar form. There are three groups:

added: Annotations which have been added by the AE, and for which corresponding new annotations are to be created in the GATE document.
updated: Annotations that were created by an input definition (with indexed="true") and whose feature values have been modified by the AE; the modified values are to be transferred back to the original GATE annotations.
removed: Annotations that were created by an input definition (with indexed="true") which have been removed from the CAS¹ and whose source annotations are to be removed from the GATE document.

The definition elements for these three groups all take the same form:

  <gateAnnotation type="GATEType" uimaType="uima.Type">
    <feature name="featureName">
      <!-- element defining the feature value goes here -->
    </feature>
    ...
  </gateAnnotation>

For added annotations, this has the mirror-image effect to the input definition – for each UIMA annotation of the given type, create a GATE annotation at the same offsets and set its feature values as specified by feature elements. For a gateAnnotation the feature elements do not have a kind, as features in GATE can have arbitrary Objects as values. The possible feature value elements for a gateAnnotation are:

<string value="fixed string" /> A fixed string, as before.
<uimaFSFeatureValue name="uima.Type:FeatureName" kind="string|int|float|fs" /> The value of the given feature of the current UIMA annotation. The feature name must be specified in fully-qualified form, including the type on which it is defined. The kind is used in a similar way as in input definitions:

string: The Java String object returned as the string value of the feature is used.
int: An Integer object is created from the integer value of the feature.
float: A Float object is created from the float value of the feature.
fs: The UIMA FeatureStructure object is returned. Since FeatureStructure objects are not guaranteed to be valid once the CAS has been cleared, a downstream GATE component must extract the relevant information from the feature structure before the next document is processed. You have been warned.

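Feature names in uimaFSFeatureValue must be qualified with their type name, as the feature may have been defined on a supertype of the annotation’s type, rather than on the type itself. For example, consider the following, where Type is defined on com.example.Entity itself but begin is defined on the supertype uima.tcas.Annotation: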

<gateAnnotation type="Entity" uimaType="com.example.Entity">
  <feature name="type">
    <uimaFSFeatureValue name="com.example.Entity:Type" kind="string" />
  </feature>
  <feature name="startOffset">
    <uimaFSFeatureValue name="uima.tcas.Annotation:begin" kind="int" />
  </feature>
</gateAnnotation>
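To make the role of the type qualifier concrete, here is a small sketch of a lookup that accepts a qualified name only when the declaring type is the annotation's type or one of its supertypes. The class `QualifiedFeatureSketch` and its tiny in-memory type table are hypothetical; the real UIMA type system API is not used here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical miniature type system used to illustrate qualified feature names. */
final class QualifiedFeatureSketch {
    // Type name -> supertype name (null for a root type).
    private final Map<String, String> parent = new HashMap<>();
    // Type name -> short names of the features declared directly on that type.
    private final Map<String, Set<String>> declared = new HashMap<>();

    void addType(String type, String supertype, String... features) {
        parent.put(type, supertype);
        declared.put(type, new HashSet<>(Arrays.asList(features)));
    }

    /** True if "declaring.Type:feature" names a feature usable on annotationType. */
    boolean isApplicable(String annotationType, String qualifiedName) {
        int colon = qualifiedName.lastIndexOf(':');
        if (colon < 0) {
            return false; // not in Type:feature form
        }
        String declaringType = qualifiedName.substring(0, colon);
        String feature = qualifiedName.substring(colon + 1);
        Set<String> features = declared.get(declaringType);
        if (features == null || !features.contains(feature)) {
            return false; // the feature is not declared on the named type
        }
        // Walk up from the annotation's type looking for the declaring type.
        for (String t = annotationType; t != null; t = parent.get(t)) {
            if (t.equals(declaringType)) {
                return true;
            }
        }
        return false;
    }
}
```

With the two types from the example registered, `uima.tcas.Annotation:begin` is applicable to `com.example.Entity`, whereas `com.example.Entity:begin` is not, because begin is declared on the supertype.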

For updated annotations, there must have been an input definition with indexed="true" with the same GATE and UIMA types. In this case, for each GATE annotation of the appropriate type, the UIMA annotation that was created from it is found in the CAS. The feature definitions are then used as in the added case, but here, the feature values are set on the original GATE annotation, rather than on a newly created annotation.

For removed annotations, the feature definitions are ignored, and the annotation is removed from GATE if the UIMA annotation which it gave rise to has been removed from the UIMA annotation index.
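The updated and removed cases both rely on the record kept for indexed="true" inputs. The sketch below shows one way such a record could drive the write-back; it is an assumption-laden illustration, not the plugin's implementation. Annotations are modelled as plain feature maps keyed by hypothetical integer ids, and for brevity a single `sync` method applies both the updated and the removed behaviour, whereas the real mapping descriptor configures them separately.

```java
import java.util.Map;

/** Hypothetical write-back step driven by a GATE-id to UIMA-id index. */
final class IndexedSyncSketch {
    private IndexedSyncSketch() {}

    /**
     * @param index           GATE annotation id -> id of the UIMA annotation created from it
     * @param gateAnnotations GATE annotation id -> feature map (modified in place)
     * @param uimaAnnotations UIMA annotations still present in the annotation index
     * @param featureMapping  GATE feature name -> UIMA feature name
     */
    static void sync(Map<Integer, Integer> index,
                     Map<Integer, Map<String, Object>> gateAnnotations,
                     Map<Integer, Map<String, Object>> uimaAnnotations,
                     Map<String, String> featureMapping) {
        for (Map.Entry<Integer, Integer> link : index.entrySet()) {
            Map<String, Object> uima = uimaAnnotations.get(link.getValue());
            if (uima == null) {
                // "removed": the UIMA annotation has gone, so drop its GATE source.
                gateAnnotations.remove(link.getKey());
                continue;
            }
            Map<String, Object> gate = gateAnnotations.get(link.getKey());
            if (gate == null) {
                continue;
            }
            // "updated": set the mapped values on the original GATE annotation.
            for (Map.Entry<String, String> f : featureMapping.entrySet()) {
                gate.put(f.getKey(), uima.get(f.getValue()));
            }
        }
    }
}
```

Without the index there would be no way to tell which GATE annotation a surviving or missing UIMA annotation corresponds to, which is why both groups require indexed="true" on the input definition.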

A Complete Example

Figure 22.2 shows a complete example mapping descriptor for a simple UIMA AE that takes tokens as input and adds a feature to each token giving the number of lower case letters in the token’s string.² In this case the UIMA feature that holds the number of lower case letters is called LowerCaseLetters, but the GATE feature is called numLower. This demonstrates that the feature names do not need to agree, so long as a mapping between them can be defined.

<uimaGateMapping>
  <inputs>
    <uimaAnnotation type="gate.uima.cas.Token" gateType="Token" indexed="true">
      <feature name="String" kind="string">
        <gateAnnotFeatureValue name="string" />
      </feature>
    </uimaAnnotation>
  </inputs>
  <outputs>
    <updated>
      <gateAnnotation type="Token" uimaType="gate.uima.cas.Token">
        <feature name="numLower">
          <uimaFSFeatureValue name="gate.uima.cas.Token:LowerCaseLetters"
                              kind="int" />
        </feature>
      </gateAnnotation>
    </updated>
  </outputs>
</uimaGateMapping>

Figure 22.2: An example mapping descriptor

22.1.2 The UIMA Component Descriptor

As well as the mapping file, you must provide the UIMA component descriptor that defines how to access the AE that is to be called. This could be a primitive or aggregate analysis engine descriptor, or a URI specifier giving the location of a remote Vinci or SOAP service. It is up to the developer to ensure that the types and features used in the mapping descriptor are compatible with the type system and capabilities of the AE, or a runtime error is likely to occur.

22.1.3 Using the AnalysisEnginePR

To use a UIMA AE in GATE Developer, load the UIMA plugin and create a ‘UIMA Analysis Engine’ processing resource. If using GATE Embedded, rather than GATE Developer, the class name is gate.uima.AnalysisEnginePR. The processing resource expects two parameters:

analysisEngineDescriptor: The URL of the UIMA analysis engine descriptor (or URI specifier, for a remote AE service). This must be a file: URL, as UIMA needs a file path against which to resolve imports.
mappingDescriptor: The URL of the mapping descriptor file. This may be any kind of URL (file:, http:, Class.getResource(), ServletContext.getResource(), etc.).

Any errors processing either of the descriptor files will cause an exception to be thrown. Once instantiated, the PR can be added to a pipeline in the usual way. AnalysisEnginePR implements LanguageAnalyser, so it can be used in any of the standard GATE pipeline types.
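When preparing these parameters from GATE Embedded, it can help to build the descriptor URL from a file system path and to check its scheme before handing it to the PR. The following sketch uses only the standard Java library; the class `DescriptorUrls` and the path `conf/MyAnnotator.xml` are hypothetical, and creating the PR itself is deliberately left out.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.file.Paths;

/** Hypothetical helpers for preparing descriptor URLs. */
final class DescriptorUrls {
    private DescriptorUrls() {}

    /** Converts a (possibly relative) file system path to an absolute file: URL. */
    static URL fileUrl(String path) {
        try {
            return Paths.get(path).toAbsolutePath().toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot convert to a URL: " + path, e);
        }
    }

    /** Parses an absolute URL string, reporting problems as unchecked exceptions. */
    static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    /** True if the URL uses the file: scheme required for analysisEngineDescriptor. */
    static boolean isFileUrl(URL url) {
        return "file".equals(url.getProtocol());
    }

    public static void main(String[] args) {
        URL descriptor = fileUrl("conf/MyAnnotator.xml"); // hypothetical path
        System.out.println(descriptor + " usable as AE descriptor: " + isFileUrl(descriptor));
    }
}
```

A mapping descriptor served over http: would pass `parse` but fail `isFileUrl`, which is acceptable for mappingDescriptor but not for analysisEngineDescriptor.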

The PR takes the following runtime parameter (in addition to the document parameter which is set automatically by a CorpusController):

annotationSetName: The annotation set to process. Any input mappings take annotations from this set, and any output mappings place their new annotations in this set (added outputs) or update the input annotations in this set (updated or removed). If not specified, the default (unnamed) annotation set is used.

The Annotator implementation must be available for GATE to load. For an annotator written in Java, this means that the JAR file containing the annotator class (and any other classes it depends on) must be present in the GATE classloader. The easiest way to achieve this is to put the JAR file or files in a new directory, and create a creole.xml file in the same directory to reference the JARs:

<CREOLE-DIRECTORY>
  <JAR>my-annotator.jar</JAR>
  <JAR>classes-it-uses.jar</JAR>
</CREOLE-DIRECTORY>

This directory should then be loaded in GATE as a CREOLE plugin. Note that, due to the complex mechanics of classloaders in Java, putting your JARs in GATE’s lib directory will not work.

For annotators written in C++ you need to ensure that the C++ enabler libraries (available separately from http://incubator.apache.org/uima/) and the shared library containing your annotator are in a directory which is on the PATH (Windows) or LD_LIBRARY_PATH (Linux) when GATE is run.

22.2 Embedding a GATE CorpusController in UIMA

The process of embedding a GATE controller in a UIMA application is more or less the mirror image of the process detailed in the previous section. Again, the developer must supply a mapping descriptor defining how to map between UIMA and GATE annotations, and pass this, plus the GATE controller definition, to an AE which performs the translation and calls the GATE controller.

22.2.1 Mapping File Format

The mapping descriptor format is virtually identical to that described in Section 22.1.1, except that the input definitions are <gateAnnotation> elements and the output definitions are <uimaAnnotation> elements. The input and output definition elements support an extra attribute, annotationSetName, which allows inputs to be taken from, and outputs to be placed in, different annotation sets. For instance, the following hypothetical fragment maps com.example.Person annotations into the default set and com.example.html.Anchor annotations to ‘a’ tags in the ‘Original markups’ set.

<inputs>
  <gateAnnotation type="Person" uimaType="com.example.Person">
    <feature name="kind">
      <uimaFSFeatureValue name="com.example.Person:Kind" kind="string"/>
    </feature>
  </gateAnnotation>

  <gateAnnotation type="a" annotationSetName="Original markups"
                  uimaType="com.example.html.Anchor">
    <feature name="href">
      <uimaFSFeatureValue name="com.example.html.Anchor:hRef" kind="string" />
    </feature>
  </gateAnnotation>
</inputs>

Figure 22.3 shows a mapping descriptor for an application that takes tokens and sentences produced by some UIMA component and runs the GATE part of speech tagger to tag them with Penn TreeBank POS tags.³ In the example, no features are copied from the UIMA tokens, but they are still indexed="true" as the POS feature must be copied back from GATE.

<uimaGateMapping>
  <inputs>
    <gateAnnotation type="Token"
                    uimaType="com.ibm.uima.examples.tokenizer.Token"
                    indexed="true" />
    <gateAnnotation type="Sentence"
                    uimaType="com.ibm.uima.examples.tokenizer.Sentence" />
  </inputs>
  <outputs>
    <updated>
      <uimaAnnotation type="com.ibm.uima.examples.tokenizer.Token"
                      gateType="Token">
        <feature name="POS" kind="string">
          <gateAnnotFeatureValue name="category" />
        </feature>
      </uimaAnnotation>
    </updated>
  </outputs>
</uimaGateMapping>

Figure 22.3: An example mapping descriptor for the GATE POS tagger

22.2.2 The GATE Application Definition

The GATE application to embed is given as a standard ‘.gapp file’, as produced by saving the state of an application in the GATE GUI. The .gapp file encodes the information necessary to load the correct plugins and create the various CREOLE components that make up the application. The .gapp file must be fully specified and able to be executed with no user intervention other than pressing the Go button. In particular, all runtime parameters must be set to their correct values before saving the application state. Also, since paths to things like CREOLE plugin directories, resource files, etc. are stored relative to the .gapp file’s location, you must not move the .gapp file to a different directory unless you can keep all the CREOLE plugins it depends on at the same relative locations. The ‘Export for GATE Cloud’ option (Section 3.9.4) may help you here.

22.2.3 Configuring the GATEApplicationAnnotator

GATEApplicationAnnotator is the UIMA annotator that handles mapping the CAS into a GATE document and back again and calling the GATE controller. There is a template AE descriptor XML file for the annotator provided in the conf directory. Most of the template file can be used unchanged, but you will need to modify the type system definition and input/output capabilities to match the types and features used in your mapping descriptor. If the mapping descriptor references a type or feature that is not defined in the type system, a runtime error will occur.

The annotator requires two external resources:

GateApplication: The .gapp file containing the saved application state.
MappingDescriptor: The mapping descriptor XML file.

These must be bound to suitable URLs, either by editing the resourceManagerConfiguration section of the primitive descriptor, or by supplying the binding in an aggregate descriptor that includes the GATEApplicationAnnotator as one of its delegates.

In addition, you may need to set the following Java system properties:

uima.gate.configdir: The path to the GATE config directory. This defaults to gate-config in the same directory as uima-gate.jar.
uima.gate.siteconfig: The location of the sitewide gate.xml configuration file. This defaults to uima.gate.configdir/site-gate.xml.
uima.gate.userconfig: The location of the user-specific gate.xml configuration file. This defaults to uima.gate.configdir/user-gate.xml.

The default config files are deliberately simplified from the standard versions supplied with GATE; in particular, they do not load any plugins automatically (not even ANNIE). All the plugins used by your application are specified in the .gapp file, and will be loaded when the application is loaded, so it is best to avoid loading any others from gate.xml, to avoid problems such as two different versions of the same plugin being loaded from different locations.
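The way the three system properties fall back on one another can be sketched as below. This is an illustration of the defaults as described above, not the annotator's own start-up code: `GateConfigDefaults` is a hypothetical class, the property values are treated as plain file system paths (an assumption), and the directory containing uima-gate.jar is passed in explicitly rather than discovered.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical resolver for the GATE config locations and their defaults. */
final class GateConfigDefaults {
    private GateConfigDefaults() {}

    /** uima.gate.configdir, defaulting to gate-config next to uima-gate.jar. */
    static Path configDir(Properties props, Path jarDir) {
        String value = props.getProperty("uima.gate.configdir");
        return value != null ? Paths.get(value) : jarDir.resolve("gate-config");
    }

    /** uima.gate.siteconfig, defaulting to site-gate.xml in the config directory. */
    static Path siteConfig(Properties props, Path jarDir) {
        String value = props.getProperty("uima.gate.siteconfig");
        return value != null ? Paths.get(value) : configDir(props, jarDir).resolve("site-gate.xml");
    }

    /** uima.gate.userconfig, defaulting to user-gate.xml in the config directory. */
    static Path userConfig(Properties props, Path jarDir) {
        String value = props.getProperty("uima.gate.userconfig");
        return value != null ? Paths.get(value) : configDir(props, jarDir).resolve("user-gate.xml");
    }

    public static void main(String[] args) {
        // Uses the JVM's real system properties; "lib" is a placeholder jar directory.
        Path jarDir = Paths.get("lib");
        System.out.println(siteConfig(System.getProperties(), jarDir));
        System.out.println(userConfig(System.getProperties(), jarDir));
    }
}
```

Setting only uima.gate.configdir therefore moves both gate.xml files at once, while setting one of the more specific properties overrides just that file.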

Classpath Notes

In addition to the usual UIMA library JAR files, GATEApplicationAnnotator requires a number of JAR files from the GATE distribution in order to function. In the first instance, you should include gate.jar from GATE’s bin directory, and also all the JAR files from GATE’s lib directory on the classpath. If you use the supplied Ant build file, ant documentanalyser will run the document analyser with this classpath. Depending on exactly which GATE plugins your application uses, you may be able to exclude some of the lib JAR files (for example, you will not need Weka if you do not use the machine learning plugin), but it is safest to start with them all. GATE will load plugin JAR files through its own classloader, so these do not need to be on the classpath.

¹Strictly speaking, removed from the annotation index, as feature structures cannot be removed from the CAS entirely.

²The Java code implementing this AE is in the examples directory of the UIMA plugin. The AE descriptor and mapping file are in examples/conf.

³The .gapp file implementing this example is in the test/conf directory under the UIMA plugin, along with the mapping file and the AE descriptor that will run it.
