The Montreal Transducer module for GATE
User guide
plamondl@iro.umontreal.ca
Table of contents
- What is GATE?
- What is the Montreal Transducer?
- Getting help
- Installation procedure
- How to use it with the GATE GUI?
- How to use it in a standalone GATE program?
- Changes to the JAPE language
- For developers
- Licence
- Change log
1) What is GATE?
GATE is a development environment for language engineering. It is open source and can be downloaded from http://gate.ac.uk. The processing of a document is divided into small tasks that are performed by independent JavaBeans modules. The Montreal Transducer is one of those modules.
2) What is the Montreal Transducer?
A transducer has 2 inputs: a document and a human-readable grammar. Generally, the output is a document with annotations added according to the grammar, but it could be anything else because the grammar allows Java code to be executed when a rule matches. A transducer can be used to identify named entities in a document, for example.
The GATE framework comes with a basic "Jape Transducer", which is fully described in the Gate user guide. The JAPE grammar language understood by the transducer is also explained there. There is also an "Ontology Aware Transducer" that is a wrapper around the Jape Transducer (in fact, the latter's core is already ontology aware). And there is an "ANNIE Transducer" that is nothing more than a Jape Transducer that loads with a named-entity recognition grammar.
The Montreal Transducer is an improved Jape Transducer. It is intended to make grammar authoring easier by providing a more flexible version of the JAPE language and it also fixes a few bugs.
If you write JAPE grammars, see section Changes to the JAPE language for all the details. Otherwise, here is a short description of the enhancements:
a) The improvements
- While only '==' constraints were allowed on annotation attributes, the grammar now accepts constraints such as {MyAnnot.attrib != value}, {MyAnnot.attrib > value}, {MyAnnot.attrib < value}, {MyAnnot.attrib =~ value} and {MyAnnot.attrib !~ value}
- The grammar now accepts negated constraints such as {!MyAnnot} (true if no annotation starting from current node has the MyAnnot type) and {!MyAnnot.attrib == value} (true if {MyAnnot.attrib == value} fails), where the '==' constraint can be any other operator
- Because the transducer compiles rules at run-time, the classpath must include the transducer jar file (unless the transducer is bundled in the GATE jar file). The Montreal Transducer updates the classpath automatically when it is initialised.
b) The bugs fixed
- Constraints on more than one annotation type for the same node now work. For example, {MyAnnot1, MyAnnot2} was accepted by the Jape Transducer but not implemented yet
- The "*" and "+" Kleene operators were not greedy when they occurred inside a rule. The document region matched by a rule was correct, but ambiguous labels inside the rule were not resolved the expected way. In the following rule, for example, a node that matches both constraints should be part of the ":titles" label and not ":names" because the first "+" is expected to be greedy:
({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})+:names
3) Getting help
The reader should be familiar with the Jape language. See the Gate user guide, more specifically section JAPE: Regular Expressions Over Annotations and appendix JAPE: Implementation.
The Montreal Transducer sources are freely available, so user support will be very limited. You may find what you are looking for on the project homepage.
Developers will find comments on classes and methods through the javadoc pages: doc/javadoc/index.html.
4) Installation procedure
Java 1.4 or higher is required. The Montreal Transducer has been tested on GATE 2.1, 2.2 and 3.0. If you are using GATE 2.x, put the MtlTransducer.jar and creole.xml files in any directory (as long as they are in the same directory). If you are using GATE 3.0, put the 2 files in your plugin directory (more about plugins in the Gate user guide, section Use (CREOLE) Plug-ins).
Note that the directory must be accessible by the embedding application via the "file:" protocol. Unlike for most GATE modules, the directory (also known as a repository in GATE 2.x) of a transducer cannot be a web URL ("http://www..."). This is because the transducer compiles Java code (the grammar rules) every time it is loaded and the resource jar file must be part of the classpath when compiling, but only regular file URLs are allowed in the classpath. The resource will try to add the jar file to the classpath automatically.
If problems arise when loading the transducer, add the jar file to the classpath manually prior to running the application.
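The "file:" requirement above can be checked before a repository is registered. The following is only a minimal sketch using the standard Java library; the class and method names (RepositoryUrlSketch, isFileUrl, toFileUrl) are hypothetical and are not part of GATE or the Montreal Transducer.

```java
import java.io.File;
import java.net.URI;

/** Hypothetical helper; not part of GATE or the Montreal Transducer. */
final class RepositoryUrlSketch {

    /** Returns true only when the URL string uses the "file" scheme. */
    static boolean isFileUrl(String url) {
        return "file".equalsIgnoreCase(URI.create(url).getScheme());
    }

    /** Builds an absolute "file:" URL string for a local directory. */
    static String toFileUrl(File directory) {
        return directory.toURI().toString();
    }

    public static void main(String[] args) {
        String local = toFileUrl(new File("MtlTransducer/build"));
        System.out.println(local + " usable as repository: " + isFileUrl(local));
        String remote = "http://gate.ac.uk/";
        System.out.println(remote + " usable as repository: " + isFileUrl(remote));
    }
}
```

A string accepted by isFileUrl can then be turned into a java.net.URL and handed to GATE in the usual way.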
If you plan to use the transducer with the GATE GUI, see section How to use it with the GATE GUI. If you plan to use it in a standalone program, jump to section How to use it in a standalone GATE program.
5) How to use it with the GATE GUI
Gate 2.x: In the GUI menu, click on File / Load a CREOLE Repository, then enter the URL of the directory where MtlTransducer.jar and creole.xml files live. The path must begin with "file:". It cannot be a web URL (see Installation procedure).
Gate 3.0: In the GUI menu, click on File / Manage CREOLE plugins, find the Montreal Transducer and tick the "Load now" or "Load always" box.
Then, for all versions of GATE: Click on File / New processing resource and choose Montreal Transducer. The only mandatory field is the Grammar URL: enter the path of a main.jape file in the same manner as for a regular Jape Transducer (this URL can point to a file on the web). Add the new module to a processing pipeline. It may be necessary to run a tokeniser and gazetteer before the transducer if the grammar uses Token and Lookup annotations.
6) How to use it in a standalone GATE program?
Note: this section was written for GATE 2.x. If you are using GATE 3.0, repository management (setting the plugin directory) may work differently.
A good starting point is the example code here. The following code registers a repository (the directory where the MtlTransducer.jar and creole.xml files live; the directory cannot be a web URL, see Installation procedure), then creates a Montreal Transducer with specific parameters (the grammarURL parameter is mandatory and it should point to a main.jape file like for a regular Jape Transducer), and finally adds the resource to a pipeline. It may be necessary to run a tokeniser and gazetteer before the transducer if the grammar uses Token and Lookup annotations.
import java.net.URL;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

// Create a pipeline
SerialAnalyserController annieController = (SerialAnalyserController)
    Factory.createResource("gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(), Factory.newFeatureMap(),
        "ANNIE_" + Gate.genSym());
// Load a tokeniser, gazetteer, etc. here

// Register the external repository where the Montreal Transducer jar file lives
Gate.getCreoleRegister().registerDirectories(new URL("file:MtlTransducer/build"));

// Create an instance of the transducer after having set the grammar URL
FeatureMap params = Factory.newFeatureMap();
params.put("grammarURL", new URL("file:creole/NE/main.jape"));
params.put("inputASName", "Original markups");
ProcessingResource transducerPR = (ProcessingResource)
    Factory.createResource("ca.umontreal.iro.rali.gate.MtlTransducer", params);

// Add the transducer to the pipeline
annieController.add(transducerPR);
7) Changes to the JAPE language
The Montreal Transducer is based on the Transducer from the ANNIE suite but with the following added features:
- It provides more comparison operators in left hand side constraints
- It allows conjunctions of constraints on different types of annotation
- It guarantees that the "*" and "+" Kleene operators are greedy
The Montreal Transducer offers more comparison operators to put in left hand side constraints of a JAPE grammar. The standard ANNIE transducer only allows constraints like these:
- {MyAnnot} // true if the current annotation is a MyAnnot annotation
- {MyAnnot.attrib == "3"} // true if the attrib attribute has a value equal to "3"
The Montreal Transducer also accepts the following constraints:
- {!MyAnnot} // true if NO annotation at current point is a MyAnnot
- {!MyAnnot.attrib == 3} // true if attrib is not equal to 3
- {MyAnnot.attrib != 3} // true if attrib is not equal to 3
- {MyAnnot.attrib > 3} // true if attrib > 3
- {MyAnnot.attrib >= 3} // true if attrib ≥ 3
- {MyAnnot.attrib < 3} // true if attrib < 3
- {MyAnnot.attrib <= 3} // true if attrib ≤ 3
- {MyAnnot.attrib =~ "[Dd]ogs?"} // true if the regular expression matches attrib entirely
- {MyAnnot.attrib !~ "[Dd]ogs?"} // true if the regular expression does not match attrib
Notes on equality operators: "==" and "!="
The "!=" operator is the negation of the "==" operator, that is to say: {Annot.attribute != value} is equivalent to {!Annot.attribute == value}.
When a constraint on an attribute cannot be evaluated because an annotation does not have a value for the attribute, the equality operator returns false (and the difference operator returns true).
If the constraint's attribute is a string, then the String.equals method is called with the annotation's attribute as a parameter. If the constraint's attribute is an integer, then the Long.equals method is called. If the constraint's attribute is a float, then the Double.equals method is called. And if the constraint's attribute is a boolean, then the Boolean.equals method is called. The grammar parser does not allow other types of constraints.
Normally, when the types of the constraint's and the annotation's attribute differ, they cannot be equal. However, because some ANNIE processing resources (namely the tokeniser) set all attribute values as strings even when they are numbers (Token.length is set to a string value, for example), the Montreal Transducer can convert the string to a Long/Double/Boolean before testing for equality. In other words, for the token "dog":
- {Token.length == "3"} is true using either the ANNIE transducer or the Montreal Transducer
- {Token.length == 3} is false using the ANNIE transducer, but true using the Montreal Transducer
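The equality rules above can be illustrated with a small Java sketch. It is not the transducer's actual code: the class EqualitySketch and its methods are hypothetical, and it assumes that integer constraints are represented as Long, floats as Double and booleans as Boolean, as the notes describe.

```java
/** Hypothetical illustration of the equality rules; not the transducer's code. */
final class EqualitySketch {

    /** Evaluates {Annot.attrib == constraintValue} against one attribute value. */
    static boolean isEqual(Object constraintValue, Object annotValue) {
        if (annotValue == null) {
            return false; // missing attribute: "==" is false
        }
        Object value = annotValue;
        if (value instanceof String && !(constraintValue instanceof String)) {
            // Attribute stored as a string: convert it to the constraint's type.
            String text = (String) value;
            try {
                if (constraintValue instanceof Long) {
                    value = Long.valueOf(text);
                } else if (constraintValue instanceof Double) {
                    value = Double.valueOf(text);
                } else if (constraintValue instanceof Boolean) {
                    value = Boolean.valueOf(text);
                }
            } catch (NumberFormatException e) {
                return false; // not a number, so the values cannot be equal
            }
        }
        // equals is called on the constraint with the attribute as parameter.
        return constraintValue.equals(value);
    }

    /** "!=" is the negation of "==". */
    static boolean isNotEqual(Object constraintValue, Object annotValue) {
        return !isEqual(constraintValue, annotValue);
    }
}
```

With this sketch, isEqual(3L, "3") mirrors {Token.length == 3} evaluated on the token "dog", whose length is stored as the string "3".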
Notes on comparison operators: ">", ">=", "<" and "<="
If the constraint's attribute is a string, then the String.compareTo method is called with the annotation's attribute as a parameter (strings can be compared alphabetically). If the constraint's attribute is an integer, then the Long.compareTo method is called. If the constraint's attribute is a float, then the Double.compareTo method is called. The transducer issues a warning if an attempt is made to compare two Boolean values because, in Java 1.4, this type does not implement the Comparable interface and thus has no compareTo method.
The transducer issues a warning when it encounters an annotation's attribute that cannot be compared to the constraint's attribute because the value types are different, or because one value is null. For example, given a constraint {MyAnnot.attrib > 2}, a warning is issued for any MyAnnot in the document for which attrib is not an integer, such as attrib = "dog" because we cannot evaluate "dog" > 2. Similarly, {MyAnnot.attrib > 2} cannot be compared to attrib = 2.5 because 2.5 is a float. In this case, force 2 as a float with {MyAnnot.attrib > 2.0}.
The transducer does not issue a warning when the constraint's attribute is an integer/float and the annotation's attribute is a string but can be parsed as an integer/float. Some ANNIE processing resources (namely the tokeniser) set all attribute values as strings even when they are numbers (Token.length is set to a string value, for example), and because {Token.length < "10"} would lead to an alphabetical comparison, a workaround was needed so we could write {Token.length < 10}.
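The comparison rules can be sketched in the same spirit. This is an illustration under assumptions, not the transducer's implementation: the class ComparisonSketch is hypothetical, an empty result stands in for the warning, and the sketch compares the attribute to the constraint (rather than the reverse) so that the sign reads naturally.

```java
import java.util.OptionalInt;

/** Hypothetical illustration of the comparison rules; not the transducer's code. */
final class ComparisonSketch {

    /**
     * Compares an annotation attribute to a constraint value.
     * Returns the sign of the comparison, or empty when the two values
     * cannot be compared (the case where the transducer issues a warning).
     */
    static OptionalInt compare(Object annotValue, Object constraintValue) {
        Object value = annotValue;
        if (value instanceof String && !(constraintValue instanceof String)) {
            // Attribute stored as a string: try to parse it as a number.
            try {
                if (constraintValue instanceof Long) {
                    value = Long.valueOf((String) value);
                } else if (constraintValue instanceof Double) {
                    value = Double.valueOf((String) value);
                }
            } catch (NumberFormatException e) {
                return OptionalInt.empty();
            }
        }
        if (value instanceof String && constraintValue instanceof String) {
            int c = ((String) value).compareTo((String) constraintValue);
            return OptionalInt.of(Integer.signum(c));
        }
        if (value instanceof Long && constraintValue instanceof Long) {
            return OptionalInt.of(((Long) value).compareTo((Long) constraintValue));
        }
        if (value instanceof Double && constraintValue instanceof Double) {
            return OptionalInt.of(((Double) value).compareTo((Double) constraintValue));
        }
        return OptionalInt.empty(); // null, Boolean or mismatched types
    }

    /** Evaluates {Annot.attrib < constraintValue}; false when not comparable. */
    static boolean lessThan(Object annotValue, Object constraintValue) {
        OptionalInt c = compare(annotValue, constraintValue);
        return c.isPresent() && c.getAsInt() < 0;
    }
}
```

Here lessThan("9", 10L) corresponds to {Token.length < 10} on a nine-character token, whereas an alphabetical comparison of "9" and "10" would give the opposite answer.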
Notes on pattern matching operators: "=~" and "!~"
The "!~" operator is the negation of the "=~" operator, that is to say: {Annot.attribute !~ "value"} is equivalent to {!Annot.attribute =~ "value"}.
When a constraint on an attribute cannot be evaluated because an annotation does not have a value for the attribute, the value defaults to an empty string ("").
The regular expression must be enclosed in double quotes, otherwise the transducer issues a warning:
- {MyAnnot.attrib =~ "[Dd]ogs?"} is correct
- {MyAnnot.attrib =~ 2} is incorrect
To have a match, the regular expression must cover the entire attribute string, not only a part of it. For example:
- {MyAnnot.attrib =~ "do"} does not match "does"
- {MyAnnot.attrib =~ "do.*"} matches "does"
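The whole-string matching rule behaves like java.util.regex.Pattern.matches, which also requires the expression to cover the entire input. The sketch below assumes that engine for illustration only (the guide does not say which one the transducer uses); the class WholeMatchSketch is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of "=~" and "!~"; not the transducer's code. */
final class WholeMatchSketch {

    /** Evaluates {Annot.attrib =~ "regex"}; a missing value defaults to "". */
    static boolean matches(String regex, String attributeValue) {
        String value = (attributeValue == null) ? "" : attributeValue;
        // Pattern.matches succeeds only if the regex covers the whole string.
        return Pattern.matches(regex, value);
    }

    /** "!~" is the negation of "=~". */
    static boolean doesNotMatch(String regex, String attributeValue) {
        return !matches(regex, attributeValue);
    }

    public static void main(String[] args) {
        System.out.println(matches("do", "does"));
        System.out.println(matches("do.*", "does"));
    }
}
```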
Notes on the negation operator: "!"
Bindings: when a constraint contains both negated and regular elements, the negated elements do not affect the bindings of the regular elements. Thus, {Person, !Organization} binds to the same annotations (amongst those that start at the current node in the annotation graph) as {Person}; the difference between the two is that the first will simply not match if one of the annotations starting at the current node is an Organization. On the other hand, when a constraint contains only negated elements such as {!Organization}, it binds to all annotations starting at the current node. It is important to keep that in mind, especially when a rule ends with a constraint with negated elements only: the longest annotation at the current node will be preferred.
Conjunctions of constraints on different types of annotation
The Montreal Transducer allows constraints on different types of annotation. Though the JAPE implementation described in the GATE 2.1 User Guide details an algorithm that would allow such constraints, the ANNIE transducer does not implement it. This transducer does. These examples do not work as expected with the ANNIE transducer but do with this transducer:
- {Person, Organization}
- {Person, Organization, Token.length == "10"}
- {Person, !Organization}
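The matching part of such conjunctions can be pictured as a test over the set of annotation types that start at the current node. The sketch below is a simplification under assumptions: the class ConjunctionSketch is hypothetical, attribute tests such as Token.length == "10" are left out, and bindings are not modelled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of type conjunctions; not the transducer's code. */
final class ConjunctionSketch {

    /** Convenience factory for the set of annotation types at a node. */
    static Set<String> types(String... names) {
        return new HashSet<String>(Arrays.asList(names));
    }

    /**
     * Returns true when every element of the constraint holds at the node.
     * An element written "Type" requires an annotation of that type;
     * an element written "!Type" requires that no such annotation exists.
     */
    static boolean matches(Set<String> typesAtNode, String... constraint) {
        for (String element : constraint) {
            boolean negated = element.startsWith("!");
            String type = negated ? element.substring(1) : element;
            if (typesAtNode.contains(type) == negated) {
                return false; // required type absent, or forbidden type present
            }
        }
        return true;
    }
}
```

For example, matches(types("Person", "Token"), "Person", "!Organization") corresponds to {Person, !Organization} at a node where a Person and a Token annotation start.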
Greedy Kleene operators: "*" and "+"
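The ANNIE transducer does not behave consistently regarding the "*" and "+" Kleene operators. Suppose we have the following rule with 2 bindings:
({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})+:names
Given the text "Honourable Mr. John Atkinson", where "Mr." satisfies both constraints, the bindings are expected to be:
- titles: "Honourable Mr."
- names: "John Atkinson"
The ANNIE transducer, however, may produce:
- titles: "Honourable"
- names: "Mr. John Atkinson"
The Montreal Transducer guarantees the first result because the first "+" is greedy.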
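The greedy resolution of the two labels can be sketched as a search for the longest possible ":titles" prefix. This is only an illustration of the expected outcome, not the finite state machine used by the transducer; the class GreedySplitSketch is hypothetical, and the two arrays (assumed to have the same length) say, for each token of the matched region, whether it satisfies the title constraint and the upperInitial constraint.

```java
/** Hypothetical illustration of greedy label resolution; not the transducer's code. */
final class GreedySplitSketch {

    /**
     * Returns how many leading tokens are bound to ":titles" when the first
     * "+" is greedy, or -1 if the region cannot be split into titles and names.
     */
    static int greedyTitleCount(boolean[] isTitle, boolean[] isUpperInitial) {
        int n = isTitle.length;
        // Longest run of leading tokens that satisfy the title constraint.
        int maxTitles = 0;
        while (maxTitles < n && isTitle[maxTitles]) {
            maxTitles++;
        }
        // Try the longest split first, leaving at least one token for ":names".
        for (int k = Math.min(maxTitles, n - 1); k >= 1; k--) {
            boolean restAreNames = true;
            for (int i = k; i < n; i++) {
                if (!isUpperInitial[i]) {
                    restAreNames = false;
                    break;
                }
            }
            if (restAreNames) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Honourable Mr. John Atkinson": the first two tokens are titles,
        // and all four tokens start with an upper-case letter.
        boolean[] isTitle = {true, true, false, false};
        boolean[] isUpperInitial = {true, true, true, true};
        System.out.println(greedyTitleCount(isTitle, isUpperInitial));
    }
}
```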
8) For developers
Developers will find comments on classes and methods through the javadoc pages: doc/javadoc/index.html. Most of the source code comes from the Jape Transducer in GATE. It was necessary to copy entire packages instead of overriding a few methods because many class attributes and members were not accessible outside the gate.xxx package. The Montreal Transducer needs 4 packages:
a) ca.umontreal.iro.rali.gate.creole
Contains only the MtlTransducer class, which is the module's interface with the outside world. The MtlTransducer class is almost exactly the same as gate.creole.Transducer (the basic Jape Transducer). The code of OntologyAwareTransducer is also included in MtlTransducer. It was impossible to simply extend any of those transducers because some members are private or package-protected.
b) ca.umontreal.iro.rali.gate.fsm
Same as the gate.fsm package. This package models the grammar as a finite state machine. Only the convertComplexPE private method of the FSM class has been substantially modified.
c) ca.umontreal.iro.rali.gate.jape
Almost the same as the gate.jape package. Significant modifications were made to the SinglePhaseTransducer, Constraint and JdmAttribute classes.
d) ca.umontreal.iro.rali.gate.jape.parser
Almost the same as the gate.jape.parser package. Modifications were made to ParseCpsl.jj so that the JAPE language could be extended. This file is to be compiled with javacc. The other classes of the package are automatically generated by javacc.
9) Licence
This work is a modification of some GATE libraries and therefore the binaries and source code are distributed under the same licence as GATE itself. GATE is licenced under the GNU Library General Public License, version 2 of June 1991. That licence is distributed with this module in the file LICENCE.htm. GATE binaries and source code are available at http://gate.ac.uk. Modifications to the original source code are detailed in the header of each file.
Basically, the Montreal Transducer source code and binaries are free. A work that is a modification of it should also be free. However, a work that only USES the Montreal Transducer is exempted from the terms of the licence, provided that the GATE and Montreal Transducer binaries, source code and licence are distributed with the embedding work and that the use of this software is acknowledged. For additional help on the interpretation of the GATE licence, see http://www.gate.ac.uk/gate/doc/index.html.
10) Change log
1.2:
- Updated documentation to address GATE 3.0 plugin management.
1.1:
- Bug fixed: a constraint with multiple negated tests on the same attribute
of a given annotation type would match when at least one test succeeds,
but it should match only when ALL negated tests succeed.
1.0:
- Initial release.