Developing Language Processing Components with GATE
Version 5 (a User Guide)
  For GATE version 5.0
  (built May 28, 2009)


  Hamish Cunningham
  Diana Maynard
  Kalina Bontcheva
  Valentin Tablan
  Cristian Ursu
  Marin Dimitrov
  Mike Dowman
  Niraj Aswani
  Ian Roberts
  Yaoyong Li
  Andrey Shafirin
  Adam Funk

  ©The University of Sheffield 2001-2009

  http://gate.ac.uk/


Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, MUSING, KnowledgeWeb, PrestoSpace, h-TechSight, enIRaF).

Contents

1 Introduction
 1.1 How to Use This Text
 1.2 Context
 1.3 Overview
  1.3.1 Developing and Deploying Language Processing Facilities
  1.3.2 Built-in Components
  1.3.3 Additional Facilities
  1.3.4 An Example
 1.4 Structure of the Book
 1.5 Further Reading
2 Change Log
 2.1 Version 5.0 (May 2009)
  2.1.1 Major new features
  2.1.2 Other new features and improvements
  2.1.3 Specific bug fixes
 2.2 Version 4.0 (July 2007)
  2.2.1 Major new features
  2.2.2 Other new features and improvements
  2.2.3 Bug fixes and optimizations
 2.3 Version 3.1 (April 2006)
  2.3.1 Major new features
  2.3.2 Other new features and improvements
  2.3.3 Bug fixes
 2.4 January 2005
 2.5 December 2004
 2.6 September 2004
 2.7 Version 3 Beta 1 (August 2004)
 2.8 July 2004
 2.9 June 2004
 2.10 April 2004
 2.11 March 2004
 2.12 Version 2.2 – August 2003
 2.13 Version 2.1 – February 2003
 2.14 June 2002
3 How To…
 3.1 Download GATE*
 3.2 Install and Run GATE*
  3.2.1 The Easy Way
  3.2.2 The Hard Way (1)
  3.2.3 The Hard Way (2): Subversion
 3.3 [D,F] Use System Properties with GATE
 3.4 [D,F] Use (CREOLE) Plug-ins
 3.5 Troubleshooting
 3.6 [D] Get Started with the GUI*
 3.7 [D,F] Configure GATE
  3.7.1 [F] Save Config Data to gate.xml
 3.8 Build GATE
 3.9 [D] Use GATE with Maven or JPF
 3.10 [D,F] Create a New (CREOLE) Resource
 3.11 [F] Instantiate (CREOLE) Resources
 3.12 [D] Load Resources: document, tokenizer...*
  3.12.1 Loading Language Resources: document, corpora...
  3.12.2 Loading Processing Resources: tokenizer, gazetteer...
  3.12.3 Loading and Processing Large Corpora
 3.13 [D,F] Configure (CREOLE) Resources
 3.14 [D] Create and Run an Application*
 3.15 [D] Run PRs Conditionally on Document Features
 3.16 [D] View Annotations*
 3.17 [D] Do Information Extraction with ANNIE*
 3.18 [D] Modify ANNIE
 3.19 [D] Create and Edit Annotations*
  3.19.1 Schema-driven editing
 3.20 [D] Saving annotations*
 3.21 [D,F] Create a New Annotation Schema
 3.22 [D] Save and Restore LRs in Data Stores
 3.23 [D] Save Resource Parameter State to File
 3.24 [D] Save an application with its resources (e.g. GATE Teamware)
 3.25 [D,F] Perform Evaluation with the AnnotationDiff tool
 3.26 [D] Use the Corpus Benchmark Evaluation tool
  3.26.1 GUI mode
  3.26.2 How to define the properties of the benchmark tool
 3.27 [D] Write JAPE Grammars
 3.28 [F] Embed NLE in other Applications
 3.29 [F] Use GATE within a Spring application
 3.30 [F] Use GATE within a Tomcat Web Application
  3.30.1 Recommended Directory Structure
  3.30.2 Configuration files
  3.30.3 Initialization code
 3.31 [F] Use GATE in a Multithreaded Environment
 3.32 [D,F] Add support for a new document format
 3.33 [D] Dump Results to File
 3.34 [D] Stop GUI ‘Freezing’ on Linux
 3.35 [D] Stop GUI Crashing on Linux
 3.36 [D] Stop GATE Restoring GUI Sessions/Options
 3.37 Work with Unicode
 3.38 Work with Oracle and PostgreSQL
 3.39 Annotate using ontologies
4 CREOLE: the GATE Component Model
 4.1 The Web and CREOLE
 4.2 Java Beans: a Simple Component Architecture
 4.3 The GATE Framework
 4.4 Language Resources and Processing Resources
 4.5 The Lifecycle of a CREOLE Resource
 4.6 Processing Resources and Applications
 4.7 Language Resources and Datastores
 4.8 Built-in CREOLE Resources
 4.9 CREOLE Resource Configuration
  4.9.1 Configuration with XML
  4.9.2 Configuring resources using annotations
  4.9.3 Mixing the configuration styles
5 Visual CREOLE
 5.1 Gazetteer Visual Resource - GAZE
  5.1.1 Running Modes
  5.1.2 Loading a Gazetteer
  5.1.3 Linear Definition Pane
  5.1.4 Linear Definition Toolbar
  5.1.5 Operations on Linear Definition Nodes
  5.1.6 Gazetteer List Pane
  5.1.7 Mapping Definition Pane
 5.2 Ontogazetteer
  5.2.1 Gazetteer Lists Editor and Mapper
  5.2.2 Ontogazetteer Editor
 5.3 The Document Editor
  5.3.1 The Annotation Sets View
  5.3.2 The Annotations List View
  5.3.3 The Co-reference Editor
6 Language Resources: Corpora, Documents and Annotations
 6.1 Features: Simple Attribute/Value Data
 6.2 Corpora: Sets of Documents plus Features
 6.3 Documents: Content plus Annotations plus Features
 6.4 Annotations: Directed Acyclic Graphs
  6.4.1 Annotation Schemas
  6.4.2 Examples of Annotated Documents
  6.4.3 Creating, Viewing and Editing Diverse Annotation Types
 6.5 Document Formats
  6.5.1 Detecting the right reader
  6.5.2 XML
  6.5.3 HTML
  6.5.4 SGML
  6.5.5 Plain text
  6.5.6 RTF
  6.5.7 Email
 6.6 XML Input/Output
7 JAPE: Regular Expressions Over Annotations
 7.1 Matching operators in detail
  7.1.1 Equality operators (“==” and “!=”)
  7.1.2 Comparison operators (“<”, “<=”, “>=” and “>”)
  7.1.3 Regular expression operators (“=~”, “==~”, “!~” and “!=~”)
  7.1.4 Contextual operators (“contains” and “within”)
 7.2 Use of Context
 7.3 Use of Priority
 7.4 Use of negation
 7.5 Useful tricks
 7.6 Ontology aware grammar transduction
 7.7 Using Java code in JAPE rules
  7.7.1 Adding a feature to the document
  7.7.2 Using named blocks
  7.7.3 Java RHS overview
 7.8 Optimising for speed
 7.9 Serializing JAPE Transducer
  7.9.1 How to serialize?
  7.9.2 How to use the serialized grammar file?
 7.10 The JAPE Debugger
  7.10.1 Debugger GUI
  7.10.2 Using the Debugger
  7.10.3 Known Bugs
 7.11 Notes for Montreal Transducer users
8 ANNIE: a Nearly-New Information Extraction System
 8.1 Tokeniser
  8.1.1 Tokeniser Rules
  8.1.2 Token Types
  8.1.3 English Tokeniser
 8.2 Gazetteer
 8.3 Sentence Splitter
 8.4 RegEx Sentence Splitter
 8.5 Part of Speech Tagger
 8.6 Semantic Tagger
 8.7 Orthographic Coreference (OrthoMatcher)
  8.7.1 GATE Interface
  8.7.2 Resources
  8.7.3 Processing
 8.8 Pronominal Coreference
  8.8.1 Quoted Speech Submodule
  8.8.2 Pleonastic It submodule
  8.8.3 Pronominal Resolution Submodule
  8.8.4 Detailed description of the algorithm
 8.9 A Walk-Through Example
  8.9.1 Step 1 - Tokenisation
  8.9.2 Step 2 - List Lookup
  8.9.3 Step 3 - Grammar Rules
9 (More CREOLE) Plugins
 9.1 Document Reset
 9.2 Verb Group Chunker
 9.3 Noun Phrase Chunker
  9.3.1 Differences from the Original
  9.3.2 Using the Chunker
 9.4 OntoText Gazetteer
  9.4.1 Prerequisites
  9.4.2 Setup
 9.5 Flexible Gazetteer
 9.6 Gazetteer List Collector
 9.7 Tree Tagger
  9.7.1 POS tags
 9.8 Stemmer
  9.8.1 Algorithms
 9.9 GATE Morphological Analyzer
  9.9.1 Rule File
 9.10 MiniPar Parser
  9.10.1 Platform Supported
  9.10.2 Resources
  9.10.3 Parameters
  9.10.4 Prerequisites
  9.10.5 Grammatical Relationships
 9.11 RASP Parser
 9.12 SUPPLE Parser (formerly BuChart)
  9.12.1 Requirements
  9.12.2 Building SUPPLE
  9.12.3 Running the parser in GATE
  9.12.4 Viewing the parse tree
  9.12.5 System properties
  9.12.6 Configuration files
  9.12.7 Parser and Grammar
  9.12.8 Mapping Named Entities
  9.12.9 Upgrading from BuChart to SUPPLE
 9.13 Stanford Parser
  9.13.1 Input requirements
  9.13.2 Initialization parameters
  9.13.3 Runtime parameters
 9.14 Montreal Transducer
  9.14.1 Main Improvements
  9.14.2 Main Bug fixes
 9.15 Language Plugins
  9.15.1 French Plugin
  9.15.2 German Plugin
  9.15.3 Romanian Plugin
  9.15.4 Arabic Plugin
  9.15.5 Chinese Plugin
  9.15.6 Hindi Plugin
 9.16 Chemistry Tagger
  9.16.1 Using the tagger
 9.17 Flexible Exporter
 9.18 Annotation Set Transfer
 9.19 Information Retrieval in GATE
  9.19.1 Using the IR functionality in GATE
  9.19.2 Using the IR API
 9.20 Crawler
  9.20.1 Using the Crawler PR
 9.21 Google Plugin
  9.21.1 Using the GooglePR
 9.22 Yahoo Plugin
  9.22.1 Using the YahooPR
 9.23 WordNet in GATE
  9.23.1 The WordNet API
 9.24 Machine Learning in GATE
  9.24.1 ML Generalities
  9.24.2 The Machine Learning PR in GATE
  9.24.3 The WEKA Wrapper
  9.24.4 Training an ML model with the ML PR and WEKA wrapper
  9.24.5 Applying a learnt model
  9.24.6 The MAXENT Wrapper
  9.24.7 The SVM Light Wrapper
 9.25 MinorThird
 9.26 MIAKT NLG Lexicon
  9.26.1 Complexity and Generality
 9.27 Kea - Automatic Keyphrase Detection
  9.27.1 Using the “KEA Keyphrase Extractor” PR
  9.27.2 Using Kea corpora
 9.28 Ontotext JapeC Compiler
 9.29 ANNIC
  9.29.1 Instantiating SSD
  9.29.2 Search GUI
  9.29.3 Using SSD from your code
 9.30 Annotation Merging
  9.30.1 Two implemented methods
  9.30.2 Annotation Merging Plugin
 9.31 OntoRoot Gazetteer
  9.31.1 How does it work?
  9.31.2 Initialisation of OntoRoot Gazetteer
 9.32 Chinese Word Segmentation
 9.33 Copying Annotations Between Documents
10 Working with Ontologies
 10.1 Data Model for Ontologies
  10.1.1 Hierarchies of classes and restrictions
  10.1.2 Instances
  10.1.3 Hierarchies of properties
 10.2 Ontology Event Model (new in Gate 4)
  10.2.1 What happens when a resource is deleted?
 10.3 OWLIM Ontology LR
 10.4 GATE’s Ontology Editor
 10.5 Instantiating OWLIM Ontology using GATE API
 10.6 Ontology-Aware JAPE Transducer
 10.7 Annotating text with Ontological Information
 10.8 Populating Ontologies
 10.9 Ontology Annotation Tool
  10.9.1 Viewing Annotated Texts
  10.9.2 Editing Existing Annotations
  10.9.3 Adding New Annotations
  10.9.4 Options
11 Machine Learning API
 11.1 ML Generalities
  11.1.1 Some definitions
  11.1.2 GATE-specific interpretation of the above definitions
 11.2 The Batch Learning PR in GATE
  11.2.1 The settings not specified in the configuration file
  11.2.2 All the settings in the XML configuration file
 11.3 Examples of configuration file for the three learning types
 11.4 How to use the ML API
 11.5 The outputs of the ML API
  11.5.1 Training results
  11.5.2 Application results
  11.5.3 Evaluation results
  11.5.4 Feature files
12 Tools for Alignment Tasks
 12.1 Introduction
 12.2 Tools for Alignment Tasks
  12.2.1 Compound Document
  12.2.2 Compound Document Editor
  12.2.3 Composite Document
  12.2.4 DeleteMembersPR
  12.2.5 SwitchMembersPR
  12.2.6 Saving as XML
  12.2.7 Alignment Editor
13 Performance Evaluation of Language Analysers
 13.1 The AnnotationDiff Tool
 13.2 The six annotation relations explained
 13.3 Benchmarking tool
 13.4 Metrics for Evaluation in Information Extraction
 13.5 Metrics for Evaluation of Inter-Annotator Agreement
 13.6 A Plugin Computing Inter-Annotator Agreement (IAA)
  13.6.1 IAA for Classification Task
  13.6.2 IAA For Named Entity Annotation
  13.6.3 The BDM Based IAA Scores
 13.7 A Plugin Computing the BDM Scores for an Ontology
14 Users, Groups, and LR Access Rights
 14.1 Java serialisation and LR access rights
 14.2 Oracle Datastore and LR access rights
  14.2.1 Users, Groups, Sessions and Access Modes
  14.2.2 User/Group Administration
  14.2.3 The API
15 Developing GATE
 15.1 Creating new plugins
  15.1.1 Where to keep plugins in the GATE hierarchy
  15.1.2 Writing a new PR
  15.1.3 Writing a new VR
  15.1.4 Adding plugins to the nightly build
 15.2 Updating this User Guide
  15.2.1 Building the User Guide
  15.2.2 Making changes to the User Guide
16 Combining GATE and UIMA
 16.1 Embedding a UIMA TAE in GATE
  16.1.1 Mapping File Format
  16.1.2 The UIMA component descriptor
  16.1.3 Using the AnalysisEnginePR
  16.1.4 Current limitations
 16.2 Embedding a GATE CorpusController in UIMA
  16.2.1 Mapping file format
  16.2.2 The GATE application definition
  16.2.3 Configuring the GATEApplicationAnnotator
Appendices
A Design Notes
 A.1 Patterns
  A.1.1 Components
  A.1.2 Model, view, controller
  A.1.3 Interfaces
 A.2 Exception Handling
B JAPE: Implementation
 B.1 Formal Description of the JAPE Grammar
 B.2 Relation to CPSL
 B.3 Algorithms for JAPE Rule Application
  B.3.1 The first algorithm
  B.3.2 Algorithm 2
 B.4 Label Binding Scheme
 B.5 Classes
 B.6 Implementation
  B.6.1 A Walk-Through
  B.6.2 Example RHS code
 B.7 Compilation
 B.8 Using a Different Java Compiler
C Ant Tasks for GATE
 C.1 Declaring the Tasks
 C.2 The packagegapp task - bundling an application with its dependencies
  C.2.1 Introduction
  C.2.2 Basic Usage
  C.2.3 Handling non-plugin resources
  C.2.4 Streamlining your plugins
  C.2.5 Bundling extra resources
 C.3 The expandcreoles task - merging annotation-driven config into creole.xml
D Named-Entity State Machine Patterns
 D.1 Main.jape
 D.2 first.jape
 D.3 firstname.jape
 D.4 name.jape
  D.4.1 Person
  D.4.2 Location
  D.4.3 Organization
  D.4.4 Ambiguities
  D.4.5 Contextual information
 D.5 name_post.jape
 D.6 date_pre.jape
 D.7 date.jape
 D.8 reldate.jape
 D.9 number.jape
 D.10 address.jape
 D.11 url.jape
 D.12 identifier.jape
 D.13 jobtitle.jape
 D.14 final.jape
 D.15 unknown.jape
 D.16 name_context.jape
 D.17 org_context.jape
 D.18 loc_context.jape
 D.19 clean.jape
E Part-of-Speech Tags used in the Hepple Tagger
F Sample ML Configuration File
G IAA Measures for Classification Tasks
H Keyboard shortcuts for GATE
References

Chapter 1
Introduction [#]

Software documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing. (Anonymous.)

There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies; the other way is to make it so complicated that there are no obvious deficiencies. (C.A.R. Hoare)

A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. (The Structure and Interpretation of Computer Programs, H. Abelson, G. Sussman and J. Sussman, 1985.)

If you try to make something beautiful, it is often ugly. If you try to make something useful, it is often beautiful. (Oscar Wilde)1

GATE is an infrastructure for developing and deploying software components that process human language. GATE helps scientists and developers in three ways:

  1. by specifying an architecture, or organisational structure, for language processing software;
  2. by providing a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications;
  3. by providing a development environment built on top of the framework made up of convenient graphical tools for developing components.

The architecture exploits component-based software development, object orientation and mobile code. The framework and development environment are written in Java and available as open-source free software under the GNU library (or lesser) licence2. GATE uses Unicode throughout [Unicode Consortium 96, Tablan et al. 02], and has been tested on a variety of Slavic, Germanic, Romance, and Indic languages [Maynard et al. 01, Gambäck & Olsson 00, McEnery et al. 00].

From a scientific point of view, GATE’s contribution is to enable quantitative measurement of the accuracy and repeatability of results, for verification purposes.

GATE has been in development at the University of Sheffield since 1995 and has been used in a wide variety of research and development projects [Maynard et al. 00]. Version 1 of GATE was released in 1996, was licensed by several hundred organisations, and used in a wide range of language analysis contexts including Information Extraction ([Cunningham 99b, Appelt 99, Gaizauskas & Wilks 98, Cowie & Lehnert 96]) in English, Greek, Spanish, Swedish, German, Italian, French, Bulgarian, Russian, and a number of other languages. The current version of the system is available from http://gate.ac.uk/download/.

This book describes how to use GATE to develop language processing components, test their performance and deploy them as parts of other applications. In the rest of this chapter:

Note: if you don’t see the component you need in this document, or if we mention a component that you can’t see in the software, contact gate-users@lists.sourceforge.net3 – various components are developed by our collaborators, who we will be happy to put you in contact with. (Often the process of getting a new component is as simple as typing the URL into GATE; the system will do the rest.)

1.1 How to Use This Text [#]

It is a good idea to read all of this introduction (you can skip sections 1.2 and 1.5 if pressed); then you can either continue wading through the whole thing or just use chapter 3 as a reference and dip into other chapters for more detail as necessary. Chapter 3 gives instructions for completing common tasks with GATE, organised in a FAQ style: details, and the reasoning behind the various aspects of the system, are omitted in this chapter, so where more information is needed refer to later chapters.

The structure of the book as a whole is detailed in section 1.4 below.

1.2 Context [#]

GATE can be thought of as a Software Architecture for Language Engineering [Cunningham 00].

‘Software Architecture’ is used rather loosely here to mean computer infrastructure for software development, including development environments and frameworks, as well as the more usual use of the term to denote a macro-level organisational structure for software systems [Shaw & Garlan 96].

Language Engineering (LE) may be defined as:

…the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice. [Cunningham 99a]

The relevant scientific results in this case are the outputs of Computational Linguistics, Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines, LE, as an engineering discipline, entails predictability, both of the process of constructing LE-based software and of the performance of that software after its completion and deployment in applications.

Some working definitions:

  1. Computational Linguistics (CL): science of language that uses computation as an investigative tool.
  2. Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.
  3. Language Engineering (LE): building NLP systems whose cost and outputs are measurable and predictable.
  4. Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to mean infrastructure.
  5. Software Architecture for Language Engineering (SALE): software infrastructure, architecture and development tools for applied CL, NLP and LE.

(Of course the practice of these fields is broader and more complex than these definitions.)

In the scientific endeavours of NLP and CL, GATE’s role is to support experimentation. In this context GATE’s significant features include support for automated measurement (see section 13), providing a ‘level playing field’ where results can easily be repeated across different sites and environments, and reducing research overheads in various ways.

1.3 Overview [#]

1.3.1 Developing and Deploying Language Processing Facilities [#]

GATE as an architecture suggests that the elements of software systems that process natural language can usefully be broken down into various types of component, known as resources4. Components are reusable software chunks with well-defined interfaces, and are a popular architectural form, used in Sun’s Java Beans and Microsoft’s .Net, for example. GATE components are specialised types of Java Bean, and come in three flavours:

  1. LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;
  2. ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;
  3. VisualResources (VRs) represent visualisation and editing components that participate in GUIs.

These definitions can be blurred in practice as necessary.

Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering. All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data. The JAR and XML files are made available to GATE by putting them on a web server, or simply placing them in the local file space. Section 1.3.2 introduces GATE’s built-in resource set.

When using GATE to develop language processing functionality for an application, the developer uses the development environment and the framework to construct resources of the three types. This may involve programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both. The development environment is used for visualisation of the data structures produced and consumed during processing, and for debugging, performance measurement and so on. For example, figure 1.1 is a screenshot of one of the visualisation tools, displaying named-entity extraction results for a Hindi sentence.


Figure 1.1: One of GATE’s visual resources

The GATE development environment is analogous to systems like Mathematica for Mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.

When an appropriate set of resources have been developed, they can then be embedded in the target client application using the GATE framework. The framework is supplied as two JAR files.5 To embed GATE-based language processing facilities in an application, these JAR files are all that is needed, along with JAR files and XML configuration files for the various resources that make up the new facilities.

1.3.2 Built-in Components [#]

GATE includes resources for common LE data structures and algorithms, including documents, corpora and various annotation types, a set of language analysis components for Information Extraction and a range of data visualisation and editing components.

GATE supports documents in a variety of formats including XML, RTF, email, HTML, SGML and plain text. In all cases the format is analysed and converted into a single unified model of annotation. The annotation format is a modified form of the TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format [Bird & Liberman 99], and uses the now standard mechanism of ‘stand-off markup’. GATE documents, corpora and annotations are stored in databases of various sorts, visualised via the development environment, and accessed at code level via the framework. See chapter 6 for more details of corpora etc.

A family of Processing Resources for language analysis is included in the shape of ANNIE, A Nearly-New Information Extraction system. These components use finite state techniques to implement various tasks from tokenisation to semantic tagging or verb phrase chunking. All ANNIE components communicate exclusively via GATE’s document and annotation resources. See chapter 8 for more details. See chapter 5 for visual resources. See chapter 9 for other miscellaneous CREOLE resources.

1.3.3 Additional Facilities [#]

Three other facilities in GATE deserve special mention:

And by version 4 it will make a mean cup of tea.

1.3.4 An Example [#]

This section gives a very brief example of a typical use of GATE to develop and deploy language processing capabilities in an application, and to generate quantitative results for scientific publication.

Let’s imagine that a developer called Fatima is building an email client7 for Cyberdyne Systems’ large corporate Intranet. In this application she would like to have a language processing system that automatically spots the names of people in the corporation and transforms them into mailto hyperlinks.

A little investigation shows that GATE’s existing components can be tailored to this purpose. Fatima starts up the development environment, and creates a new document containing some example emails. She then loads some processing resources that will do named-entity recognition (a tokeniser, gazetteer and semantic tagger), and creates an application to run these components on the document in sequence. Having processed the emails, she can see the results in one of several viewers for annotations.

The GATE components are a decent start, but they need to be altered to deal specially with people from Cyberdyne’s personnel database. Therefore Fatima creates new “cyber-” versions of the gazetteer and semantic tagger resources, using the “bootstrap” tool. This tool creates a directory structure on disk that has some Java stub code, a Makefile and an XML configuration file. After several hours struggling with badly written documentation, Fatima manages to compile the stubs and create a JAR file containing the new resources. She tells GATE the URL of these files8, and the system then allows her to load them in the same way that she loaded the built-in resources earlier on.

Fatima then creates a second copy of the email document, and uses the annotation editing facilities to mark up the results that she would like to see her system producing. She saves this and the version that she ran GATE on into her Oracle datastore (set up for her by the Herculean efforts of the Cyberdyne technical support team, who like GATE because it enables them to claim lots of overtime). From now on she can follow this routine:

  1. Run her application on the email test corpus.
  2. Check the performance of the system by running the ‘annotation diff’ tool to compare her manual results with the system’s results. This gives her both percentage accuracy figures and a graphical display of the differences between the machine and human outputs.
  3. Make edits to the code, pattern grammars or gazetteer lists in her resources, and recompile where necessary.
  4. Tell GATE to re-initialise the resources.
  5. Go to 1.

To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that it regenerates itself from the local personnel data. She then alters the pattern grammar in the semantic tagger to prioritise recognition of names from that source. This latter job involves learning the JAPE language (see chapter 7), but as this is based on regular expressions it isn’t too difficult.

Eventually the system is running nicely, and her accuracy is 93% (there are still some problem cases, e.g. when people use nicknames, but the performance is good enough for production use). Now Fatima stops using the GATE development environment and works instead on embedding the new components in her email application. This application is written in Java, so embedding is very easy9: the two GATE JAR files are added to the project CLASSPATH, the new components are placed on a web server, and with a little code to do initialisation, loading of components and so on, the job is finished in half a day – the code to talk to GATE takes up only around 150 lines of the eventual application, most of which is just copied from the example in the sheffield.examples.StandAloneAnnie class.

Because Fatima is worried about Cyberdyne’s unethical policy of developing Skynet to help the large corporates of the West strengthen their strangle-hold over the World, she wants to get a job as an academic instead (so that her conscience will only have to cope with the torture of students, as opposed to humanity). She takes the accuracy measures that she has attained for her system and writes a paper for the Journal of Nasturtium Logarithm Encitement describing the approach used and the results obtained. Because she used GATE for development, she can cite the repeatability of her experiments and offer access to example binary versions of her software by putting them on an external web server.

And everybody lived happily ever after.

1.4 Structure of the Book [#]

The material presented in this book ranges from the conceptual (e.g. ‘what is software architecture?’) to practical instructions for programmers (e.g. how to deal with GATE exceptions) and linguists (e.g. how to write a pattern grammar). This diversity is something of an organisational challenge. Our (no doubt imperfect) solution is to collect specific instructions for ‘how to do X’ in a separate chapter (3). Other chapters give a more discursive presentation. In order to understand the whole system you must, unfortunately, read much of the book; in order to get help with a particular task, however, look first in chapter 3 and refer to other material as necessary.

The other chapters:

Chapter 4 describes the GATE architecture’s component-based model of language processing, describes the lifecycle of GATE components, and how they can be grouped into applications and stored in databases and files.

Chapter 5 describes the set of Visual Resources that are bundled with GATE.

Chapter 6 describes GATE’s model of document formats, annotated documents, annotation types, and corpora (sets of documents). It also covers GATE’s facilities for reading and writing in the XML data interchange language.

Chapter 7 describes JAPE, a pattern/action rule language based on regular expressions over annotations on documents. JAPE grammars compile into cascaded finite state transducers.

Chapter 8 describes ANNIE, a pipelined Information Extraction system which is supplied with GATE.

Chapter 9 describes CREOLE resources bundled with the system that don’t fit into the previous categories.

Chapter 10 describes processing resources and language resources for working with ontologies.

Chapter 11 describes a machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning.

Chapter 13 describes how to measure the performance of language analysis components.

Chapter 14 describes the data store security model.

Appendix A discusses the design of the system.

Appendix B describes the implementation details and formal definitions of the JAPE annotation patterns language.

Appendix D describes in some detail the JAPE pattern grammars that are used in ANNIE for named-entity recognition.

1.5 Further Reading [#]

Lots of documentation lives on the GATE web server, including:

For more details about Sheffield University’s work in human language processing see the NLP group pages or A Definition and Short History of Language Engineering ([Cunningham 99a]). For more details about Information Extraction see IE, a User Guide or the GATE IE pages.

A list of publications on GATE and projects that use it (some of which are available on-line):

[Cunningham 05]
is an overview of the field of Information Extraction for the 2nd Edition of the Encyclopaedia of Language and Linguistics.
[Cunningham & Bontcheva 05]
is an overview of the field of Software Architecture for Language Engineering for the 2nd Edition of the Encyclopaedia of Language and Linguistics.
[Li et al. 04]
(Machine Learning Workshop 2004) describes an SVM-based learning algorithm for IE using GATE.
[Wood et al. 04]
(NLDB 2004) looks at ontology-based IE from parallel texts.
[Cunningham & Scott 04b]
(JNLE) is a collection of papers covering many important areas of Software Architecture for Language Engineering.
[Cunningham & Scott 04a]
(JNLE) is the introduction to the above collection.
[Bontcheva 04]
(LREC 2004) describes lexical and ontological resources in GATE used for Natural Language Generation.
[Bontcheva et al. 04]
(JNLE) discusses developments in GATE in the early naughties.
[Maynard et al. 04a]
(LREC 2004) presents algorithms for the automatic induction of gazetteer lists from multi-language data.
[Maynard et al. 04c]
(AIMSA 2004) presents automatic creation and monitoring of semantic metadata in a dynamic knowledge portal.
[Maynard et al. 04b]
(ESWS 2004) discusses ontology-based IE in the hTechSight project.
[Dimitrov et al. 04]
(Anaphora Processing) gives a lightweight method for named entity coreference resolution.
[Kiryakov 03]
(Technical Report) discusses semantic web technology in the context of multimedia indexing and search.
[Tablan et al. 03]
(HLT-NAACL 2003) presents the OLLIE on-line learning for IE system.
[Wood et al. 03]
(Recent Advances in Natural Language Processing 2003) discusses using parallel texts to improve IE recall.
[Maynard et al. 03a]
(Recent Advances in Natural Language Processing 2003) looks at semantics and named-entity extraction.
[Maynard et al. 03b]
(ACL Workshop 2003) describes NE extraction without training data on a language you don’t speak (!).
[Maynard et al. ]
(EACL 2003) looks at the distinction between information and content extraction.
[Manov et al. 03]
(HLT-NAACL 2003) describes experiments with geographic knowledge for IE.
[Saggion et al. 03a]
(EACL 2003) discusses robust, generic and query-based summarisation.
[Saggion et al. 03c]
(EACL 2003) discusses event co-reference in the MUMIS project.
[Saggion et al. 03b]
(Data and Knowledge Engineering) discusses multimedia indexing and search from multisource multilingual data.
[Cunningham et al. 03]
(Corpus Linguistics 2003) describes GATE as a tool for collaborative corpus annotation.
[Bontcheva et al. 03]
(NLPXML-2003) looks at GATE for the semantic web.
[Dimitrov 02a, Dimitrov et al. 02]
(DAARC 2002, MSc thesis) discuss lightweight coreference methods.
[Lal 02]
(Master Thesis) looks at text summarisation using GATE.
[Lal & Ruger 02]
(ACL 2002) looks at text summarisation using GATE.
[Cunningham et al. 02]
(ACL 2002) describes the GATE framework and graphical development environment as a tool for robust NLP applications.
[Bontcheva et al. 02b]
(NLIS 2002) discusses how GATE can be used to create HLT modules for use in information systems.
[Tablan et al. 02]
(LREC 2002) describes GATE’s enhanced Unicode support.
[Maynard et al. 02a]
(ACL 2002 Summarisation Workshop) describes using GATE to build a portable IE-based summarisation system in the domain of health and safety.
[Maynard et al. 02c]
(Nordic Language Technology) describes various Named Entity recognition projects developed at Sheffield using GATE.
[Maynard et al. 02b]
(AIMSA 2002) describes the adaptation of the core ANNIE modules within GATE to the ACE (Automatic Content Extraction) tasks.
[Maynard et al. 02d]
(JNLE) describes robustness and predictability in LE systems, and presents GATE as an example of a system which contributes to robustness and to low overhead systems development.
[Bontcheva et al. 02c], [Dimitrov 02a] and [Dimitrov 02b]
(TALN 2002, DAARC 2002, MSc thesis) describe the shallow named entity coreference modules in GATE: the orthomatcher, which resolves orthographic co-references, and the pronoun resolution module.
[Bontcheva et al. 02a]
(ACl 2002 Workshop) describes how GATE can be used as an environment for teaching NLP, with examples of and ideas for future student projects developed within GATE.
[Pastra et al. 02]
(LREC 2002) discusses the feasibility of grammar reuse in applications using ANNIE modules.
[Baker et al. 02]
(LREC 2002) report results from the EMILLE Indic languages corpus collection and processing project.
[Saggion et al. 02b] and [Saggion et al. 02a]
(LREC 2002, SPLPT 2002) describes how ANNIE modules have been adapted to extract information for indexing multimedia material.
[Maynard et al. 01]
(RANLP 2001) discusses a project using ANNIE for named-entity recognition across wide varieties of text type and genre.
[Cunningham 00]
(PhD thesis) defines the field of Software Architecture for Language Engineering, reviews previous work in the area, presents a requirements analysis for such systems (which was used as the basis for designing GATE versions 2 and 3), and evaluates the strengths and weaknesses of GATE version 1.
[Cunningham 02]
(Computers and the Humanities) describes the philosophy and motivation behind the system, describes GATE version 1 and how well it lived up to its design brief.
[McEnery et al. 00]
(Vivek) presents the EMILLE project in the context of which GATE’s Unicode support for Indic languages has been developed.
[Cunningham et al. 00d] and [Cunningham 99c]
(technical reports) document early versions of JAPE (superseded by the present document).
[Cunningham et al. 00a], [Cunningham et al. 98a] and [Peters et al. 98]
(OntoLex 2000, LREC 1998) presents GATE’s model of Language Resources, their access and distribution.
[Maynard et al. 00]
(technical report) surveys users of GATE up to mid-2000.
[Cunningham et al. 00c] and [Cunningham et al. 99]
(COLING 2000, AISB 1999) summarise experiences with GATE version 1.
[Cunningham et al. 00b]
(LREC 2000) taxonomises Language Engineering components and discusses the requirements analysis for GATE version 2.
[Bontcheva et al. 00] and [Brugman et al. 99]
(COLING 2000, technical report) describe a prototype of GATE version 2 that integrated with the EUDICO multimedia markup tool from the Max Planck Institute.
[Gambäck & Olsson 00]
(LREC 2000) discusses experiences in the Svensk project, which used GATE version 1 to develop a reusable toolbox of Swedish language processing components.
[Cunningham 99a]
(JNLE) reviewed and synthesised definitions of Language Engineering.
[Stevenson et al. 98] and [Cunningham et al. 98b]
(ECAI 1998, NeMLaP 1998) report work on implementing a word sense tagger in GATE version 1.
[Cunningham et al. 97b]
(ANLP 1997) presents motivation for GATE and GATE-like infrastructural systems for Language Engineering.
[Gaizauskas et al. 96b, Cunningham et al. 97a, Cunningham et al. 96e]
(ICTAI 1996, TIPSTER 1997, NeMLaP 1996) report work on GATE version 1.
[Cunningham et al. 96c, Cunningham et al. 96d, Cunningham et al. 95]
(COLING 1996, AISB Workshop 1996, technical report) report early work on GATE version 1.
[Cunningham et al. 96b]
(TIPSTER) discusses a selection of projects in Sheffield using GATE version 1 and the TIPSTER architecture it implemented.
[Cunningham et al. 96a]
(manual) was the guide to developing CREOLE components for GATE version 1.
[Gaizauskas et al. 96a]
(manual) was the user guide for GATE version 1.
[Humphreys et al. 96]
(manual) describes the language processing components distributed with GATE version 1.
[Cunningham 94, Cunningham et al. 94]
(NeMLaP 1994, technical report) argue that software engineering issues such as reuse, and framework construction, are important for language processing R&D.
[Dowman et al. 05b]
(World Wide Web Conference paper) describes how the Web is used to assist the annotation and indexing of broadcast news.
[Dowman et al. 05a]
(Euro Interactive Television Conference paper) describes a system which can use material from the Internet to augment television news broadcasts.
[Dowman et al. 05c]
(Second European Semantic Web Conference paper) describes a system that semantically annotates television news broadcasts, using news websites as a resource to aid the annotation process.
[Li et al. 05a]
(Proceedings of Sheffield Machine Learning Workshop) describes an SVM-based IE system which uses the SVM with uneven margins as its learning component and GATE as its NLP processing module.
[Li et al. 05b]
(Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)) applies the uneven-margins versions of two popular learning algorithms, SVM and Perceptron, to IE, in order to deal with the imbalanced classification problems that arise in IE.
[Li et al. 05c]
(Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing (Sighan-05)) uses Perceptron learning, a simple, fast and effective learning algorithm, for Chinese word segmentation.
[Aswani et al. 05]
(Proceedings of the Fifth International Conference on Recent Advances in Natural Language Processing (RANLP2005)) describes a full-featured annotation indexing and search engine developed as part of GATE. It is powered by Apache Lucene and indexes the variety of document formats supported by GATE.
[Wang et al. 05]
(Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005)) describes extracting a domain ontology from linguistic resources based on relatedness measurements.
[Ursu et al. 05]
(Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005)) describes digital media preservation and access through semantically enhanced web annotation.
[Polajnar et al. 05]
(University of Sheffield Research Memorandum CS-05-10) describes user-friendly ontology authoring using a controlled language.
[Aswani et al. 06]
(Proceedings of the 5th International Semantic Web Conference (ISWC2006)) addresses the problem of disambiguating author instances in an ontology, describing a web-based approach that uses features such as publication titles, abstracts, initials and co-authorship information.

Chapter 2
Change Log [#]

This chapter lists major changes to GATE in roughly chronological order by release. Changes in the documentation are also referenced here.

2.1 Version 5.0 (May 2009) [#]

Note: existing users – if you delete your user configuration file for any reason you will find that GATE no longer loads the ANNIE plugin by default. You will need to manually select “load always” in the plugin manager to get the old behaviour.

2.1.1 Major new features

JAPE language improvements

Several new extensions to the JAPE language to support more flexible pattern matching. Full details are in chapter 7 but briefly:

Some of these extensions are similar to, but not the same as, those provided by the Montreal Transducer plugin. If you are already familiar with the Montreal Transducer, you should first look at section 7.11 which summarises the differences.

Resource configuration via Java 5 Annotations

Introduced an alternative style for supplying resource configuration information via Java 5 annotations rather than in creole.xml. The previous approach is still fully supported as well, and the two styles can be freely mixed. See section 4.9 for full details.
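
As a brief illustration, a processing resource might declare its metadata and parameters directly on the class rather than in creole.xml. The following is a minimal sketch only (the PR itself is hypothetical, and it assumes the annotation types in the gate.creole.metadata package; see section 4.9 for the definitive reference):

import gate.creole.AbstractLanguageAnalyser;
import gate.creole.metadata.CreoleParameter;
import gate.creole.metadata.CreoleResource;
import gate.creole.metadata.Optional;
import gate.creole.metadata.RunTime;

// Hypothetical PR, shown only to illustrate annotation-driven configuration.
@CreoleResource(name = "Hello World PR",
    comment = "Example PR configured via Java 5 annotations")
public class HelloWorldPR extends AbstractLanguageAnalyser {

  private String annotationSetName;

  @Optional
  @RunTime
  @CreoleParameter(comment = "Annotation set to write results to")
  public void setAnnotationSetName(String name) { this.annotationSetName = name; }

  public String getAnnotationSetName() { return annotationSetName; }

  public void execute() {
    // getDocument() returns the document currently being processed
  }
}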

Ontology-based Gazetteer

Added a new plugin Ontology_Based_Gazetteer, which contains OntoRoot Gazetteer – a dynamically created gazetteer which is, in combination with few other generic GATE resources, capable of producing ontology-aware annotations over the given content with regards to the given ontology. For more details see section 9.31.

Inter-annotator agreement and merging

New plugins to support tasks involving several annotators working on the same annotation task on the same documents. The “iaaPlugin” (section 13.6) computes inter-annotator agreement scores between the annotators, the “copyAS2AnoDoc” plugin (section 9.33) copies annotations from several parallel documents into a single master document, and the “annotationMerging” plugin (section 9.30) merges annotations from multiple annotators into a single “consensus” annotation set.

Packaging self-contained applications for GATE Teamware

Added a mechanism to assemble a saved GATE application along with all the resource files it uses into a single self-contained package to run on another machine (e.g. as a service in GATE Teamware). This is available as a menu option (section 3.24) which will work for most common cases, but for complex cases you can use the underlying Ant task described in section C.2.

GUI improvements

2.1.2 Other new features and improvements

2.1.3 Specific bug fixes

Plus many more minor bug fixes

2.2 Version 4.0 (July 2007) [#]

2.2.1 Major new features

ANNIC

ANNotations In Context: a full-featured annotation indexing and retrieval system designed to support corpus querying and JAPE rule authoring. It is provided as part of an extension of the Serial Datastores, called the Searchable Serial Datastore (SSD). See section 9.29 for more details.

New machine learning API

A brand new machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. See chapter 11 for more details.

Ontology API

A new ontology API, based on OWL In Memory (OWLIM), which offers a better API, a revised ontology event model and an improved ontology editor, to name but a few of the improvements. See chapter 10 for more details.

OCAT

Ontology-based Corpus Annotation Tool to help annotators to manually annotate documents using ontologies. For more details please see section 10.9.

Alignment Tools

A new set of components (e.g. CompoundDocument, AlignmentEditor etc.) that help in building alignment tools and in carrying out cross-document processing. See chapter 12 for more details.

New HTML Parser

A new HTML document format parser, based on Andy Clark’s NekoHTML. This parser is much better than the old one at handling modern HTML and XHTML constructs, JavaScript blocks, etc., though the old parser is still available for existing applications that depend on its behaviour.

Java 5.0 support

GATE now requires Java 5.0 or later to compile and run. This brings a number of benefits:

2.2.2 Other new features and improvements

2.2.3 Bug fixes and optimizations

And as always there are many smaller bugfixes too numerous to list here...

2.3 Version 3.1 (April 2006)

2.3.1 Major new features

Support for UIMA

UIMA (http://www.research.ibm.com/UIMA/) is a language processing framework developed by IBM. UIMA and GATE share some functionality but are complementary in most respects. GATE now provides an interoperability layer to allow UIMA applications to include GATE components in their processing and vice-versa. For full information, see chapter 16.

New Ontology API

The ontology layer has been rewritten in order to provide an abstraction layer between the model representation and the tools used for input and output of the various representation formats. An implementation that uses Jena 2 (http://jena.sourceforge.net/ontology) for reading and writing OWL and RDF(S) is provided.

Ontotext Japec Compiler

Japec is a compiler for JAPE grammars developed by Ontotext Lab. It has some limitations compared to the standard JAPE transducer implementation, but can run JAPE grammars up to five times as fast. By default, GATE still uses the stable JAPE implementation, but if you want to experiment with Japec, see section 9.28.

2.3.2 Other new features and improvements

2.3.3 Bug fixes

2.4 January 2005

Release of version 3.

New plugins for processing in various languages (see 9.15). These are not full IE systems but are designed as starting points for further development (French, German, Spanish, etc.), or as sample or toy applications (Cebuano, Hindi, etc.).

Other new plugins:

Support for SVM Light, a support vector machine implementation, has been added to the machine learning plugin (see section 9.24.7).

2.5 December 2004

GATE no longer depends on the Sun Java compiler to run, which means it will now work on any Java runtime environment of at least version 1.4. JAPE grammars are now compiled using the Eclipse JDT Java compiler by default.

A welcome side-effect of this change is that it is now much easier to integrate GATE-based processing into web applications in Tomcat. See section 3.30 for details.

2.6 September 2004

GATE applications are now saved in XML format using the XStream library, rather than by using native Java serialisation. On loading an application, GATE will automatically detect whether it is in the old or the new format, and so applications in both formats can be loaded. However, older versions of GATE will be unable to load applications saved in the XML format. (A java.io.StreamCorruptedException: invalid stream header exception will occur.) It is possible to get new versions of GATE to use the old format by setting a flag in the source code. (See the Gate.java file for details.) This change has been made because it allows the details of an application to be viewed and edited in a text editor, which is sometimes easier than loading the application into GATE.

2.7 Version 3 Beta 1 (August 2004)

Version 3 incorporates a lot of new functionality and some reorganisation of existing components.

Note that Beta 1 is feature-complete but needs further debugging (please send us bug reports!).

Highlights include: completely rewritten document viewer/editor; extensive ontology support; a new plugin management system; separate .jar files and a Tomcat classloading fix; lots more CREOLE components (and some more to come soon).

Almost all the changes are backwards-compatible; some recent classes have been renamed (particularly the ontologies support classes) and a few events added (see below); datastores created by version 3 will probably not read properly in version 2. If you have problems use the mailing list and we’ll help you fix your code!

The gory details:

2.8 July 2004

GATE Documents now fire events when the document content is edited. This was added in order to support the new facility of editing documents from the GUI. This change will break backwards compatibility by requiring all DocumentListener implementations to implement a new method:
public void contentEdited(DocumentEvent e);
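
For existing code, the update can be as small as adding an empty implementation of the new method. A minimal sketch of an updated listener is shown below (the logging body is purely illustrative; the other two methods are those already required by gate.event.DocumentListener):

import gate.event.DocumentEvent;
import gate.event.DocumentListener;

// Sketch of a DocumentListener updated for the new contentEdited callback.
public class LoggingDocumentListener implements DocumentListener {

  public void annotationSetAdded(DocumentEvent e) { }

  public void annotationSetRemoved(DocumentEvent e) { }

  // New method (July 2004): fired when the document content is edited.
  public void contentEdited(DocumentEvent e) {
    System.out.println("Document content edited: " + e);
  }
}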

2.9 June 2004

A new algorithm has been implemented for the AnnotationDiff function. A new, more usable GUI is included, and an ‘Export to HTML’ option has been added. More details about the AnnotationDiff tool are in Section 3.25.

A new build process, based on ANT (http://ant.apache.org/) is now available for GATE. The old build process, based on make, is now unsupported. See Section 3.8 for details of the new build process.

A JAPE Debugger from Ontos AG has been integrated in GATE. You can turn integration ON with the command-line option ‘-j’. If you run the GATE GUI with this option, a new menu item for the JAPE Debugger GUI will appear in the Tools menu. The default value of integration is OFF. We are currently awaiting documentation for this.

NOTE: a ClassCastException occurs if you try to debug a ConditionalCorpusPipeline; the JAPE Debugger is designed for Corpus Pipelines only. The Ontos code needs to be changed to allow debugging of ConditionalCorpusPipeline.

2.10 April 2004

GATE now has two alternative strategies for ontology-aware grammar transduction:

The changes are in:

More information about the ontology-aware transducer can be found in Section 10.6.

A morphological analyser PR has been added to GATE. This finds the root and affix values of a token and adds them as features to that token.

A flexible gazetteer PR has been added to GATE. This performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. See 9.5 for details.

2.11 March 2004

Support was added for the MAXENT machine learning library. (See 9.24.6 for details.)

2.12 Version 2.2 – August 2003

Note that GATE 2.2 works with JDK 1.4.0 or above. Version 1.4.2 is recommended, and is the one included with the latest installers.

GATE has been adapted to work with Postgres 7.3. The compatibility with PostgreSQL 7.2 has been preserved. See 3.38 for more details.

New library version – Lucene 1.3 (rc1)

A bug in gate.util.Javac has been fixed in order to account for situations when String literals require an encoding different from the platform default.

Temporary .java files used to compile JAPE RHS actions are now saved using UTF-8, and the ‘-encoding UTF-8’ option is passed to the javac compiler.

A custom tools.jar is no longer necessary

Minor changes have been made to the look and feel of GATE to improve its appearance with JDK 1.4.2

Some bug fixes (087, 088, 089, 090, 091, 092, 093, 095, 096 – see http://gate.ac.uk/gate/doc/bugs.html for more details).

2.13 Version 2.1 – February 2003

Integration of Machine Learning PR and WEKA wrapper (see Section 9.24).

Addition of DAML+OIL exporter.

Integration of WordNet in GATE (see Section 9.23).

The syntax tree viewer has been updated to fix some bugs.

2.14 June 2002

Conditional versions of the controllers are now available (see Section 3.15). These allow processing resources to be run conditionally on document features.

PostgreSQL Data Stores are now supported (see Section 4.7). These store data into a PostgreSQL RDBMS.

Addition of OntoGazetteer (see Section 5.2), an interface which makes ontologies visible within GATE, and supports basic methods for hierarchy management and traversal.

Integration of Protégé, so that people who have developed Protégé ontologies can use them within GATE.

Addition of IR facilities in GATE (see Section 9.19).

Modification of the corpus benchmark tool (see Section 3.26), which now takes an application as a parameter.

See also for details of other recent bug fixes.

Chapter 3
How To… [#]

“The law of evolution is that the strongest survives!”

“Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. …There is no strength to be gained from hurting one another. Only weakness.”

The Dispossessed [p.183], Ursula K. le Guin, 1974.

This chapter describes how to complete common tasks using GATE.

Read first the sections with an asterisk (*).

Sections that relate to the Development Environment are flagged [D]; those that relate to the framework are flagged [F]; sections relating to both are flagged [D,F].

There are two other primary sources for this type of information:

  1. for the development environment, see the visual tutorials available on our movies page;
  2. for the framework, see the example code at http://gate.ac.uk/gate-examples/doc/.

3.1 Download GATE* [#]

To download GATE point your web browser at http://gate.ac.uk/download/.

You should next read the section 3.2 to install and run GATE.

3.2 Install and Run GATE* [#]

GATE 3.1 will run anywhere that supports Java version 1.4.2 or later, including Solaris, Linux and Windoze platforms. GATE 4 and 5 require Java 5.0. We don’t run tests on other platforms, but have had reports of successful installs elsewhere. We are also testing released installers on MacOS X.

You should next read the section 3.6 to know about the GUI.

3.2.1 The Easy Way

The easy way to install is to use one of the platform-specific installers (created using the excellent IzPack). Download a ‘platform-specific installer’ and follow the instructions it gives you. Once the installation is complete, you can start GATE using gate.exe (Windows) or GATE.app (Mac) in the top-level installation directory, or gate.sh in the bin directory (other platforms).

3.2.2 The Hard Way (1)

Download the Java-only release package or the binary build snapshot, and follow the instructions below.

Prerequisites:

Using the binary distribution:

The Ant scripts that start GATE (ant.bat or ant) require you to set the JAVA_HOME environment variable to point to the top level directory of your Java installation. The value of GATE_CONFIG is passed to the system by the scripts using either a -i command-line option, or the Java property gate.config.

3.2.3 The Hard Way (2): Subversion [#]

The GATE code is maintained in a Subversion repository. You can use a Subversion client to check out the source code – the most up-to-date version of GATE is the trunk:
svn checkout https://gate.svn.sourceforge.net/svnroot/gate/gate/trunk gate

Once you have checked out the code you can build GATE using Ant (see section 3.8)

You can browse the complete Subversion repository online at http://gate.svn.sourceforge.net/gate.

3.3 [D,F] Use System Properties with GATE [#]

During initialisation, GATE reads several Java system properties in order to decide where to find its configuration files.

Here is a list of the properties used, their default values and their meanings:

gate.home
sets the location of the GATE install directory. This should point to the top level directory of your GATE installation. This is the only property that is required. If this is not set, the system will display an error message and then attempt to guess the correct value.
gate.plugins.home
points to the location of the directory containing installed GATE plug-ins (a.k.a. CREOLE directories). If this is not set then the default value of {gate.home}/plugins is used.
gate.site.config
points to the location of the configuration file containing the site-wide options. If not set this will default to {gate.home}/gate.xml. The site configuration file must exist!
gate.user.config
points to the file containing the user’s options. If not specified, or if the specified file does not exist at startup time, the default value of gate.xml (.gate.xml on Unix platforms) in the user’s home directory is used.
gate.user.session
points to the file containing the user’s saved session. If not specified, the default value of gate.session (.gate.session on Unix) in the user’s home directory is used. When starting up the GUI the session is reloaded from this file if it exists, and when exiting the GUI the session is saved to this file (unless the user has disabled “save session on exit” in the configuration dialog). The session is not used when using GATE as a library.
load.plugin.path
is a path-like structure, i.e. a list of URLs separated by ‘;’. All directories listed here will be loaded as CREOLE plugins during initialisation. This provides similar functionality to the -d command-line option.
gate.builtin.creole.dir
is a URL pointing to the location of GATE’s built-in CREOLE directory. This is the location of the creole.xml file that defines the fundamental GATE resource types, such as documents, document format handlers, controllers and the basic visual resources that make up the GATE GUI. The default points to a location inside gate.jar and should not generally need to be overridden.

When using GATE as a library, you can set the values for these properties before you call Gate.init(). Alternatively, you can set the values programmatically using the static methods setGateHome(), setPluginsHome(), setSiteConfigFile(), etc. before calling Gate.init(). See the Javadoc documentation for details. If you want to set these values from the command line, use the following syntax (here setting gate.home):

java -Dgate.home=/my/new/gate/home/directory -cp... gate.Main
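When embedding GATE in your own code, the same effect can be achieved with the static setter methods mentioned above. A minimal sketch (assuming gate.Gate and java.io.File are imported; the paths are placeholders and must be adjusted for your installation):

  // equivalent of -Dgate.home=... on the command line
  Gate.setGateHome(new File("/my/new/gate/home/directory"));
  // optional overrides for the plugins directory and the site config file
  Gate.setPluginsHome(new File("/my/new/gate/home/directory", "plugins"));
  Gate.setSiteConfigFile(new File("/my/new/gate/home/directory", "gate.xml"));
  Gate.init();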

When running the GUI, you can set the properties by creating a file build.properties in the top level GATE directory. In this file, any system properties which are prefixed with “run.” will be passed to GATE. For example, to set an alternative user config file, put the following line in build.properties:

run.gate.user.config=${user.home}/alternative-gate.xml

This facility is not limited to the GATE-specific properties listed above, for example the following line changes the default temporary directory for GATE (note the use of forward slashes, even on Windows platforms):

run.java.io.tmpdir=d:/bigtmp

3.4 [D,F] Use (CREOLE) Plug-ins [#]

The definitions of CREOLE resources (see Chapter 4) are stored in CREOLE directories (directories containing an XML file describing the resources, the Java archive with the compiled executable code and whatever libraries are required by the resources).

Starting with version 3, CREOLE directories are called “CREOLE Plugins” or simply “Plugins”. In previous versions, the CREOLE resources distributed with GATE used to be included in the monolithic gate.jar archive. Version 3 includes them as separate directories under the plugins directory of the distribution. This allows easy access to the linguistic resources used without the need to unpack the gate.jar file.

Plugins can have one or more of the following states in relation with GATE:

known
plugins are those plugins that the system knows about. These include all the plugins in the plugins directory of the GATE installation (the so–called installed plugins) as well as all the plugins that were manually loaded from the user interface.
loaded
plugins are the plugins currently loaded in the system. All CREOLE resource types from the loaded plugins are available for use. All known plugins can easily be loaded and unloaded using the user interface.
auto-loadable
plugins are the list of plugins that the system loads automatically during initialisation.

The default location for installed plugins can be modified using the gate.plugins.home system property while the list of auto-loadable plugins can be set using the load.plugin.path property, see Section 3.3 above.

The CREOLE plugins can be managed through the graphical user interface, which can be activated by selecting “Manage CREOLE plugins” from the “File” menu. This will bring up a window listing all the known plugins. For each plugin there are two check-boxes – one labelled “Load now”, which will load the plugin, and the other labelled “Load always”, which will add the plugin to the list of auto-loadable plugins. A “Delete” button is also provided, which will remove the plugin from the list of known plugins. Note that installed plugins will return to the list of known plugins the next time GATE is started; they can only be removed by physically removing (or moving) the actual directory on disk out of the GATE plugins directory.

When using GATE as a library the following API calls are relevant to working with plugins:

Class gate.Gate

public static void addKnownPlugin(URL pluginURL)
adds the plugin to the list of known plugins.
public static void removeKnownPlugin(URL pluginURL)
tells the system to “forget” about one previously known directory. If the specified directory was loaded, it will be unloaded as well - i.e. all the metadata relating to resources defined by this directory will be removed from memory.
public static void addAutoloadPlugin(URL pluginUrl)
adds a new directory to the list of plugins that are loaded automatically at start-up.
public static void removeAutoloadPlugin(URL pluginURL)
tells the system to remove a plugin URL from the list of plugins that are loaded automatically at system start-up. This will be reflected in the user’s configuration data file.

Class gate.CreoleRegister

public void registerDirectories(URL directoryUrl)
loads a new CREOLE directory. The new plugin is added to the list of known plugins if not already there.
public void removeDirectory(URL directory)
unloads a loaded CREOLE plugin.
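For example, a minimal sketch (assuming the usual gate.*, java.io.File and java.net.URL imports; the ANNIE plugin directory is used purely as an illustration) that loads a plugin and also registers it for automatic loading at future start-ups:

  Gate.init();
  // URL of a CREOLE plugin directory (here the ANNIE plugin shipped
  // in the plugins directory of the GATE installation)
  URL annieHome =
    new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL();
  // load the plugin now
  Gate.getCreoleRegister().registerDirectories(annieHome);
  // and remember it so that it is loaded automatically at start-up
  Gate.addAutoloadPlugin(annieHome);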

3.5 Troubleshooting

On Windoze 95 and 98, you may need to increase the amount of environment space available for the gate.bat script. Right click on the script, hit the memory tab and increase the ‘initial environment’ value to maximum.

Note that the gate.bat script uses javaw.exe to run GATE which means that you will see no console for the java process. If you have problems starting GATE and you would like to be able to see the console to check for messages then you should edit the gate.bat script and replace javaw.exe with java.exe in the definition of the JAVA environment variable.

When our FTP server is overloaded you may get a blank download link in the email sent to you after you register. Please try again later.

3.6 [D] Get Started with the GUI* [#]

Probably the best way to learn how to use the GATE graphical development environment is to look at the demonstration and tutorial movies. There are specific links to them in this chapter.

This section gives a short description of what is where in the main window of the system.




Figure 3.1: Main Window


Figure 3.1 shows the main window of the application, with a single document loaded. There are five main areas of the window:

  1. the menu bar along the top, with ‘File’ etc.;
  2. in the top left of the main area, a tree starting from ‘GATE’ and containing ‘Applications’, ‘Language Resources’ etc. – this is the resources tree;
  3. in the bottom left of the main area, a rectangle, which is the small resource viewer;
  4. on the right of the main area, containing tabs with ‘Messages’ or the name of a resource from the resource tree, the main resource viewer;
  5. the messages bar along the bottom (where it says ‘Views built!’).

The menu and the messages bar do the usual things. Longer messages are displayed in the messages tab in the main resource viewer area.

The resource tree and resource viewer areas work together to allow the system to display diverse resources in various ways. Visual Resources integrated with GATE can have a small view or a large view. For example, data stores have a small view; documents have a large view.

All the resources, applications and datastores currently loaded in the system appear in the resources tree; double clicking on a resource will load a viewer for the resource in one of the resource view areas.

You should next read the section 3.12 to load creole resources.

3.7 [D,F] Configure GATE [#]

When the GATE development environment is started, or when Gate.init() is called from the API, GATE loads various sorts of configuration data stored as XML in files generally called something like gate.xml or .gate.xml. This data holds information such as:

All of this type of data is stored at two levels (in order from general to specific):

Where configuration data appears on several different levels, the more specific ones overwrite the more general. This means that you can set defaults for all GATE users on your system, for example, and allow individual users to override those defaults without interfering with others.

Configuration data can be set from the GUI via the ‘Options’ menu, ‘Configuration’ choice. The user can change the appearance of the GUI (via the Appearance submenu), which includes the options of font and the “look and feel”. The “Advanced” submenu enables the user to include annotation features when saving the document and preserving its format, to save the selected Options automatically on exit, and to save the session automatically on exit. The Input Methods menu (available via the Options menu) enables the user to change the default language for input. These options are all stored in the user’s .gate.xml file.

When using GATE from the framework, you can also set the site config location using Gate.setSiteConfigFile(File) prior to calling Gate.init().

3.7.1 [F] Save Config Data to gate.xml

Arbitrary feature/value data items can be saved to the user’s gate.xml file via the following API calls:

To get the config data: Map configData = Gate.getUserConfig().

To add config data simply put pairs into the map: configData.put("my new config key", "value");.

To write the config data back to the XML file: Gate.writeUserConfig();.

Note that new config data will simply override old values, where the keys are the same. In this way defaults can be set up by putting their values in the main gate.xml file, or the site gate.xml file; they can then be overridden by the user’s gate.xml file.
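Putting these together, a minimal sketch (assuming GATE has already been initialised and gate.Gate and java.util.Map are imported; the key is just an example):

  // get the user's configuration data
  Map configData = Gate.getUserConfig();
  // add or override an entry
  configData.put("my new config key", "value");
  // write the configuration back to the user's gate.xml file
  Gate.writeUserConfig();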

3.8 Build GATE [#]

Note that you don’t need to build GATE unless you’re doing development on the system itself.

Prerequisites:

GATE now includes a copy of the ANT build tool which can be accessed through the scripts included in the bin directory (use ant.bat for Windows 98 or ME, ant.cmd for Windows NT, 2000 or XP, and ant.sh for Unix platforms).

To build gate, cd to gate and:

  1. Type:
    bin/ant
  2. [optional] To test the system:
    bin/ant test
    (Note that DB tests may fail unless you can connect to Sheffield’s Oracle server.)
  3. [optional] To make the Javadoc documentation:
    bin/ant doc
  4. You can also run GATE using Ant, by typing:
    bin/ant run
  5. To see a full list of options type: bin/ant help

(The details of the build process are all specified by the build.xml file in the gate directory.)

You can also use a development environment like Borland JBuilder (click on the gate.jpx file), but note that it’s still advisable to use ant to generate documentation, the jar file and so on. Also note that the run configurations have the location of a gate.xml site configuration file hard-coded into them, so you may need to change these for your site.

3.9 [D] Use GATE with Maven or JPF [#]

This section is based on contributions by Georg Öttl and William Oberman.

To use GATE with Maven you need a definition of the dependencies in POM format. There’s an example POM here.

To use GATE with JPF (a Java plugin framework) you need a plugin definition like this one.

3.10 [D,F] Create a New (CREOLE) Resource [#]

CREOLE resources are Java Beans (see chapter 4). They come in three types: Language Resource, Processing Resource and Visual Resource (see chapter 1 section 1.3.1). To create a new resource you need to:

The GATE development environment helps you with this process by creating a set of directories and files that implement a basic resource, including a Java code file and a Makefile. This process is called ‘bootstrapping’.

For example, let’s create a new component called GoldFish, which will be a Processing Resource that looks for all instances of the word ‘fish’ in a document and adds an annotation of type ‘GoldFish’.

First start the GATE development environment (see section 3.2). From the ‘Tools’




Figure 3.2: BootStrap Wizard Dialogue


menu select ‘BootStrap Wizard’, which will pop up the dialogue in figure 3.2. The meaning of the data entry fields:

Now we need to compile the class and package it into a JAR file. The bootstrap wizard creates an Ant build file that makes this very easy – so long as you have Ant set up properly, you can simply run

ant jar

This will compile the Java source code and package the resulting classes into GoldFish.jar. If you don’t have your own copy of Ant, you can use the one bundled with GATE - suppose your GATE is installed at /opt/gate-5.0-snapshot, then you can use /opt/gate-5.0-snapshot/bin/ant jar to build.

You can now load this resource into GATE; see

The default Java code that was created for our GoldFish resource looks like this:

/*  
 *  GoldFish.java  
 *  
 *  You should probably put a copyright notice here. Why not use the  
 *  GNU licence? (See http://www.gnu.org/.)  
 *  
 *  hamish, 26/9/2001  
 *  
 *  $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $  
 */  
 
package sheffield.creole.example;  
 
import java.util.*;  
import gate.*;  
import gate.creole.*;  
import gate.util.*;  
 
/**  
 * This class is the implementation of the resource GOLDFISH.  
 */  
@CreoleResource(name = "GoldFish",  
        comment = "Add a descriptive comment about this resource")  
public class GoldFish extends AbstractProcessingResource  
  implements ProcessingResource {  
 
 
} // class GoldFish
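The generated class body is empty; the processing logic is up to you. As a rough sketch (not produced by the wizard), an execute() method along the following lines would implement the behaviour described above, assuming the class is changed to extend gate.creole.AbstractLanguageAnalyser so that it inherits the document runtime parameter:

  /** Annotate every occurrence of the word 'fish' as type GoldFish.
   *  A minimal sketch only; matching is exact and case sensitive. */
  public void execute() throws ExecutionException {
    Document doc = getDocument();
    if(doc == null)
      throw new ExecutionException("No document to process");
    String text = doc.getContent().toString();
    AnnotationSet outputAS = doc.getAnnotations(); // default annotation set
    int start = text.indexOf("fish");
    while(start >= 0) {
      try {
        outputAS.add(new Long(start), new Long(start + 4),
                     "GoldFish", Factory.newFeatureMap());
      } catch(InvalidOffsetException e) {
        throw new ExecutionException(e);
      }
      start = text.indexOf("fish", start + 1);
    }
  }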

The default XML configuration for GoldFish looks like this:

<!-- creole.xml GoldFish -->  
<!--  hamish, 26/9/2001 -->  
<!-- $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $ -->  
 
<CREOLE-DIRECTORY>  
  <JAR SCAN="true">GoldFish.jar</JAR>  
</CREOLE-DIRECTORY>

The directory structure containing these files




Figure 3.3: BootStrap directory tree


is shown in figure 3.3. GoldFish.java lives in the src/sheffield/creole/example directory. creole.xml and build.xml are in the top GoldFish directory. The lib directory is for libraries; the classes directory is where Java class files are placed; the doc directory is for documentation. These last two, plus GoldFish.jar are created by Ant.

This process has the advantage that it creates a complete source tree and build structure for the component, and the disadvantage that it creates a complete source tree and build structure for the component. If you already have a source tree, you will need to chop out the bits you need from the new tree (in this case GoldFish.java and creole.xml) and copy them into your existing one.

See the example code at http://gate.ac.uk/gate-examples/doc/.

3.11 [F] Instantiate (CREOLE) Resources [#]

This section describes how to create CREOLE resources as objects in a running Java virtual machine. This process involves using GATE’s Factory class, and, in the case of LRs, may also involve using a DataStore.

CREOLE resources are Java Beans; creation of a resource object involves using a default constructor, then setting parameters on the bean, then calling an init() method. The Factory takes care of all this, makes sure that the GUI is told about what is happening (when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. So a programmer using GATE should never call the constructor of a resource: always use the Factory.

The valid parameters for a resource are described in the resource’s section of its creole.xml file or in Java annotations on the resource class – see section 4.9.

Creating a resource via the Factory involves passing values for any create-time parameters that require setting to the Factory’s createResource method. If no parameters are passed, the defaults are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:

Gate.getCreoleRegister().registerDirectories(new File(  
  Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());  
FeatureMap params = Factory.newFeatureMap(); // empty map: default parameters  
ProcessingResource tagger = (ProcessingResource)  
  Factory.createResource("gate.creole.POSTagger", params);

Note that if the resource created here had any parameters that were both mandatory and had no default value, the createResource call would throw an exception. In this case, all the information needed to create a tagger is available in default values given in the tagger’s XML definition (in plugins/ANNIE/creole.xml):

<RESOURCE>  
  <NAME>ANNIE POS Tagger</NAME>  
  <COMMENT>Mark Hepple’s Brill-style POS tagger</COMMENT>  
  <CLASS>gate.creole.POSTagger</CLASS>  
  <PARAMETER NAME="document"  
    COMMENT="The document to be processed"  
    RUNTIME="true">gate.Document</PARAMETER>  
....  
  <PARAMETER NAME="rulesURL" DEFAULT="resources/heptag/ruleset"  
    COMMENT="The URL for the ruleset file"  
    OPTIONAL="true">java.net.URL</PARAMETER>  
</RESOURCE>

Here the two parameters shown are either ‘runtime’ parameters, which are set before a PR is executed, or have a default value (in this case the default rules file is distributed with GATE itself).

When creating a Document, however, the URL of the source for the document must be provided. For example:

URL u = new URL("http://gate.ac.uk/hamish/");  
FeatureMap params = Factory.newFeatureMap();  
params.put("sourceUrl", u);  
Document doc = (Document)  
  Factory.createResource("gate.corpora.DocumentImpl", params);

The document created here is transient: when you quit the JVM the document will no longer exist. If you want the document to be persistent, you need to store it in a DataStore. Assuming that you have a DataStore already open called myDataStore, this code will ask the data store to take over persistence of your document, and to synchronise the memory representation of the document with the disk storage:

Document persistentDoc = myDataStore.adopt(doc, mySecurity);  
myDataStore.sync(persistentDoc);

Security:
User access to the LRs is provided by a security mechanism of users and groups, similar to those on an operating system. When users create/save LRs into Oracle, they specify reading and writing access rights for users from their group and other users. For example, LRs created by one user/group can be made read-only to others, so they can use the data, but not modify it. The access modes are:

If needed, ownership can be transferred from one user to another. Users, groups and LR permissions are administered in a special administration tool, by a privileged user. For more details see chapter 14.

When you want to restore a document (or other LR) from a data store, you make the same createResource call to the Factory as for the creation of a transient resource, but this time you tell it the data store the resource came from, and the ID of the resource in that datastore:

  URL u = ....; // URL of a serial data store directory  
  SerialDataStore sds = new SerialDataStore(u.toString());  
  sds.open();  
 
  // getLrIds returns a list of LR Ids, so we get the first one  
  Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0);  
 
  // we need to tell the factory about the LR’s ID in the data  
  // store, and about which data store it is in - we do this  
  // via a feature map:  
  FeatureMap features = Factory.newFeatureMap();  
  features.put(DataStore.LR_ID_FEATURE_NAME, lrId);  
  features.put(DataStore.DATASTORE_FEATURE_NAME, sds);  
 
  // read the document back  
  Document doc = (Document)  
    Factory.createResource("gate.corpora.DocumentImpl", features);

See the example code at http://gate.ac.uk/gate-examples/doc/.

3.12 [D] Load Resources: document, tokenizer...* [#]

3.12.1 Loading Language Resources: document, corpora... [#]

Load a language resource by right clicking on “Language Resources” in the resource tree and selecting a language resource type (document, corpus or annotation schema), or by going to the ’File’ menu and choosing ’New Language Resource’. Optionally choose a name for the resource, and set any parameters as necessary.

For a document, a file or URL should be entered as the value of “sourceUrl” (double clicking in the “values” box brings up a tree structure to enable selection of documents). The easiest way to start is to enter a URL such as ’http://gate.ac.uk’. Other parameters can be selected or changed as necessary, such as the encoding of the document, and whether it should be markup aware.

See also the movie for creating documents.

There are three ways of adding documents to a corpus:

  1. When creating the corpus, clicking on the icon under Value brings up a popup window with a list of the documents already loaded into GATE. This enables the user to add any documents to the corpus.
  2. Alternatively, the corpus can be loaded first, and documents added later by double clicking on the corpus and using the + and - icons to add or remove documents to the corpus. Note that the documents must have been loaded into GATE before they can be added to the corpus.
  3. Once loaded, the corpus can be populated by right clicking on the corpus and selecting “Populate”. With this method, documents do not have to have been previously loaded into GATE, as they will be loaded during the population process. Select the directory containing the relevant files, choose the encoding, and check or uncheck the “recurse directories” box as appropriate. The initial value for the encoding is the platform default.

Additionally, right-clicking on a loaded document in the tree and selecting the “New corpus with this document” option creates a new transient corpus named Corpus for document name containing just this document. To add a new annotation schema, simply choose the name and the path or URL. For more information about schemas, see Section 6.4.1.

See also the movie for creating and populating corpora.

You should next read the section 3.16 to view annotations.

3.12.2 Loading Processing Resources: tokenizer, gazetteer... [#]

This section describes how to load and run CREOLE resources not present in ANNIE. To load ANNIE, see Section 3.17. For technical descriptions of these resources, see Chapter 9. First ensure that the necessary plugins have been loaded (see Section 3.4). If the resource you require does not appear in the list of Processing Resources, then you probably do not have the necessary plugin loaded. Processing resources are loaded by selecting them from the set of Processing Resources (right click on Processing Resources or select “New Processing Resource” from the File menu), adding them to the application and selecting the necessary parameters (e.g. input and output Annotation Sets).

See also the movie for loading processing resources.

You should next read the section 3.14 to create and run an application.

3.12.3 Loading and Processing Large Corpora [#]

When trying to process a larger corpus (i.e. one that would not fit in memory at one time) the use of a datastore and persistent corpora is required.

Open or create a datastore (see section 3.22) and then create a corpus. Save the so far empty corpus to the datastore – this will convert it to a persistent corpus.

When populating or processing the persistent corpus, the documents it contains will only be loaded one at a time, thus reducing the amount of memory required to only that necessary for loading the largest document in the collection.

3.13 [D,F] Configure (CREOLE) Resources [#]

Full details on how to supply configuration data for resources can be found in section 4.9.

3.14 [D] Create and Run an Application* [#]

Once all the resources have been loaded, an application can be created and run. Right click on “Applications” and select “New” and then either “Corpus Pipeline” or “Pipeline”. A pipeline application can only be run over a single document, while a corpus pipeline can be run over a whole corpus.

To build the pipeline, double click on it, and select the resources needed to run the application (you may not necessarily wish to use all those which have been loaded). Transfer the necessary components from the set of “loaded components” displayed on the left hand side of the main window to the set of “selected components” on the right, by selecting each component and clicking on the left and right arrows, or by double-clicking on each component. Ensure that the components selected are listed in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the left side of the pane. Ensure that any parameters necessary are set for each processing resource (by clicking on the resource from the list of selected resources and checking the relevant parameters from the pane below). For example, if you wish to use annotation sets other than the Default one, these must be defined for each processing resource. Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-down menu beside the “corpus” box. If a pipeline is used, the document must be selected for each processing resource used. Finally, right-click on “Run” to run the application on the document or corpus.

See also the movie for loading and running processing resources.

For how to use the conditional versions of the pipelines see section 3.15 and for saving/restoring the configuration of an application see section 3.23.

You should next read the section 3.17 to do information extraction.

3.15 [D] Run PRs Conditionally on Document Features [#]

The “Conditional Pipeline” and “Conditional Corpus Pipeline” application types are conditional versions of the pipelines mentioned in section 3.14 and allow processing resources to be run or not according to the value of a feature on the document. In terms of graphical interface, the only addition brought by the conditional versions of the applications is a box situated underneath the lists of available and selected resources which allows the user to choose whether the currently selected processing resource will run always, never or only on the documents that have a particular value for a named feature.

If the Yes option is selected then the corresponding resource will be run on all the documents processed by the application, as in the case of non-conditional applications. If the No option is selected then the corresponding resource will never be run; the application will simply ignore its presence. This option can be used to temporarily and quickly disable an application component, for example for debugging purposes.

The If value of feature option permits running specific application components conditionally on document features. When selected, this option enables two text input fields that are used to enter the name of a feature and the value of that feature for which the corresponding processing resource will be run. When a conditional application is run over a document, for each component that has an associated condition, the value of the named feature is checked on the document and the component will only be used if the value entered by the user matches the one contained in the document features.

3.16 [D] View Annotations* [#]

If you have no document already loaded in the resource tree, see first section 3.12.

To view a document, double click on the filename in the resource tree (left hand pane). Note that it may take a few seconds for the text to be displayed if it is a big document and/or with a lot of annotations.

To view the annotation sets, click on the ’Annotation Sets’ button at the top of the document view or use the F3 key. If you keep the Shift key pressed while doing so, the same annotations will be selected as in the last document viewed; otherwise no annotations will be selected. This will bring up the annotation sets viewer, which displays the annotation sets available and their corresponding annotation types. Note that the default annotation set has no name. If no application has been run, the only annotations to be displayed will be those corresponding to the document format analysis performed automatically by GATE on loading the document (e.g. HTML or XML tags). If an application has been run, other annotation types and/or annotation sets may also be present. The fonts and colours of the annotations can be edited by double clicking on the annotation name or pressing the Enter key.

Select the annotation types to be viewed by clicking on the appropriate checkbox(es) or pressing the Space key. The text segments corresponding to these annotations will be highlighted in the main text window.

To view the annotations and their features, click on the ’Annotations list’ button at the top or bottom of the main window or use the F4 key. The annotation list viewer will appear above or below the main text, respectively. It will only contain the annotations selected from the annotation sets. These lists can be sorted in ascending or descending order by any column, by clicking on the corresponding column heading. You can also hide a column using the right-click context menu. Clicking on an entry in the table will also highlight the matching text portion.

Hovering over some part of the text in the main window will bring up a popup box containing a list of the annotations associated with it (assuming that the relevant annotation types have been selected from the annotation set viewer).

Annotations relating to coreference (if relevant) are displayed separately in the coreference viewer. This operates in the same way as the annotation sets viewer.

At any time, the main viewer can also be used to display other information, such as Messages, by clicking on the header at the top of the main window. If an error occurs in processing, the messages tab will flash red, and an additional popup error message may also occur.

Text in a loaded document can be edited in the document viewer. The usual platform specific cut, copy and paste keyboard shortcuts should also work, depending on your operating system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a magnifying glass, at the top of the document editor is for searching in the document. To prevent the new annotation windows popping up when a piece of text is selected, hide the AnnotationSets view (the tree on the right) first to make it inactive. The highlighted portions of the text will still remain visible.

See also the movie for inspecting the processing results.

You should next read the section 3.19 to create and edit annotations.

3.17 [D] Do Information Extraction with ANNIE* [#]

This section describes how to load and run ANNIE (see Chapter 8) from the development environment. To embed ANNIE in other software, see section 3.28.

From the File menu, select “Load ANNIE system”. To run it in its default state, choose “With Defaults”. This will automatically load all the ANNIE resources, and create a corpus pipeline called ANNIE with the correct resources selected in the right order, and the default input and output annotation sets.

If “Without Defaults” is selected, the same processing resources will be loaded, but a popup window will appear for each resource, which enables the user to specify a name and location for the resource. This is exactly the same procedure as for loading a processing resource individually, the difference being that the system automatically selects those resources contained within ANNIE. When the resources have been loaded, a corpus pipeline called ANNIE will be created as before.

The next step is to add a corpus (see Section 3.12.1), and select this corpus from the drop-down Corpus menu in the Serial Application editor. Finally click on Run (from the Serial Application editor, or by right clicking on the application name and selecting “Run”). To view the results, double click on the filename in the left hand pane. No Annotation Sets or Annotations will be shown until annotations are selected in the Annotation Sets view; the Default set is indicated only by an unlabelled right-arrowhead which must be selected in order to make the available annotations visible.

See also the movie for loading and running ANNIE.

3.18 [D] Modify ANNIE [#]

You will find the ANNIE resources in gate/plugins/ANNIE/resources. Simply locate the existing resources you want to modify, make a copy with a new name, edit them, and load the new resources into GATE as new Processing Resources (see Section 3.12.2).

3.19 [D] Create and Edit Annotations* [#]

Since many NLP algorithms require annotated corpora for training, GATE’s development environment provides easy-to-use and extendable facilities for text annotation. The annotation can be done manually by the user or semi-automatically by running some processing resources over the corpus and then correcting/adding new annotations manually. Depending on the information that needs to be annotated, some ANNIE modules can be used or adapted to bootstrap the corpus annotation task.

To create annotations manually:

Then, to edit annotations:

The popup menu only contains annotation types present in the Annotation Schema and those already listed in the relevant Annotation Set. To create a new Annotation Schema, see Section 3.21. The popup menu can be edited to add a new annotation type, however.

The new annotation created will automatically be placed in the annotation set that has been selected (highlighted) by the user. To create a new annotation set, type the name of the new set to be created in the box below the list of annotation sets, and click on ”New”.

Figure 3.4 demonstrates adding an ’Organization’ annotation for the string “EPSRC” (highlighted in green) to the default annotation set (blank name in the annotation set view on the right), with a feature named ’type’ and a value about to be added.




Figure 3.4: Adding an Organization annotation to the Default Annotation Set


To add a second annotation to a selected piece of text, or to add an overlapping annotation to an existing one, press the CTRL key to avoid the existing annotation popup appearing, and then select the text and create the new annotation. Again by default the last annotation type to have been used will be displayed; change this to the new annotation type. When a piece of text has more than one annotation associated with it, on mouseover all the annotations will be displayed. Selecting one of them will bring up the relevant annotation popup.




Figure 3.5: Search and annotate function of the annotation editor.


To search the document and annotate matches automatically, use the search and annotate function as shown in figure 3.5:

Note that after using the [First] button you can move the caret in the document and use the [Next] button to avoid continuing the search from the beginning of the document. The [?] button at the end of the search text field will help you build powerful regular expressions to search with.

You should next read the section 3.20 to save annotations.

3.19.1 Schema-driven editing [#]

An alternative annotation editor component is available which constrains the available annotation types and features much more tightly, based on the annotation schemas that are currently loaded. This is particularly useful when annotating large quantities of data or for use by less skilled users.

Annotation schemas provide a means to define types of annotations in GATE - basically this means that GATE ”knows about” annotations defined in a schema.

The default annotation schema contains common named entities such as Person, Organisation, Location, etc. You can modify the existing schema or create a new one, in order to tell GATE about other kinds of annotations you frequently use. You can still create annotations in GATE without having specified them in an annotation schema, but you may then need to tell GATE about the properties of that annotation type each time you create an annotation for it.

To use this, you must load the Schema_Annotation_Editor plugin. With this plugin loaded, the annotation editor will only offer the annotation types permitted by the currently loaded set of schemas, and when you select an annotation type only the features permitted by the schema are available to edit. Where a feature is declared as having an enumerated type the available enumeration values are presented as an array of buttons, making it easy to select the required value quickly.

To load an annotation schema use the resource tree and right-click on Language Resources or use the File menu then New language resource item.

See Section 3.21 for creating new annotation schemas.

3.20 [D] Saving annotations* [#]

The data can either be dumped out as a file (see Section 3.33) or saved in a data store (see Section 3.22).

3.21 [D,F] Create a New Annotation Schema [#]

GUI

An annotation schema file can be loaded or unloaded in GATE just like any other language resource. Once loaded into the system, the Schema Annotation Editor will use this definition when creating or editing annotations.

API

Another way to bring an annotation schema into GATE is through the creole.xml file. By using the AUTOINSTANCE element, one can create instances of resources defined in creole.xml. The gate.creole.AnnotationSchema class (the Java representation of an annotation schema file) initializes with some predefined annotation definitions (annotation schemas) as specified by the GATE team.

Example from GATE’s internal creole.xml (in src/gate/resources/creole):

<!-- Annotation schema -->  
<RESOURCE>  
  <NAME>Annotation schema</NAME>  
  <CLASS>gate.creole.AnnotationSchema</CLASS>  
  <COMMENT>An annotation type and its features</COMMENT>  
  <PARAMETER NAME="xmlFileUrl" COMMENT="The url to the definition file"  
    SUFFIXES="xml;xsd">java.net.URL</PARAMETER>  
  <AUTOINSTANCE>  
    <PARAM NAME ="xmlFileUrl" VALUE="schema/AddressSchema.xml" />  
  </AUTOINSTANCE>  
  <AUTOINSTANCE>  
    <PARAM NAME ="xmlFileUrl" VALUE="schema/DateSchema.xml" />  
  </AUTOINSTANCE>  
  <AUTOINSTANCE>  
    <PARAM NAME ="xmlFileUrl" VALUE="schema/FacilitySchema.xml" />  
  </AUTOINSTANCE>  
  <!-- etc. -->  
</RESOURCE>

In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one must use the gate.Factory class.

Eg:
FeatureMap params = Factory.newFeatureMap();
params.put("xmlFileUrl", annotSchemaFile.toURL());
AnnotationSchema annotSchema = (AnnotationSchema)
  Factory.createResource("gate.creole.AnnotationSchema", params);

Note: All the elements and their values must be written in lower case, as XML is defined as case sensitive and the parser used for XML Schema inside GATE is case sensitive.

In order to write XML Schema definitions, the ones defined in GATE (resources/creole/schema) can be used as a model, or the user can consult http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements used.

Some examples of annotation schemas are given in Section 6.4.1.

3.22 [D] Save and Restore LRs in Data Stores [#]

To save a text in a data store, a new data store must first be created if one does not already exist. Create a data store by right clicking on Data Store in the left hand pane, and select the option ”Create Data Store”. Select the data store type you wish to use. Create a directory to be used as the data store (note that the data store is a directory and not a file).

You can either save a whole corpus to the datastore (in which case the structure of the corpus will be preserved) or you can save individual documents. The recommended method is to save the whole corpus. To save a corpus, right click on the corpus name and select the ”Save to...” option (giving the name of the datastore created earlier). To save individual documents to the data store, right click on each document name and follow the same procedure.

To load a document from a data store, do not try to load it as a language resource. Instead, open the data store by right clicking on Data Store in the left hand pane, select “Open Data Store” and choose the data store to open. The data store tree will appear in the main window. Double click on a corpus or document in this tree to open it. To save a corpus and document back to the same datastore, simply select the ”Save” option.

See also the movie for creating a data store and the movie for loading corpus and documents from a data store.

3.23 [D] Save Resource Parameter State to File [#]

Resources, and applications that are made up of them, are created based on the settings of their parameters (see section 3.12). It is possible to save the data used to create an application to a file and re-load it later. To save the application to a file, right click on it in the resources tree and select “Save application state”, which will give you a file creation dialogue.

To restore the application later, select “Restore application from file” from the “File” menu.

Note that the data that is saved represents how to recreate an application – not the resources that make up the application itself. So, for example, if your application has a resource that initialises itself from some file (e.g. a grammar, a document) then that file must still exist when you restore the application.

If you do not want to save the corpus configuration associated with the application, you must select ’<none>’ in the corpus list of the application before saving it.

The file that results from saving the application state contains the values of the initialisation parameters for all the processing resources contained in the stored application. For parameters of type URL (which are typically used to select external resources such as grammars or rules files) a transformation is applied so that all the paths are relative to the location of the file used to store the state. This means that the resource files used by an application do not need to be in the same location as when the application was initially created, but rather in the same location relative to the application file. This allows the creation and deployment of portable applications by keeping the application file and the resource files used by the application together.

If you want to save your application along with all the resources it requires you can use the “Export for Teamware” option (see section 3.24).

See also the movie for saving and restoring applications.

3.24 [D] Save an application with its resources (e.g. GATE Teamware) [#]

When you save an application using the “Save application state” option (see section 3.23), the saved file contains references to the plugins that were loaded when the application was saved, and to any resource files required by the application. To be able to reload the file, these plugins and other dependencies must exist at the same locations (relative to the saved state file). While this is fine for saving and loading applications on a single machine it means that if you want to package your application to run it elsewhere (e.g. deploy it to a GATE Teamware installation) then you need to be careful to include all the resource files and plugins at the right locations in your package. The “Export for Teamware” option on the right-click menu for an application helps to automate this process.

When you export an application in this way, GATE produces a ZIP file containing the saved application state (in the same format as “Save application state”). Any plugins and resource files that the application refers to are also included in the zip file, and the relative paths in the saved state are rewritten to point to the correct locations within the package. The resulting package is therefore self-contained and can be copied to another machine and unpacked there, or passed to your Teamware Administrator for deployment.

As well as selecting the location where you want to save the package, the “Export for Teamware” option will also prompt you to select the annotation sets that your application uses for input and output. For example, if your application makes use of the unpacked XML markup in source documents and creates annotations in the default set then you would select “Original markups” as an input set and the “<Default annotation set>” as an output set. GATE will try to make an educated guess at the correct sets but you should check and amend the lists as necessary.

There are a few important points to note about the export process:

If you require more flexibility than this option provides you should read section C.2, which describes the underlying Ant task that the exporter uses.

3.25 [D,F] Perform Evaluation with the AnnotationDiff tool [#]

Section 13 describes the theory behind this tool.

The AnnotationDiff tool is activated by selecting it from the Tools menu at the top of the window. It will appear in a new window. Select the key and response documents to be used (note that both must have been previously loaded into the system), the annotation sets to be used for each, and the annotation type to be evaluated.

Note that the tool automatically intersects all the annotation types from the selected key annotation set with all types from the response set.

On a separate note, you can perform a diff on the same document, between two different annotation sets. One annotation set could contain the key type and another could contain the response one.

After the type has been selected, the user is required to decide how the features will be compared. It is important to know that the tool compares them by checking whether the features from the key set are contained in the response set. It checks that both the feature name and feature value are the same.

There are three basic options to select:

If false positives are to be measured, select the annotation type (and relevant annotation set) to be used as the denominator (normally, Token or Sentence). The weight for the F-Measure can also be changed; by default it is set to 0.5 (i.e. to give precision and recall equal weight). Finally, click on “Evaluate” to display the results. Note that the window may need to be resized manually, by dragging the window edges or internal bars as appropriate.

In the main window, the key and response annotations will be displayed. They can be sorted by any category by clicking on the relevant column header. The key and response annotations will be aligned if their indices are identical, and are color coded according to the legend displayed.

Precision, recall, F-measure and false positives are also displayed below the annotation tables, each according to 3 criteria - strict, lenient and average. See sections 13.1 and 13.4 for more details about the evaluation metrics.

The results can be saved to an HTML file by pressing the ”Export to HTML” button. This creates an HTML snapshot of what the AnnotationDiff interface shows at that moment. The columns and rows in the table will be shown in the same order, and hidden columns will not appear in the HTML file. The colours will also be the same.

3.26 [D] Use the Corpus Benchmark Evaluation tool [#]

The Corpus Benchmark tool can be run in two ways: standalone and GUI mode. Section 13.3 describes the theory behind this tool.

3.26.1 GUI mode

To use the tool in GUI mode, first make sure the properties of the tool have been set correctly (see section 3.26.2 for how to do this). Then select “Corpus Benchmark Tool” from the Options menu. There are 3 ways in which it can be run:

Once the mode has been selected, choose the directory where the corpus is to be found. The corpus must have a directory structure consisting of “clean” and “marked” subdirectories (note that these names are case sensitive). The clean directory should contain the raw texts; the marked directory should contain the human-annotated texts. Finally, select the application to be run on the corpus (for “default” and “human v current” modes).

If the tool is to be used in Default or Current mode, the corpus must first be processed with the current set of resources. This is done by selecting “Store corpus for future evaluation” from the Corpus Benchmark Tool. Select the corpus to be processed (from the top of the subdirectory structure, i.e. the directory containing the marked and stored subdirectories). If a “processed” subdirectory exists, the results will be placed there; if not, one will be created.

Once the corpus has been processed, the tool can be run in Default or Current mode. The resulting HTML file will be output in the main GATE messages window. This can then be pasted into a text editor and viewed in an Internet browser for easier viewing.

The tool can be used either in verbose or non-verbose mode, by selecting the verbose option from the menu. In verbose mode, any entity type whose score falls below the user’s pre-defined threshold (stored in the corpus_tool.properties file) will have its relevant annotations shown, thereby enabling the user to see where problems are occurring.

3.26.2 How to define the properties of the benchmark tool [#]

The properties of the benchmark tool are defined in the file corpus_tool.properties, which should be located in the directory from which GATE is run (usually gate/build or gate/bin).

The following properties should be set:

The default Annotation Set has to be represented by an empty String. Note also that outputSetName and annotSetName must be different. If they are the same, then use the Annotation Set Transfer PR to change one of them.

An example file is shown below:

threshold=0.7  
annotSetName=Key  
outputSetName=ANNIE  
annotTypes=Person;Organization;Location;Date;Address;Money  
annotFeatures=type;gender

3.27 [D] Write JAPE Grammars [#]

JAPE is a language for writing regular expressions over annotations, and for using patterns matched in this way as the basis for creating more annotations. JAPE rules compile into finite state machines. GATE’s built-in Information Extraction tools use JAPE (amongst other things). For information on JAPE see:
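To give a flavour of the language, here is a minimal sketch of a JAPE rule (not one of the built-in grammars) that reproduces the GoldFish example from section 3.10, assuming Token annotations have already been created by a tokeniser:

Phase: GoldFish
Input: Token
Options: control = appelt

Rule: GoldFishRule
(
  {Token.string == "fish"}
):match
-->
:match.GoldFish = {rule = "GoldFishRule"}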

3.28 [F] Embed NLE in other Applications [#]

Embedding GATE-based language processing in other applications is straightforward:

For example, this code will create the ANNIE extraction system:

  // initialise the GATE library  
  Gate.init();  
 
  // load ANNIE as an application from a gapp file  
  SerialAnalyserController controller = (SerialAnalyserController)  
    PersistenceManager.loadObjectFromFile(new File(new File(  
      Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),  
        ANNIEConstants.DEFAULT_FILE));

If you want to use resources from any plugins, you need to load the plugins before calling createResource:

  Gate.init();  
 
  // need Tools plugin for the Morphological analyser  
  Gate.getCreoleRegister().registerDirectories(  
    new File(Gate.getPluginsHome(), "Tools").toURL()  
  );  
 
  ...  
 
  ProcessingResource morpher = (ProcessingResource)  
    Factory.createResource("gate.creole.morph.Morph");

Instead of creating your processing resources individually using the Factory, you can create your application in the GUI, save it using the “save application state” option (see section 3.23), and then load the saved state from your code. This will automatically reload any plugins that were loaded when the state was saved; you do not need to load them manually.

  Gate.init();  
 
  CorpusController controller = (CorpusController)  
    PersistenceManager.loadObjectFromFile(new File("savedState.xgapp"));  
 
  // loadObjectFromUrl is also available

There are longer examples available at http://gate.ac.uk/gate-examples/doc/.

3.29 [F] Use GATE within a Spring application [#]

GATE provides helper classes to allow GATE resources to be created and managed by the Spring framework. For Spring 2.0 or later, GATE provides a custom namespace handler that makes them extremely easy to use. To use this namespace, put the following declarations in your bean definition file:

<beans xmlns="http://www.springframework.org/schema/beans"  
       xmlns:gate="http://gate.ac.uk/ns/spring"  
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
       xsi:schemaLocation="  
         http://www.springframework.org/schema/beans  
         http://www.springframework.org/schema/beans/spring-beans.xsd  
         http://gate.ac.uk/ns/spring  
         http://gate.ac.uk/ns/spring.xsd">

You can have Spring initialise GATE:

  <gate:init gate-home="WEB-INF" user-config-file="WEB-INF/user.xml">  
    <gate:preload-plugins>  
      <value>WEB-INF/ANNIE</value>  
      <value>http://example.org/gate-plugin</value>  
    </gate:preload-plugins>  
  </gate:init>

To create a GATE resource, use the <gate:resource> element.

  <gate:resource id="sharedOntology" scope="singleton"  
          resource-class="gate.creole.ontology.owlim.OWLIMOntologyLR">  
    <gate:parameters>  
      <entry key="rdfXmlURL">  
        <value type="org.springframework.core.io.Resource"  
          >WEB-INF/ontology.rdf</value>  
      </entry>  
    </gate:parameters>  
  </gate:resource>

If you are familiar with Spring you will see that <gate:parameters> uses the same format as the standard <map> element, but values whose type is a Spring Resource will be converted to URLs before being passed to the GATE resource.

You can load a GATE saved application with

  <gate:saved-application location="WEB-INF/application.gapp" scope="prototype">  
    <gate:customisers>  
      <gate:set-parameter pr-name="custom transducer" name="ontology"  
                          ref="sharedOntology" />  
    </gate:customisers>  
  </gate:saved-application>

”Customisers” are used to customise the application after it is loaded. In the example above, we load a singleton copy of an ontology which is then shared between all the separate instances of the (prototype) application. The <gate:set-parameter> customiser accepts all the same ways to provide a value as the standard Spring <property> element (a ”value” or ”ref” attribute, or a sub-element - <value>, <list>, <bean>, <gate:resource> …).

The <gate:add-pr> customiser provides support for the case where most of the application is in a saved state, but we want to create one or two extra PRs with Spring (maybe to inject other Spring beans as init parameters) and add them to the pipeline.

  <gate:saved-application ...>  
    <gate:customisers>  
      <gate:add-pr add-before="OrthoMatcher" ref="myPr" />  
    </gate:customisers>  
  </gate:saved-application>

By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline, but an add-before or add-after attribute can be used to specify the name of a PR before (or after) which this PR should be placed. Alternatively, an index attribute places the PR at a specific (0-based) index into the pipeline. The PR to add can be specified either as a ”ref” attribute, or with a nested <bean> or <gate:resource> element.

These custom elements all define various factory beans. For full details, see the JavaDocs for gate.util.spring (the factory beans) and gate.util.spring.xml (the gate: namespace handler).

Note: the former approach using factory methods of the gate.util.spring.SpringFactory class will still work, but should be considered deprecated in favour of the new factory beans.

3.30 [F] Use GATE within a Tomcat Web Application [#]

Embedding GATE in a Tomcat web application involves several steps.

  1. Put the necessary JAR files (gate.jar and all or most of the jars in gate/lib) in your webapp/WEB-INF/lib.
  2. Put the plugins that your application depends on in a suitable location (e.g. webapp/WEB-INF/plugins).
  3. Create suitable gate.xml configuration files for your environment.
  4. Set the appropriate paths in your application before calling Gate.init().

This process is detailed in the following sections.

3.30.1 Recommended Directory Structure

You will need to create a number of other files in your web application to allow GATE to work:

In this guide, we assume the following layout:

webapp/  
  WEB-INF/  
    gate.xml  
    user-gate.xml  
    plugins/  
      ANNIE/  
      etc.

3.30.2 Configuration files

Your gate.xml (the “site-wide configuration file”) should be as simple as possible:

<?xml version="1.0" encoding="UTF-8" ?>  
<GATE>  
  <GATECONFIG Save_options_on_exit="false"  
              Save_session_on_exit="false" />  
</GATE>

Similarly, keep the user-gate.xml (the “user config file”) simple:

<?xml version="1.0" encoding="UTF-8" ?>  
<GATE>  
  <GATECONFIG Known_plugin_path=";"  
              Load_plugin_path=";" />  
</GATE>

This way, you can control exactly which plugins are loaded in your webapp code.

3.30.3 Initialization code

Given the directory structure shown above, you can initialize GATE in your web application like this:

// imports  
...  
public class MyServlet extends HttpServlet {  
  private static boolean gateInited = false;  
 
  public void init() throws ServletException {  
    if(!gateInited) {  
      try {  
        ServletContext ctx = getServletContext();  
 
        // use /path/to/your/webapp/WEB-INF as gate.home  
        File gateHome = new File(ctx.getRealPath("/WEB-INF"));  
 
        Gate.setGateHome(gateHome);  
        // thus webapp/WEB-INF/plugins is the plugins directory, and  
        // webapp/WEB-INF/gate.xml is the site config file.  
 
        // Use webapp/WEB-INF/user-gate.xml as the user config file, to avoid  
        // confusion with your own user config.  
        Gate.setUserConfigFile(new File(gateHome, "user-gate.xml"));  
 
        Gate.init();  
        // load plugins, for example...  
        Gate.getCreoleRegister().registerDirectories(  
          ctx.getResource("/WEB-INF/plugins/ANNIE"));  
 
        gateInited = true;  
      }  
      catch(Exception ex) {  
        throw new ServletException("Exception initialising GATE",  
                                   ex);  
      }  
    }  
  }  
}

Once initialized, you can create GATE resources using the Factory in the usual way (see section 3.28 for an example of how to create an ANNIE application). You should also read section 3.31 for important notes on using GATE in a multithreaded application.

Instead of an initialization servlet you could also consider doing your initialization in a ServletContextListener, or using Spring (see section 3.29).
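A minimal sketch of the listener-based approach follows. The class name is illustrative, and the listener would need to be declared in your web.xml; the GATE calls are the same as in the servlet example above.

import java.io.File;
import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import gate.Gate;

public class GateInitListener implements ServletContextListener {

  public void contextInitialized(ServletContextEvent event) {
    try {
      ServletContext ctx = event.getServletContext();
      // use /path/to/your/webapp/WEB-INF as gate.home, as before
      File gateHome = new File(ctx.getRealPath("/WEB-INF"));
      Gate.setGateHome(gateHome);
      Gate.setUserConfigFile(new File(gateHome, "user-gate.xml"));
      Gate.init();
      // load the plugins the application needs
      Gate.getCreoleRegister().registerDirectories(
          ctx.getResource("/WEB-INF/plugins/ANNIE"));
    }
    catch(Exception ex) {
      throw new RuntimeException("Exception initialising GATE", ex);
    }
  }

  public void contextDestroyed(ServletContextEvent event) {
    // nothing to clean up in this sketch
  }
}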

3.31 [F] Use GATE in a Multithreaded Environment [#]

GATE can be used in multithreaded applications, so long as you observe a few restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your application, typically in the application startup phase before any concurrent processing threads are started.

Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading plugins) in more than one thread at a time. Again, you would typically load all the plugins your application requires at initialisation time. It is safe to create instances of resources in multiple threads concurrently.

Thirdly, it is important to note that individual GATE processing resources, language resources and controllers are by design not thread safe – it is not possible to use a single instance of a controller/PR/LR in multiple threads at the same time – but for a well written resource it should be possible to use several different instances of the same resource at once, each in a different thread. When writing your own resource classes, the main thing to bear in mind is to keep all per-document and per-run state in instance fields rather than in mutable static data shared between instances, so that independent instances can safely run concurrently and your resource will be usable in this way.

Of course, if you are writing a PR that is simply a wrapper around an external library that imposes these kinds of limitations there is only so much you can do. If your resource cannot be made safe you should document this fact clearly.

All the standard ANNIE PRs are safe when independent instances are used in different threads concurrently, as are the standard transient document, transient corpus and controller classes. A typical pattern of development for a multithreaded GATE-based application is therefore to build and test the pipeline in the GUI, save it (together with its resources, see section 3.24) as a .gapp file, and then have the embedding code load an independent copy of the saved application for each thread that needs one.
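A minimal sketch of that last step is shown below; the .gapp path is illustrative, and GATE is assumed to have been initialised already as described in section 3.28.

import java.io.File;
import gate.CorpusController;
import gate.util.persistence.PersistenceManager;

public class PipelinePerThread {

  // Each call loads a completely independent copy of the saved application,
  // so the returned controller (and its PRs) can be used by one thread
  // without interfering with the copies used by other threads.
  public static CorpusController loadPipelineCopy() throws Exception {
    return (CorpusController) PersistenceManager.loadObjectFromFile(
        new File("/path/to/application.gapp"));
  }
}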

3.32 [D,F] Add support for a new document format [#]

In order to add a new document format, one needs to extend the gate.DocumentFormat class and implement its abstract method:

public void unpackMarkup(Document doc) throws DocumentFormatException

This method implements the functionality of each format reader and creates annotations on the document. Finally, the document’s old content is replaced with a new one containing only the text between markups (see the GATE API documentation for more details on this method).

To add a new textual reader, one should extend gate.corpora.TextualDocumentFormat and override the unpackMarkup(doc) method.

This class needs to conform to the Java Beans specification, because it will be instantiated by GATE using the Factory.createResource() method.

The init() method that one needs to implement is very important, because this is where the reader registers the information that allows GATE to select it. What one needs to do is add some specific information to certain static maps defined in the DocumentFormat class; these maps are consulted at reader detection time.

After that, a definition of the reader is placed in one’s creole.xml file and the reader becomes available to GATE.

The rest of this section presents a complete three-step example of adding such a reader; the reader described here is an XML reader.

Step 1

Create a new class called XmlDocumentFormat that extends
gate.corpora.TextualDocumentFormat.
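The outline of the class is simply the following (a sketch; the package name mypackage matches the creole.xml entry in Step 3):

package mypackage;

import gate.Document;
import gate.Resource;
import gate.corpora.TextualDocumentFormat;
import gate.creole.ResourceInstantiationException;
import gate.util.DocumentFormatException;

public class XmlDocumentFormat extends TextualDocumentFormat {

  public Resource init() throws ResourceInstantiationException {
    // MIME type, suffix and magic-number registration goes here (see Step 2).
    return this;
  }

  public void unpackMarkup(Document doc) throws DocumentFormatException {
    // Create annotations from the markup and strip it from the content here.
  }
}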

Step 2

Implement the unpackMarkup(Document doc) method, which performs the required functionality for the reader, and add the XML detection code to the init() method:

public Resource init() throws ResourceInstantiationException{
  // Register XML mime type
  MimeType mime = new MimeType("text","xml");
  // Register the class handler for this mime type
  mimeString2ClassHandlerMap.put(mime.getType() + "/" + mime.getSubtype(),
                                 this);
  // Register the mime type with its mime string
  mimeString2mimeTypeMap.put(mime.getType() + "/" + mime.getSubtype(), mime);
  // Register file suffixes for this mime type
  suffixes2mimeTypeMap.put("xml",mime);
  suffixes2mimeTypeMap.put("xhtm",mime);
  suffixes2mimeTypeMap.put("xhtml",mime);
  // Register magic numbers for this mime type
  magic2mimeTypeMap.put("<?xml",mime);
  // Set the mimeType for this language resource
  setMimeType(mime);
  return this;
}// init()

More details about the information in these maps can be found in Section 6.5.1.

Step 3

Add the following resource definition to the creole.xml file:

    <RESOURCE>  
      <NAME>My XML Document Format</NAME>  
      <CLASS>mypackage.XmlDocumentFormat</CLASS>  
      <AUTOINSTANCE/>  
      <PRIVATE/>  
    </RESOURCE>

More information on the operation of GATE’s document format analysers may be found in section 6.5.

3.33 [D] Dump Results to File [#]

There are three main ways to dump out the results of, for example, some language analysis or Information Extraction process running over documents:

  1. preserving the original document format, with optional added annotations;
  2. in GATE’s own XML serialisation format (including all the annotations on the document);
  3. by writing your own dump algorithm as a ProcessingResource.

This section describes how to use the first two options.

Both types of data export are available in the popup menu triggered by right-clicking on a document in the resources tree (see section 3.6): type 1 is called ‘save preserving format’ and type 2 is called ‘save as XML’.

Selecting the save as XML option leads to a file open dialogue; give the name of the file you want to create, and the whole document and all its data will be exported to that file. If you later create a document from that file, the state will be restored. (Note: because GATE’s annotation model is richer than that of XML, and because our XML dump implementation sometimes cuts corners6, the state may not be identical after restoration. If your intention is to store the state for later use, use a DataStore instead.)

The save preserving format option also leads to a file dialogue; give a name and the data you require will be dumped into the file. The difference is that the file will preserve the original format of the source document. You can add annotations to the dump file: if there is a document viewer open in the main resource viewer area (see section 3.6), then any annotations that are selected (i.e. are visible in the table at the bottom of the viewer) will be included in the output dump. This is the best way to use the system to add markup based on some analysis process: select those annotations in the document viewer, save preserving format and you will have a file identical to the original source document with just the annotations you selected added. By default, the added annotations will contain no feature data; if you want the process to also dump features, set the ‘Include annotation features...’ option in the advanced options dialogue (see section 3.7). Note that GATE’s model of annotation allows graph structures, which are difficult to represent in XML (XML is a tree-structured representation format). During the dump process, annotations that cross each other in ways that can’t be represented straightforwardly in XML will be discarded, and a warning message printed.
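Both kinds of output can also be produced from code via the gate.Document interface: toXml() returns GATE’s own XML serialisation of the whole document, while toXml(annotations, includeFeatures) aims to preserve the original format with the given annotations inserted. A minimal sketch (the output path handling is illustrative):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import gate.Document;

public class DumpToFile {

  // Write the document, with all its annotations, in GATE's XML format.
  public static void saveAsXml(Document doc, File out) throws Exception {
    Writer w = new OutputStreamWriter(new FileOutputStream(out), "UTF-8");
    try {
      w.write(doc.toXml());
    }
    finally {
      w.close();
    }
  }
}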

3.34 [D] Stop GUI ‘Freezing’ on Linux [#]

There is a problem with some versions of Linux that causes the GUI to appear to freeze. The problem occurs when you take some action, like loading a resource or browsing for a file, that pops up a dialogue box. This box sometimes fails to appear in a visible area of the screen, at which point the rest of the GUI waits for you to do something intelligent with the dialogue box, while you wait for the GUI to do something. This is an excellent feature for those without tight deadlines to meet, and the best solution is to stop work and go home for a long while. Alternatively, you can play ‘hunt the dialogue box’.

This feature is available totally free of charge.

3.35 [D] Stop GUI Crashing on Linux [#]

On some configurations of Red Hat 7.0 the GUI crashes on startup. The solution is to limit the initial stack size prior to launch: ulimit -s 2048.

3.36 [D] Stop GATE Restoring GUI Sessions/Options [#]

GATE will remember GUI options and the state of the resource tree when it exits. The options are saved by default; the session state is not saved by default. This default behaviour can be changed from the “Advanced” tab of the “Configuration” choice on the “Options” menu.

If a problem occurs and the saved data prevents GATE from starting, you can fix it by deleting the configuration and session data files. These are stored in your home directory, and are called gate.xml and gate.session or .gate.xml and .gate.session depending on platform. On Windoze your home is:

95, 98, NT:
Windows Directory/profiles/username
2000, XP:
Windows Drive/Documents and Settings/username

3.37 Work with Unicode [#]

GATE provides various facilities for working with Unicode beyond those that come as default with Java7:

  1. a Unicode editor with input methods for many languages;
  2. use of the input methods in all places where text is edited in the GUI;
  3. a development kit for implementing input methods;
  4. ability to read diverse character encodings.

1 using the editor:
In the GUI, select ‘Unicode editor’ from the ‘Tools’ menu. This will display an editor window, and, when a language with a custom input method is selected for input (see next section), a virtual keyboard window with the characters of the language assigned to the keys on the keyboard. You can enter data either by typing as normal, or with mouse clicks on the virtual keyboard.

2 configuring input methods:
In the editor and in GATE’s main window, the ‘Options’ menu has an ‘Input methods’ choice. All supported input languages (a superset of the JDK languages) are available here. Note that you need to use a font capable of displaying the language you select. By default GATE will choose a Unicode font if it can find one on the platform you’re running on. Otherwise, select a font manually from the ‘Options’ menu ‘Configuration’ choice.

3 using the development kit:
GUK, the GATE Unicode Kit, is documented at http://gate.ac.uk/gate/doc/javadoc/guk/package-summary.html.

4 reading different character encodings:
When you create a document from a URL pointing to textual data in GATE, you have to tell the system what character encoding the text is stored in. By default, GATE will set this parameter to be the empty string. This tells Java to use the default encoding for whatever platform it is running on at the time – e.g. on Western versions of Windoze this will be ISO-8859-1, and Eastern ones ISO-8859-9. A popular way to store Unicode documents is in UTF-8, which is a superset of ASCII (but can still store all Unicode data); if you get an error message about document I/O during reading, try setting the encoding to UTF-8, or some other locally popular encoding. (To see a list of available encodings, try opening a document in GATE’s unicode editor – you will be prompted to select an encoding.)
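When creating documents from code, the encoding is just another parameter. A minimal sketch (the URL is illustrative, and Gate.init() is assumed to have been called already):

import java.net.URL;
import gate.Document;
import gate.Factory;

public class CreateUtf8Document {

  // Create a document from a URL, explicitly requesting UTF-8 rather than
  // the platform's default encoding.
  public static Document create() throws Exception {
    return Factory.newDocument(new URL("http://gate.ac.uk/index.html"), "UTF-8");
  }
}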

3.38 Work with Oracle and PostgreSQL [#]

GATE’s Oracle layer is documented separately in http://gate.ac.uk/gate/doc/persistence.pdf. Note that running an Oracle installation is not for the faint-hearted!

GATE version 2.2 has been adapted to work with PostgreSQL 7.3; compatibility with PostgreSQL 7.2 has been preserved. Since version 7.3 the Postgres server no longer downcasts from int4 to int2 automatically. However, the JDBC drivers seem to have a bug and send SMALLINT (aka INT2) parameters as INT (aka INT4). This causes some stored procedures (i.e. all that have input parameters of type INT2) not to be recognised. We have fixed this problem by modifying the stored procedures to expose the parameters as INT4 and to manually downcast them inside the stored procedure body.

Please note also the following:

PostgreSQL 7.3 refuses to index values larger than 8Kb/3 (2730 bytes). Previous versions probably did the same, but without raising an exception.

The only case where such a situation can occur in GATE is when a feature has a TEXTUAL value larger than 2730 bytes. This will be signalled by an exception being raised about the value being too large for the index.

To ‘solve’ this, one can remove the index on the ft_character_value field of the t_feature table. This will have the usual effects of removing an index (the inability to perform efficient searches).

See the example code at http://gate.ac.uk/gate-examples/doc/.

3.39 Annotate using ontologies [#]

This section deals especially with ontologies: how to create/edit them, make Annotation Schemas out of them, and annotate texts automatically with respect to these schemas.

You can load the ontology tools via the plugin manager from the File menu and then Manage CREOLE plugins. Select the Ontology Tools plugin and these tools will be available to you as Processing Resources in the usual way.

Ontology Editor If you double-click on a loaded ontology in the left resources tree in GATE, it will be loaded in the main window. For more information, see section 10.4.

OntoGazetteer The OntoGazetteer is a Processing Resource in which lists of instances of ontology concepts can be loaded. For the OntoGazetteer to function properly, it needs one or more .lst files and two .def files, usually called mappings.def and lists.def.

The .lst file is simply a file with an instance on every line. For instance persons.lst will have, on every line, the name of a person.

The mappings.def file describes the relations between the .lst files and the ontology concepts. The format is .lst file:ontology file:ontology concept.

The lists.def file states the relations between the .lst files and the annotation feature that the OntoGazetteer should generate. The format is: .lst file:feature.

The OntoGazetteer, when run, generates annotations for every instance mentioned in the .lst files. All these instances are annotated with the same annotation type, ‘Lookup’ (in the ‘default’ Annotation Set). Every Lookup annotation has a majorType feature, whose value differs per .lst file.

This is not very useful by itself, because every concept from the ontology gets the same annotation type (namely ‘Lookup’). However, since the ‘majorType’ feature differs, we can process the annotations further, and to do that we need a Jape Transducer.

Jape Transducer A Jape transducer is a Processing Resource for manipulating annotations. For instance, the annotations generated by the OntoGazetteer can be converted into distinct annotation types. For this, the Jape Transducer needs a Jape grammar, usually stored in a .jape file. A Jape grammar describes which annotations should be changed, and how. See Chapter 7 for further reference on Jape rules.

Creating a pipeline Once a Processing Resource is loaded, it can be included in a pipeline. To do so, right click on ’Applications’ in the left resources tree, and choose one of the appropriate options. A particularly useful one is the ’Corpus Pipeline’ that lets you run an application on an entire corpus. A pipeline is created by dragging the Processing Resources into it. They will be run one after another.

NOTE that in order to run the OntoGazetteer, only the OntoGazetteer needs to be selected, and not the Hash Gazetteer.

How to generate annotations automatically Now that all the building blocks have been discussed, let us describe how to construct something useful out of them.

Our aim is to have an Annotation Set that contains the concepts from the ontology. First, we make an OntoGazetteer, built from our ontology and the lists.def and mappings.def files. Suppose we run the OntoGazetteer on our corpus. This would produce, as described above, ‘Lookup’ annotations with a majorType feature set to, for example, ‘Department’.

Now we want a Jape rule that turns this annotation into an annotation of type ‘Department’: it should select every annotation with ‘Department’ as its majorType and convert it. This is the rule that does it:

Rule: departmentsRule  
(  
{Lookup.majorType == Department}  
):departmentslabel  
-->  
:departmentslabel.Department = {rule = "departmentsRule"}

That is all we need. We build a corpus pipeline containing first the OntoGazetteer and then the Jape Transducer; when it is run, the corpus is annotated automatically.

Chapter 4
CREOLE: the GATE Component Model [#]

…Noam Chomsky’s answer in Secrets, Lies and Democracy (David Barsamian 1994; Odonian) to “What do you think about the Internet?”

“I think that there are good things about it, but there are also aspects of it that concern and worry me. This is an intuitive response – I can’t prove it – but my feeling is that, since people aren’t Martians or robots, direct face-to-face contact is an extremely important part of human life. It helps develop self-understanding and the growth of a healthy personality.

“You just have a different relationship to somebody when you’re looking at them than you do when you’re punching away at a keyboard and some symbols come back. I suspect that extending that form of abstract and remote relationship, instead of direct, personal contact, is going to have unpleasant effects on what people are like. It will diminish their humanity, I think.”

Chomsky, quoted at http://photo.net/wtr/dead-trees/53015.htm.

The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun. Section 4.2 describes Java’s component model, Java Beans; section 4.3 describes GATE’s extended Beans model.

GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.

GATE components are one of three types:

LanguageResources (LRs)
represent entities such as lexicons, corpora or ontologies;
ProcessingResources (PRs)
represent entities that are primarily algorithmic, such as parsers, generators or taggers;
VisualResources (VRs)
represent visualisation and editing components that participate in GUIs.

Section 4.4 discusses the distinction between Language Resources and Processing Resources. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.

The rest of this chapter expands on these ideas: section 4.1 covers how CREOLE resources are distributed and configured; sections 4.2 and 4.3 cover the Java Beans model and GATE’s extension of it; section 4.4 discusses the LR/PR distinction; section 4.5 describes the lifecycle of a CREOLE resource; section 4.6 covers processing resources and applications; section 4.7 covers language resources and datastores; section 4.8 lists the built-in resources; and section 4.9 explains CREOLE resource configuration.

4.1 The Web and CREOLE [#]

GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations and XML for configuration of resources (and GATE itself).

Resource implementations are grouped together as “plugins”, stored at a URL (when the resources are in the local file system this can be a file:/ URL). When a plugin is loaded, GATE looks for a configuration file called creole.xml relative to the plugin URL and uses the contents of this file to determine what resources this plugin declares and where to find the classes that implement the resource types (typically these classes are stored in a JAR file in the plugin directory). Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.

Language resource data can be stored in binary serialised form in the local file system, or in an RDBMS like Oracle. In the latter case, communication with the database is over JDBC1, allowing the data to be located anywhere on the network (or anywhere you can get Oracle running, that is!).

4.2 Java Beans: a Simple Component Architecture [#]

All GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions. These conventions allow development tools such as GATE, or Borland JBuilder, to manipulate software components without knowing very much about them. The advantage of this is that users of such systems can extend them in diverse ways without having to touch the underlying core of the development tools.

The key parts of the Java Beans specification as used in GATE are:

  1. a public no-argument constructor, so that beans can be instantiated generically (in GATE’s case, by the Factory);
  2. properties exposed through paired get and set accessor methods, which GATE uses to represent resource parameters.

The rest of this section says a little more about the Beans specification; skip to the next if you’re only interested in how it works in GATE.

Quoting from [Campione et al. 98] at Sun’s Java website:

The JavaBeans API makes it possible to write component software in the Java programming language. Components are self-contained, reusable software units that can be visually composed into composite components, applets, applications, and servlets using visual application builder tools. JavaBean components are known as Beans.

In this context we may think of the GATE development environment as a ‘builder tool’. While the emphasis in the quoted text is on visual representation of components, note that GATE (and other) beans can also be plugged together ‘invisibly’; this is what the framework does and how GATE beans are typically deployed into other applications.

Components expose their features (for example, public methods and events) to builder tools for visual manipulation. A Bean’s features are exposed because feature names adhere to specific design patterns. A JavaBeans-enabled builder tool can then examine the Bean’s patterns, discern its features, and expose those features for visual manipulation. A builder tool maintains Beans in a palette or toolbox. You can select a Bean from the toolbox, drop it into a form, modify its appearance and behavior, define its interaction with other Beans, and compose it and other Beans into an applet, application, or new Bean. All this can be done without writing a line of code.

In GATE you develop sets of beans that do language processing tasks and then the framework wires them together without any code from you.

The next section describes GATE’s extended beans model.

4.3 The GATE Framework [#]

We can think of the GATE framework as a backplane into which beans-based CREOLE components plug. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system.

The backplane performs these functions:

A set of components plus the framework is a deployment unit which can be embedded in another application.

The key task of the development environment is to facilitate constructing components, and viewing and measuring their results.

4.4 Language Resources and Processing Resources [#]

This section describes in more detail the Language Resource and Processing Resource terminology introduced earlier. If you’re happy with these terms you can safely skip this section.

Like other software, LE programs consist of data and algorithms. The current orthodoxy in software development is to model both data and algorithms together, as objects2. Systems that adopt the new approach are referred to as Object-Oriented (OO), and there are good reasons to believe that OO software is easier to build and maintain than other varieties [Booch 94, Yourdon 96].

In the domain of human language processing R&D, however, the terminology is a little more complex. Language data, in various forms, is of such significance in the field that it is frequently worked on independently of the algorithms that process it. For example: a treebank3 can be developed independently of the parsers that may later be trained from it; a thesaurus can be developed independently of the query expansion or sense tagging mechanisms that may later come to use it. This type of data has come to have its own term, Language Resources (LRs) [LREC-1 98], covering many data sources, from lexicons to corpora.

In recognition of this distinction, we will adopt the following terminology:

Language Resource (LR):
refers to data-only resources such as lexicons, corpora, thesauri or ontologies. Some LRs come with software (e.g. Wordnet has both a user query interface and C and Prolog APIs), but where this is only a means of accessing the underlying data we will still define such resources as LRs.
Processing Resource (PR):
refers to resources whose character is principally programmatic or algorithmic, such as lemmatisers, generators, translators, parsers or speech recognisers. For example, a part-of-speech tagger is best characterised by reference to the process it performs on text. PRs typically include LRs, e.g. a tagger often has a lexicon; a word sense disambiguator uses a dictionary or thesaurus.

Additional terminology worthy of note in this context: language data refers to LRs which are at their core examples of language in practice, or ‘performance data’, e.g. corpora of texts or speech recordings (possibly including added descriptive information as markup); data about language refers to LRs which are purely descriptive, such as a grammar or lexicon.

PRs can be viewed as algorithms that map between different types of LR, and which typically use LRs in the mapping process. An MT engine, for example, maps a monolingual corpus into a multilingual aligned corpus using lexicons, grammars, etc.4

Further support for the PR/LR terminology may be gleaned from the argument in favour of declarative data structures for grammars, knowledge bases, etc. This argument was current in the late 1980s and early 1990s [Gazdar & Mellish 89], partly as a response to what has been seen as the overly procedural nature of previous techniques such as augmented transition networks. Declarative structures represent a separation between data about language and the algorithms that use the data to perform language processing tasks; a similar separation to that used in GATE.

Adopting the PR/LR distinction is a matter of conforming to established domain practice and terminology. It does not imply that we cannot model the domain (or build software to support it) in an Object-Oriented manner; indeed the models in GATE are themselves Object-Oriented.

4.5 The Lifecycle of a CREOLE Resource [#]

CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using the development environment, resources can be loaded and viewed via the resources tree (left pane) and the “create resource” mechanism. When programming with the framework, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see chapter 8) does; in this case you will use the GUI to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.

The various phases may be summarised as:

Creating a new resource from scratch (bootstrapping).
To create the binary image of a resource (a Java class in a JAR file), and the XML file that describes the resource to GATE, you need to create the appropriate .java file(s), compile them and package them as a .jar. The GATE development environment provides a bootstrap tool to start this process – see section 3.10. Alternatively you can simply copy code from an existing resource.
Instantiating a resource in the framework.
To create a resource in your own Java code, use GATE’s Factory class (this takes care of parameterising the resource, restoring it from a database where appropriate, etc.). Section 3.11 describes how to do this, and a brief sketch is given after this list.
Loading a resource in the development environment.
To load a resource in the development environment, use the various “New ... resource” options from the File menu and elsewhere. See section 3.12.
Resource configuration and implementation.
GATE’s bootstrap tool will create an empty resource that does nothing. In order to achieve the behaviour you require, you’ll need to change the configuration of the resource (by editing the creole.xml file) and/or change the Java code that implements the resource. See section 4.9.
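As a brief illustration of the instantiation phase, the following sketch creates a transient GATE document via the Factory; the URL is illustrative, and Gate.init() is assumed to have been called already.

import java.net.URL;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;

public class CreateResourceExample {

  public static Document createDocument() throws Exception {
    // Parameters are passed to the Factory as a FeatureMap keyed by
    // parameter name (here the sourceUrl parameter of DocumentImpl).
    FeatureMap params = Factory.newFeatureMap();
    params.put("sourceUrl", new URL("http://gate.ac.uk/"));
    return (Document) Factory.createResource("gate.corpora.DocumentImpl",
                                             params);
  }
}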

More details of the specifics of tasks related to these phases are available in chapter 3.

4.6 Processing Resources and Applications [#]

PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In the framework applications are called ‘controllers’ accordingly.

Currently only sequential, or pipeline, execution is supported. There are two main types of pipeline:

Simple pipelines
simply group a set of PRs together in order and execute them in turn. The implementing class is called SerialController.
Corpus pipelines
are specific for LanguageAnalysers – PRs that are applied to documents and corpora. A corpus pipeline opens each document in the corpus in turn, sets that document as a runtime parameter on each PR, runs all the PRs on the corpus, then closes the document. The implementing class is called SerialAnalyserController.

Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.15 for how to use these.

There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process are simply ignored, and execution moves to the next document once the timeout interval has lapsed.
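A sketch of creating such a controller from code follows; the class name gate.creole.RealtimeCorpusController and the parameter name timeout are stated here as assumptions, so check the JavaDoc for your GATE version.

import gate.CorpusController;
import gate.Factory;
import gate.FeatureMap;

public class RealtimePipelineExample {

  public static CorpusController create() throws Exception {
    FeatureMap params = Factory.newFeatureMap();
    // Assumed parameter name; maximum time per document, in milliseconds.
    params.put("timeout", Long.valueOf(60000));
    // Assumed class name for the real-time corpus pipeline.
    return (CorpusController) Factory.createResource(
        "gate.creole.RealtimeCorpusController", params);
  }
}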

All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application’s execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.
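As an illustration, a PR that gathers a corpus-level statistic might implement the interface roughly as follows (a sketch; the counting logic is illustrative and the method names are those of ControllerAwarePR):

import gate.Controller;
import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ControllerAwarePR;
import gate.creole.ExecutionException;

public class CountingPR extends AbstractLanguageAnalyser
                        implements ControllerAwarePR {

  private int docsProcessed;

  public void controllerExecutionStarted(Controller c)
      throws ExecutionException {
    docsProcessed = 0;  // reset before the first document is processed
  }

  public void execute() throws ExecutionException {
    docsProcessed++;    // called once per document by a corpus pipeline
  }

  public void controllerExecutionFinished(Controller c)
      throws ExecutionException {
    System.out.println("Processed " + docsProcessed + " documents");
  }

  public void controllerExecutionAborted(Controller c, Throwable t)
      throws ExecutionException {
    // clean up after a failed run
  }
}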

4.7 Language Resources and Datastores [#]

Language Resources can be stored in Data Stores. Data Stores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:

Serial Data Stores
are based on Java’s serialisation system, and store data directly into files and directories.
Lucene Data Stores
provide a full-featured annotation indexing and retrieval facility, implemented as an extension of the Serial Data Stores. See section 9.29 for more details.
Oracle Data Stores
store data into an Oracle RDBMS. For details of how to set up an Oracle DB for GATE, see http://gate.ac.uk/gate/doc/persistence.pdf.
PostgreSQL Data Stores
store data into a PostgreSQL RDBMS. For details of how to set up a PostgreSQL DB for GATE, see http://gate.ac.uk/gate/doc/persistence.pdf.

4.8 Built-in CREOLE Resources [#]

GATE comes with various built-in components:

4.9 CREOLE Resource Configuration [#]

This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory containing an XML configuration file called creole.xml. Configuration data for the plugin’s resources can be given in the creole.xml file or directly in the Java source file using Java 5 annotations.

A creole.xml file has a root element <CREOLE-DIRECTORY>, but the further contents of this element depend on the configuration style. The following three sections discuss the different styles – all-XML, all-annotations and a mixture of the two.

4.9.1 Configuration with XML [#]

To configure your resources in the creole.xml file, the <CREOLE-DIRECTORY> element should contain one <RESOURCE> element for each resource type in the plugin. The <RESOURCE> elements may optionally be contained within a <CREOLE> element (to allow a single creole.xml file to be built up by concatenating multiple separate files). For example:

<CREOLE-DIRECTORY>  
 
<CREOLE>  
  <RESOURCE>  
    <NAME>Minipar Wrapper</NAME>  
    <JAR>MiniparWrapper.jar</JAR>  
    <CLASS>minipar.Minipar</CLASS>  
    <COMMENT>MiniPar is a shallow parser. It determines the  
    dependency relationships between the words of a sentence.</COMMENT>  
    <HELPURL>http://gate.ac.uk/cgi-bin/userguide/sec:misc-creole:minipar</HELPURL>  
    <PARAMETER NAME="document"  
  RUNTIME="true"  
  COMMENT="document to process">gate.Document</PARAMETER>  
    <PARAMETER NAME="miniparDataDir"  
        RUNTIME="true"  
        COMMENT="location of the Minipar data directory">  
        java.net.URL  
    </PARAMETER>  
    <PARAMETER NAME="miniparBinary"  
        RUNTIME="true"  
        COMMENT="Name of the Minipar command file">  
        java.net.URL  
    </PARAMETER>  
    <PARAMETER NAME="annotationInputSetName"  
        RUNTIME="true"  
        OPTIONAL="true"  
        COMMENT="Name of the input Source">  
        java.lang.String  
    </PARAMETER>  
    <PARAMETER NAME="annotationOutputSetName"  
        RUNTIME="true"  
        OPTIONAL="true"  
        COMMENT="Name of the output AnnotationSetName">  
        java.lang.String  
    </PARAMETER>  
    <PARAMETER NAME="annotationTypeName"  
        RUNTIME="false"  
        DEFAULT="DepTreeNode"  
        COMMENT="Annotations to store with this type">  
        java.lang.String  
    </PARAMETER>  
  </RESOURCE>  
</CREOLE>  
</CREOLE-DIRECTORY>

Basic resource-level data

Each resource must give a name, a Java class and the JAR file that it can be loaded from. The above example is taken from the Minipar plugin in GATE, and defines a single resource with a number of parameters.

The full list of valid elements under <RESOURCE> is as follows:

NAME
the name of the resource, as it will appear in the “New” menu in the GATE GUI. If omitted, defaults to the bare name of the resource class (without a package name).
CLASS
the fully qualified name of the Java class that implements this resource.
JAR
names JAR files required by this resource (paths are relative to the location of creole.xml). Typically this will be the JAR file containing the class named by the <CLASS> element, but additional <JAR> elements can be used to name third-party JAR files that the resource depends on.
COMMENT
a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in the GUI. If omitted, no comment is used.
HELPURL
a URL to a help document on the web for this resource. It is used in the help browser inside GATE.
INTERFACE
the interface type implemented by this resource, for example new types of document would specify <INTERFACE>gate.Document</INTERFACE>.
ICON
the icon used to represent this resource in the GATE GUI. This is a path inside the plugin’s JAR file, for example <ICON>/some/package/icon.png</ICON>. If the path specified does not start with a forward slash, it is assumed to name an icon from the GATE default set, which is located in gate.jar at gate/resources/img. If no icon is specified, a generic language resource or processing resource icon (as appropriate) is used.
PRIVATE
if present, this resource type is hidden in the GUI, i.e. it is not shown in the “New” menus. This is useful for resource types that are intended to be created internally by other resources, or for resources that have parameters of a type that cannot be set in the GUI. <PRIVATE/> resources can still be created in Java code using the Factory.
AUTOINSTANCE (and HIDDEN-AUTOINSTANCE)
tells GATE to automatically create instances of this resource when the plugin is loaded. Any number of auto instances may be defined, GATE will create them all. Each <AUTOINSTANCE> element may optionally contain <PARAM NAME="..." VALUE="..." /> elements giving parameter values to use when creating the instance. Any parameters not specified explicitly will take their default values. Use <HIDDEN-AUTOINSTANCE> if you want the auto instances not to show up in the GATE GUI – this is useful for things like document formats where there should only ever be a single instance in GATE and that instance should not be deleted.

For visual resources, a <GUI> element should also be provided. This takes a TYPE attribute, which can have the value LARGE or SMALL. LARGE means that the visual resource is a large viewer and should appear in the main part of the GATE window on the right hand side, SMALL means the VR is a small viewer which appears in the space below the resources tree in the bottom left. The <GUI> element supports the following sub-elements:

RESOURCE_DISPLAYED
the type of GATE resource this VR can display. Any resource whose type is assignable to this type will be displayed with this viewer, so for example a VR that can display all types of document would specify gate.Document, whereas a VR that can only display the default GATE document implementation would specify gate.corpora.DocumentImpl.
MAIN_VIEWER
if present, GATE will consider this VR to be the “most important” viewer for the given resource type, and will ensure that if several different viewers are all applicable to this resource, this viewer will be the one that is initially visible.

For annotation viewers, you should specify an <ANNOTATION_TYPE_DISPLAYED> element giving the annotation type that the viewer can display (e.g. Sentence).

Resource parameters

Resources may also have parameters of various types. These resources, from the GATE distribution, illustrate the various types of parameters:

<RESOURCE>  
  <NAME>GATE document</NAME>  
  <CLASS>gate.corpora.DocumentImpl</CLASS>  
  <INTERFACE>gate.Document</INTERFACE>  
  <COMMENT>GATE transient document</COMMENT>  
  <OR>  
    <PARAMETER NAME="sourceUrl"  
      SUFFIXES="txt;text;xml;xhtm;xhtml;html;htm;sgml;sgm;mail;email;eml;rtf"  
      COMMENT="Source URL">java.net.URL</PARAMETER>  
    <PARAMETER NAME="stringContent"  
      COMMENT="The content of the document">java.lang.String</PARAMETER>  
  </OR>  
  <PARAMETER  
    COMMENT="Should the document read the original markup"  
    NAME="markupAware" DEFAULT="true">java.lang.Boolean</PARAMETER>  
  <PARAMETER NAME="encoding" OPTIONAL="true"  
    COMMENT="Encoding" DEFAULT="">java.lang.String</PARAMETER>  
  <PARAMETER NAME="sourceUrlStartOffset"  
    COMMENT="Start offset for documents based on ranges"  
    OPTIONAL="true">java.lang.Long</PARAMETER>  
  <PARAMETER NAME="sourceUrlEndOffset"  
    COMMENT="End offset for documents based on ranges"  
    OPTIONAL="true">java.lang.Long</PARAMETER>  
  <PARAMETER NAME="preserveOriginalContent"  
    COMMENT="Should the document preserve the original content"  
    DEFAULT="false">java.lang.Boolean</PARAMETER>  
  <PARAMETER NAME="collectRepositioningInfo"  
    COMMENT="Should the document collect repositioning information"  
    DEFAULT="false">java.lang.Boolean</PARAMETER>  
  <ICON>lr.gif</ICON>  
</RESOURCE>

<RESOURCE>  
  <NAME>Document Reset PR</NAME>  
  <CLASS>gate.creole.annotdelete.AnnotationDeletePR</CLASS>  
  <COMMENT>Document cleaner</COMMENT>  
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>  
  <PARAMETER NAME="annotationTypes" RUNTIME="true"  
    OPTIONAL="true">java.util.ArrayList</PARAMETER>  
</RESOURCE>

Parameters may be optional, and may have default values (and may have comments to describe their purpose, which is displayed by the GUI during interactive parameter setting).

Some PR parameters are set at execution time (RUNTIME), others at initialisation time. For example, at execution time a document is supplied to a language analyser; at initialisation time a grammar may be supplied.

The <PARAMETER> tag takes the following attributes:

NAME:
name of the JavaBean property that the parameter refers to, i.e. for a parameter named “someParam” the class must have setSomeParam and getSomeParam methods.5
DEFAULT:
default value (see below).
RUNTIME:
doesn’t need setting at initialisation time, but must be set before calling execute(). Only meaningful for PRs.
OPTIONAL:
not required
COMMENT:
for display purposes
ITEM_CLASS_NAME:
(only applies to parameters whose type is java.util.Collection or a type that implements or extends this) this specifies the type of elements the collection contains, so the GUI can use the right type when parameters are set. If omitted, the GUI will pass in the elements as Strings.
SUFFIXES:
(only applies to parameters of type java.net.URL) a semicolon-separated list of file suffixes that this parameter typically accepts, used as a filter in the file chooser provided by the GUI to select a local file as the parameter value.

It is possible for two or more parameters to be mutually exclusive (i.e. a user must specify one or the other but not both). In this case the <PARAMETER> elements should be grouped together under an <OR> element.

The type of the parameter is specified as the text of the <PARAMETER> element, and the type supplied must match the return type of the parameter’s get method. Any reference type (class, interface or enum) may be used as the parameter type, including other resource types – in this case the GUI will offer a list of the loaded instances of that resource as options for the parameter value. Primitive types (char, boolean, …) are not supported, instead you should use the corresponding wrapper type (java.lang.Character, java.lang.Boolean, …). If the getter returns a parameterized type (e.g. List<Integer>) you should just specify the raw type (java.util.List) here6.

The DEFAULT string is converted to the appropriate type for the parameter - java.lang.String parameters use the value directly, primitive wrapper types e.g. java.lang.Integer use their respective valueOf methods, and other built-in Java types can have defaults specified provided they have a constructor taking a String.

The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g. http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file. Thus a DEFAULT of "resources/main.jape" in the file file:/opt/MyPlugin/creole.xml is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.

For Collection-valued parameters multiple values may be specified, separated by semicolons, e.g. "foo;bar;baz"; if the parameter’s type is an interface – Collection or one of its sub-interfaces (e.g. List) – a suitable concrete class (e.g. ArrayList, HashSet) will be chosen automatically for the default value.

For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g. "kind=word;orth=upperInitial". For enum-valued parameters the default string is taken as the name of the enum constant to use. Finally, if no DEFAULT attribute is specified, the default value is null.

4.9.2 Configuring resources using annotations [#]

Annotation-driven configuration is only available in snapshot builds of GATE, build 2988 and later (subversion revision 9845).

As an alternative to the XML configuration style, GATE provides Java 5 annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and mapped into the equivalent entries in creole.xml before parsing. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation.

For more detailed information, see the JavaDoc documentation for gate.creole.metadata.

To use annotation-driven configuration a creole.xml file is still required but it need only contain the following:

<CREOLE-DIRECTORY>  
  <JAR SCAN="true">myPlugin.jar</JAR>  
  <JAR>lib/thirdPartyLib.jar</JAR>  
</CREOLE-DIRECTORY>

This tells GATE to load myPlugin.jar and scan its contents looking for resource classes annotated with @CreoleResource. Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".

Basic resource-level data

To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:

import gate.creole.AbstractLanguageAnalyser;  
import gate.creole.metadata.*;  
 
@CreoleResource(name = "GATE Tokeniser",  
                comment = "Splits text into tokens and spaces")  
public class Tokeniser extends AbstractLanguageAnalyser {  
  ...

The @CreoleResource annotation provides slots for all the values that can be specified under <RESOURCE> in creole.xml, except <CLASS> (inferred from the name of the annotated class) and <JAR> (taken to be the JAR containing the class):

name
(String) the name of the resource, as it will appear in the “New” menu in the GATE GUI. If omitted, defaults to the bare name of the resource class (without a package name). (XML equivalent <NAME>)
comment
(String) a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in the GUI. If omitted, no comment is used. (XML equivalent <COMMENT>)
helpURL
(String) a URL to a help document on the web for this resource. It is used in the help browser inside GATE. (XML equivalent <HELPURL>)
isPrivate
(boolean) should this resource type be hidden from the GUI, so it does not appear in the “New” menus? If omitted, defaults to false (i.e. not hidden). (XML equivalent <PRIVATE/>)
icon
(String) the icon to use to represent the resource in the GUI. If omitted, a generic language resource or processing resource icon is used. (XML equivalent <ICON>, see the description above for details)
interfaceName
(String) the interface type implemented by this resource, for example a new type of document would specify "gate.Document" here. (XML equivalent <INTERFACE>)
autoInstances
(array of @AutoInstance annotations) definitions for any instances of this resource that should be created automatically when the plugin is loaded. If omitted, no auto-instances are created by default. (XML equivalent, one or more <AUTOINSTANCE> and/or <HIDDEN-AUTOINSTANCE> elements, see the description above for details)

For visual resources only, the following elements are also available:

guiType
(GuiType enum) the type of GUI this resource defines. (XML equivalent <GUI TYPE="LARGE|SMALL">)
resourceDisplayed
(String) the class name of the resource type that this VR displays, e.g. "gate.Corpus". (XML equivalent <RESOURCE_DISPLAYED>)
mainViewer
(boolean) is this VR the “most important” viewer for its displayed resource type? (XML equivalent <MAIN_VIEWER/>, see above for details)

For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).

Resource parameters

Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:

  @CreoleParameter(comment = "The location of the list of abbreviations")  
  public void setAbbrListUrl(URL listUrl) {  
    ...

GATE will infer the parameter’s name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter’s type is inferred from the type of the method parameter (java.net.URL in this case).

The annotation elements of @CreoleParameter correspond to the attributes of the <PARAMETER> tag in the XML configuration style:

comment
(String) an optional descriptive comment about the parameter. (XML equivalent COMMENT)
defaultValue
(String) the optional default value for this parameter. The value is specified as a string but is converted to the relevant type by GATE according to the conversions described in the previous section. Note that relative path default values for URL-valued parameters are still relative to the location of the creole.xml file, not the annotated class. (XML equivalent DEFAULT)
suffixes
(String) for URL-valued parameters, a semicolon-separated list of default file suffixes that this parameter accepts. (XML equivalent SUFFIXES)
collectionElementType
(Class) for Collection-valued parameters, the type of the elements in the collection. This can usually be inferred from the generic type information, for example public void setIndices(List<Integer> indices), but must be specified if the set method’s parameter has a raw (non-parameterized) type. (XML equivalent ITEM_CLASS_NAME)

Mutually-exclusive parameters (such as would be grouped in an <OR> in creole.xml) are handled by adding a disjunction="label" to the @CreoleParameter annotation – all parameters that share the same label are grouped in the same disjunction.
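For example, the mutually exclusive sourceUrl and stringContent parameters from the XML example earlier could be expressed with annotations roughly as follows (a sketch; the label “source” is arbitrary):

  @CreoleParameter(disjunction = "source", comment = "Source URL")
  public void setSourceUrl(URL sourceUrl) {
    ...
  }

  @CreoleParameter(disjunction = "source",
                   comment = "The content of the document")
  public void setStringContent(String stringContent) {
    ...
  }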

Optional and runtime parameters are marked using extra annotations, for example:

  @Optional  
  @RunTime  
  @CreoleParameter  
  public void setAnnotationSetName(String asName) {  
    ...

Inheritance

Unlike with pure XML configuration, when using annotations a resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class.

The resource name and the isPrivate and mainViewer flags are not inherited.

Parameter definitions are inherited in a similar way. This is one of the big advantages of annotation configuration over pure XML – if one resource class extends another then with pure XML configuration all the parent class’s parameter definitions must be duplicated in the subclass’s creole.xml definition. With annotations, parameters are inherited from the parent class (and its parent, etc.) as well as from any interfaces implemented. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.

Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:

  @HiddenCreoleParameter  
  public void setSomeParam(String someParam) {  
    super.setSomeParam(someParam);  
  }

The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.

Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:

  // superclass  
  @CreoleParameter(comment = "Location of the grammar file",  
                   suffixes = "jape")  
  public void setGrammarUrl(URL grammarLocation) {  
    ...  
  }  
 
  @Optional  
  @RunTime  
  @CreoleParameter(comment = "Feature to set on success")  
  public void setSuccessFeature(String name) {  
    ...  
  }

  //-----------------------------------  
  // subclass  
 
  // override the default value, inherit everything else  
  @CreoleParameter(defaultValue = "resources/defaultGrammar.jape")  
  public void setGrammarUrl(URL url) {  
    super.setGrammarUrl(url);  
  }  
 
  // we want the parameter to be required in the subclass  
  @Optional(false)  
  @CreoleParameter  
  public void setSuccessFeature(String name) {  
    super.setSuccessFeature(name);  
  }

Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource. If the subclass is not annotated then GATE assumes that all its configuration is contained in creole.xml in the usual way.

4.9.3 Mixing the configuration styles [#]

It is possible and often useful to mix and match the XML and annotation-driven configuration styles. The rule is always that anything specified in the XML takes priority over the annotations. The following examples show what this allows.

Overriding configuration for a third-party resource

Suppose you have a plugin from some third party that uses annotation-driven configuration. You don’t have the source code but you would like to override the default value for one of the parameters of one of the plugin’s resources. You can do this in the creole.xml:

<CREOLE-DIRECTORY>  
  <JAR SCAN="true">acmePlugin-1.0.jar</JAR>  
 
  <!-- Add the following to override the annotations -->  
  <RESOURCE>  
    <CLASS>com.acme.plugin.UsefulPR</CLASS>  
    <PARAMETER NAME="listUrl"  
      DEFAULT="resources/myList.txt">java.net.URL</PARAMETER>  
  </RESOURCE>  
</CREOLE-DIRECTORY>

The default value for the listUrl parameter in the annotated class will be replaced by your value.

External AUTOINSTANCEs

For resources like document formats, where there should always and only be one instance in GATE at any time, it makes sense to put the auto-instance definitions in the @CreoleResource annotation. But if the automatically created instances are a convenience rather than a necessity it may be better to define them in XML so other users can disable them without re-compiling the class:

<CREOLE-DIRECTORY>  
  <JAR SCAN="true">myPlugin.jar</JAR>  
 
  <RESOURCE>  
    <CLASS>com.acme.AutoPR</CLASS>  
    <AUTOINSTANCE>  
      <PARAM NAME="type" VALUE="Sentence" />  
    </AUTOINSTANCE>  
    <AUTOINSTANCE>  
      <PARAM NAME="type" VALUE="Paragraph" />  
    </AUTOINSTANCE>  
  </RESOURCE>  
</CREOLE-DIRECTORY>

Inheriting parameters

If you would prefer to use XML configuration for your own resources, but would like to benefit from the parameter inheritance features of the annotation-driven approach, you can write a normal creole.xml file with all your configuration and just add a blank @CreoleResource annotation to your class. For example:

package com.acme;  
import gate.*;  
import gate.creole.metadata.CreoleResource;  
 
@CreoleResource  
public class MyPR implements LanguageAnalyser {  
  ...  
}

<!-- creole.xml -->  
<CREOLE-DIRECTORY>  
  <CREOLE>  
    <RESOURCE>  
      <NAME>My Processing Resource</NAME>  
      <CLASS>com.acme.MyPR</CLASS>  
      <COMMENT>...</COMMENT>  
      <PARAMETER NAME="annotationSetName"  
        RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>  
      <!--  
      don’t need to declare document and corpus parameters, they  
      are inherited from LanguageAnalyser  
      -->  
    </RESOURCE>  
  </CREOLE>  
</CREOLE-DIRECTORY>

N.B. Without the @CreoleResource the parameters would not be inherited.

Chapter 5
Visual CREOLE [#]

...neurobiologists still go on openly studying reflexes and looking under the hood, not huddling passively in the trenches. Many of them still keep wondering: how does the inner life arise? Ever puzzled, they oscillate between two major fictions: (1) The brain can be understood; (2) We will never come close. Meanwhile they keep pursuing brain mechanisms, partly from habit, partly out of faith. Their premise: The brain is the organ of the mind. Clearly, this three-pound lump of tissue is the source of our ”insight information” about our very being. Somewhere in it there might be a few hidden guidelines for better ways to lead our lives.

Zen and the Brain, James H. Austin, 1998 (p. 6).

This chapter details the other visual resources that can be used in GATE. While these tools were not included as part of earlier releases of GATE, as of GATE version 3.0, they are included as part of the standard release, and are now open source. GAZE, Ontogazetteer and the Protégé VR for GATE were all developed by Ontotext, who should be contacted for further information about these components.

5.1 Gazetteer Visual Resource - GAZE [#]

Gaze is a tool for editing the gazetteer lists, definitions and mapping to ontology. It is suitable for use both for Plain/Linear Gazetteers (Default and Hash Gazetteers) and Ontology-enabled Gazetteers (OntoGazetteer). The Gazetteer PR associated with the viewer is reinitialised every time a save operation is performed. Note that GAZE does not scale up to very large lists (we suggest not using it to view over 40,000 entries and not copying in more than 10,000 entries).

5.1.1 Running Modes

The running mode depends on the type of gazetteer loaded in the VR. The mode in which Linear/Plain Gazetteers are loaded is called Linear/Plain Mode. In this mode, the Linear Definition is displayed in the left pane, and the Gazetteer List is displayed in the right pane. The Extended/Ontology/Mapping mode is on when the displayed gazetteer is ontology-aware, which means that there exists a mapping between classes in the ontology and lists of phrases. Two more panes are displayed when in this mode. On the top in the left-most pane there is a tree view of the ontology hierarchy, and at the bottom the mapping definition is displayed.

5.1.2 Loading a Gazetteer

To load a gazetteer into the viewer it is necessary to associate the Gaze VR with the gazetteers. Afterwards whenever a gazetteer PR is loaded, Gaze will appear on double-click over the gazetteer in the Processing Resources branch of the Resources Tree.

5.1.3 Linear Definition Pane

This pane displays the nodes of the linear definition, and allows manipulation of the whole definition as a file, as well as the single nodes. Whenever a gazetteer list is modified, its node in the linear definition is coloured in red.

5.1.4 Linear Definition Toolbar

All the functionality explained in this section (New, Load, Save, Save As) is also accessible via File / Linear Definition in the menu bar of Gaze.

New – Pressing New invokes a file dialog where the location of the new definition is specified.

Load – Pressing Load invokes a file dialog, and after locating the new definition it is loaded by pressing Open.

Save – Pressing Save saves the definition to the location from which it has been read.

Save As – Pressing Save As allows another location to be chosen, and the definition saved there.

5.1.5 Operations on Linear Definition Nodes

Double-click node – Double-clicking on a definition node displays the node's gazetteer list in the right-most pane of the viewer.

Insert – Right-clicking on a node and choosing Insert displays a dialog requesting List, Major Type, Minor Type and Languages. The mandatory fields are List and Major Type. After pressing OK, a new linear node is added to the definition.

Remove – Right-clicking on a node and choosing Remove removes the selected linear node from the definition.

Edit – Right-clicking on a node and choosing Edit displays a dialog allowing changes to the fields List, Major Type, Minor Type and Languages.

5.1.6 Gazetteer List Pane

The gazetteer list pane has a toolbar with buttons similar to those of the linear definition (New, Load, Save, Save As). They work as their names suggest and as explained in the Linear Definition Pane section, and are also accessible from File / Gazetteer List in the menu bar of Gaze. The only addition is Save All, which saves all modified gazetteer lists. Editing a gazetteer list is as simple as editing a text file: use Ctrl+A to select the whole list, Ctrl+C to copy the selection, Ctrl+V to paste it, Del to delete the selected text or a single character, and so on.

5.1.7 Mapping Definition Pane

The mapping definition is displayed one mapping node per row. It consists of a gazetteer list, ontology URL, and class id. The content of the gazetteer list in the node is accessible through double-clicking. It is displayed in the Gazetteer List Pane. The toolbar allows the creation of a new definition (New), the loading of an existing one (Load), saving to the same or new location (Save/Save As). The functionality of the toolbar buttons is also available via File.

5.2 Ontogazetteer [#]

The Ontogazetteer, or Hierarchical Gazetteer, is an interface which makes ontologies “visible” in GATE, supporting basic methods for hierarchy management and traversal. In GATE, an ontology is represented at the same level as a document, and has nodes called classes (for consistency with RDF(S) and DAML+OIL, though they are really just types). The OntoGazetteer assigns classes rather than major or minor types, and is aware of mappings between lists and class IDs. There are two Visual Resources, one for editing the standard gazetteer lists (including the definition files and the mappings to the ontology), and one for editing the ontology itself.

5.2.1 Gazetteer Lists Editor and Mapper [#]

This is a VR for editing the gazetteer lists, and mapping them to classes in an ontology. It provides load/store/edit for the lists, load/store/edit for the mapping information, loading of ontologies, load/store/edit for the linear definition file, and mapping of the lists file to the major type, minor type and language.

Left pane: A single ontology is visualized in the left pane of the VR. The mapping between a list and a class is displayed by showing the list as a subclass with a different icon. The mapping is specified by drag and drop from the linear definition pane (in the middle) and/or by right click menu.

Middle pane: The middle pane displays the nodes/lines in the linear definition file. By double clicking on a node the corresponding list is opened. Editing of the line/node is done by right clicking and choosing edit: a dialogue appears (lower part of the scheme) allowing the modification of the members of the node.

Right pane: In the right pane a single gazetteer list is displayed. It can be edited and parts of it can be cut/copied/pasted.

5.2.2 Ontogazetteer Editor [#]

This is a VR for editing the class hierarchy of an ontology. It provides storing to and loading from RDF/RDFS, and load/edit/store of the class hierarchy of an ontology.

Left pane: The various ontologies loaded are listed here. On double-click, or on right-click and choosing Edit from the menu, the ontology is visualised in the right pane.

Right pane: Besides the visualization of the class hierarchy of the ontology the following operations are allowed:

Changes made through this VR alter the ontology definition file.

5.3 The Document Editor [#]




Figure 5.1: Main window with a document editor showing the location http://gate.ac.uk. You can see a popup window under the word ’EPSRC’ for creating/editing an annotation, the table of annotations highlighted at the bottom and the list of existing annotation types on the left.


The document editor is contained in the central tabbed pane, as seen in figure 5.1. It consists of a top panel with buttons and icons that control the display of the different views, and the search box.

The central part is the text view; at the bottom there is the annotations list view, and on the left the annotation sets view, which can be replaced with the co-reference editor.

These views are described in the next subsections, so here we focus only on the annotation editor popup window that you can see in the middle of the document editor.

The annotation editor consists of action icons at the top, then a drop-down box for the annotation type, a table of feature names and values, and finally a disclosure panel for the search and annotate function.

To grow or shrink the span of the annotation at its start, use the two arrow icons on the left, or the Right and Left keys. To change the annotation end, use the two arrow icons next to them on the right, or the Alt+Right and Alt+Left keys. Add Shift or Control+Shift to make the span increment bigger. The red X icon removes the annotation, and the pin icon pins the window so that it doesn't move when you select another annotation.

All the views are updated each time an annotation changes. There is more than one way to create or edit annotations, so try to find the best one for your task. For example, if you want to delete all the annotations of one type that are at the beginning of a document, you can use the annotations list view: sort it by start offset, select the rows to delete and right-click for the context menu to delete the selection. This is much faster than selecting each annotation in the document editor and deleting it.

For more information on how to create and edit annotations or search and annotate the document see section 3.19.

See also section 12.2.2 for the compound document editor.

5.3.1 The Annotation Sets View [#]

The annotation sets view is displayed on the left part of the document editor. It is a tree-like view with a root for each annotation set; the first annotation set is the default, unnamed one.

To display all the annotations of one type, tick its checkbox or use the Space key. To delete an annotation type, use the Delete key. To change the colour, use the Enter key. There is a context menu for all these actions that you can display by right-clicking on an annotation type, a selection or an annotation set.

To create a new annotation set use the text field at the bottom and the ’New’ button.

5.3.2 The Annotations List View [#]

The annotations list view is displayed at the bottom of the document editor. It is a table of all the highlighted annotations in the document. You can sort the table by clicking on the headers and hide columns by right-clicking on the headers.

A context menu is available on the selected rows; it allows you to delete annotations or to display one or more annotation editors.

5.3.3 The Co-reference Editor [#]




Figure 5.2: Co-reference editor inside a document editor. The popup window in the document under the word ’EPSRC’ is used to add highlighted annotations to a co-reference chain. Here the annotation type ’Organization’ of the annotation set ’Default’ is highlighted and also the co-references ’EC’ and ’GATE’.


The co-reference editor allows co-reference chains (see section 8.8) to be displayed and edited in the GATE GUI. To display the co-reference editor, first open a document in GATE, and then click on the Co-reference Editor button in the document viewer.

The combo box at the top of the co-reference editor allows you to choose which annotation set to display co-references for. If an annotation set contains no co-reference data, then the tree below the combo box will just show ’Coreference Data’ and the name of the annotation set. However, when co-reference data does exist, a list of all the co-reference chains that are based on annotations in the currently selected set is displayed. The name of each co-reference chain in this list is the same as the text of whichever element in the chain is the longest. It is possible to highlight all the member annotations of any chain by selecting it in the list.

When a co-reference chain is selected, if the mouse is placed over one of its member annotations, then a pop-up box appears, giving the user the option of deleting the item from the chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it will be removed from the list of chains. If the name of the chain was derived from the item that was deleted, then the chain will be given a new name based on the next longest item in the chain.

A combo box near the top of the co-reference editor allows the user to select an annotation type from the current set. When the Show button is selected all the annotations of the selected type will be highlighted. Now when the mouse pointer is placed over one of those annotations, a pop-up box will appear giving the user the option of adding the annotation to a co-reference chain. The annotation can be added to an existing chain by typing the name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if the user presses the down cursor key, a list of all the existing chains appears, together with the option [New Chain]. Selecting the [New Chain] option will cause a new chain to be created containing the selected annotation as its only element.

Each annotation can only be added to a single chain, but annotations of different types can be added to the same chain, and the same text can appear in more than one chain if it is referenced by two or more annotations.

Chapter 6
Language Resources: Corpora, Documents and Annotations [#]

Sometimes in life you’ve got to dance like nobody’s watching.

I think they should introduce ’sleeping’ to the Olympics. It would be an excellent field event, in which the ’athletes’ (for want of a better word) all lay down in beds, just beyond where the javelins land, and the first one to fall asleep and not wake up for three hours would win gold. I, for one, would be interested in seeing what kind of personality would be suited to sleeping in a competitive environment.

Life is a mystery to be lived, not a problem to be solved.

Round Ireland with a Fridge, Tony Hawks, 1998 (pp. 119, 147, 179).

This chapter documents GATE’s model of corpora, documents and annotations on documents. Section 6.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 6.2, section 6.3 and section 6.4 describe corpora, documents and annotations on documents respectively. Section 6.5 describes GATE’s support for diverse document formats, and section 6.6 describes facilities for XML input/output.

6.1 Features: Simple Attribute/Value Data [#]

GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java’s Map interface (part of the Collections API).
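As a minimal sketch (assuming the GATE library is on the classpath; the feature names used here are purely illustrative, not predefined GATE features), feature data can be read and written through the standard Map operations:

import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;

public class FeatureExample {
  public static void main(String[] args) throws Exception {
    Gate.init();  // initialise the GATE library (must be done once)

    // create a transient document from a string of content
    Document doc = Factory.newDocument("The sky is falling.");

    // FeatureMap is a java.util.Map: names are Strings, values any Java object
    FeatureMap feats = doc.getFeatures();
    feats.put("author", "Chicken Little");   // illustrative feature name
    feats.put("reviewed", Boolean.FALSE);

    System.out.println(feats.get("author"));
  }
}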

6.2 Corpora: Sets of Documents plus Features [#]

A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).

Documents have a DocumentContent, which is text at present (future versions may add support for audiovisual content), and one or more AnnotationSets, which are Java Sets.
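A rough sketch of this structure in code (assuming GATE has already been initialised with Gate.init(); the corpus name and feature name are illustrative):

import gate.Corpus;
import gate.Document;
import gate.Factory;

// a corpus is a LanguageResource containing Documents, with its own FeatureMap
Corpus corpus = Factory.newCorpus("example corpus");
Document doc = Factory.newDocument("Cyndi savored the soup.");
corpus.add(doc);
corpus.getFeatures().put("language", "en");       // illustrative feature

// every document has content and one or more annotation sets
System.out.println(doc.getContent());             // the DocumentContent (text)
System.out.println(doc.getAnnotations().size());  // the default annotation set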

6.3 Documents: Content plus Annotations plus Features [#]

Documents are modelled as content plus annotations (see section 6.4) plus features (see section 6.1). The content of a document can be any subclass of DocumentContent.

6.4 Annotations: Directed Acyclic Graphs [#]

Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g. character offsets.
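For example, the following fragment (assuming doc is an existing gate.Document; the offsets, annotation type and feature name are illustrative) adds an annotation to the default annotation set:

import gate.Annotation;
import gate.AnnotationSet;
import gate.Factory;
import gate.FeatureMap;

AnnotationSet defaultSet = doc.getAnnotations();   // the default annotation set

FeatureMap feats = Factory.newFeatureMap();
feats.put("name_type", "person");

// add(start, end, type, features) creates start and end Nodes at the given
// character offsets and returns the new annotation's Integer ID
// (it throws InvalidOffsetException if the offsets fall outside the content)
Integer id = defaultSet.add(new Long(0), new Long(5), "name", feats);

Annotation ann = defaultSet.get(id);
System.out.println(ann.getType() + " " + ann.getStartNode().getOffset()
    + "-" + ann.getEndNode().getOffset());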

6.4.1 Annotation Schemas [#]

Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using the development environment to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)

Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.21 describes how to create new schemas.

Date schema
<?xml version="1.0"?>  
<schema  
xmlns="http://www.w3.org/2000/10/XMLSchema">  
 <!-- XSchema definition for Date-->
  <element name="Date">  
    <complexType>  
      <attribute name="kind"  use="optional">  
        <simpleType>  
          <restriction base="string">  
            <enumeration value="date"/>  
            <enumeration value="time"/>  
            <enumeration value="dateTime"/>  
          </restriction>  
        </simpleType>  
    </attribute>  
  </complexType>  
 </element>  
</schema>

Person schema
<?xml version="1.0"?>  
<schema  
xmlns="http://www.w3.org/2000/10/XMLSchema">  
    <!-- XSchema definition for Person-->  
    <element name="Person" />  
</schema>

Address schema
<?xml version="1.0"?> <schema  
xmlns="http://www.w3.org/2000/10/XMLSchema">  
    <!-- XSchema definition for Address-->
    <element name="Address">  
      <complexType>  
        <attribute name="kind"  use="optional">  
          <simpleType>  
            <restriction base="string">  
              <enumeration value="email"/>  
              <enumeration value="url"/>  
              <enumeration value="phone"/>  
              <enumeration value="ip"/>  
              <enumeration value="street"/>  
              <enumeration value="postcode"/>  
              <enumeration value="country"/>  
              <enumeration value="complete"/>  
            </restriction>  
        </simpleType>  
    </attribute>  
  </complexType>  
</element>  
</schema>

6.4.2 Examples of Annotated Documents [#]

This section shows some simple examples of annotated documents.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML community.

Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character (see TIPSTER Architecture Design Document).

Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.

The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.







Text
Cyndi savored the soup.
^0...^5...^10..^15..^20

Annotations
Id  Type      Span Start  Span End  Features
1   token     0           5         pos=NP
2   token     6           13        pos=VBD
3   token     14          17        pos=DT
4   token     18          22        pos=NN
5   token     22          23
6   name      0           5         name_type=person
7   sentence  0           23

Table 6.1: Result of annotation on a single sentence

Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.







Text
Cyndi savored the soup.
^0...^5...^10..^15..^20

Annotations
Id  Type      Span Start  Span End  Features
1   token     0           5         pos=NP
2   token     6           13        pos=VBD
3   token     14          17        pos=DT
4   token     18          22        pos=NN
5   token     22          23
6   name      0           5         name_type=person
7   sentence  0           23        constituents=[1],[2],[3],[4],[5]

Table 6.2: Result of annotations including parse information

In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of a feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents. At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 6.3).

Text
To: All Barnyard Animals
^0...^5...^10..^15..^20.
From: Chicken Little
^25..^30..^35..^40..
Date: November 10,1194
...^50..^55..^60..^65.
Subject: Descending Firmament
.^70..^75..^80..^85..^90..^95
Priority: Urgent
.^100.^105.^110.
The sky is falling. The sky is falling.
....^120.^125.^130.^135.^140.^145.^150.

Annotations
Id  Type       Span Start  Span End  Features
1   Addressee  4           24
2   Source     31          45
3   Date       53          69        ddmmyy=101194
4   Subject    78          98
5   Priority   109         115
6   Body       116         155
7   Sentence   116         135
8   Sentence   136         155

Table 6.3: Annotation showing overall document structure

If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields. Our final example (Table 6.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:







Text
Topster tackles 2 terrorbytes.
^0...^5...^10..^15..^20..^25..

Annotations
Id  Type   Span Start  Span End  Features
1   token  0           7         pos=NP correction=TIPSTER
2   token  8           15        pos=VBZ
3   token  16          17        pos=CD
4   token  18          29        pos=NNS correction=terabytes
5   token  29          30

Table 6.4: Annotation modifying the document

6.4.3 Creating, Viewing and Editing Diverse Annotation Types [#]

Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved as inline (save preserving format).

To view and edit annotation types, see Section 3.16. To add annotations of a new type, see Section 3.19. To add a new annotation schema, see Section 3.21.

6.5 Document Formats [#]

The following document formats are supported by GATE:

By default GATE will try and identify the type of the document, then strip and convert any markup into GATE’s annotation format. To disable this process, set the markupAware parameter on the document to false.

When reading a document of one of these types, GATE extracts the text between tags (where such exist) and creates a GATE annotation filled as follows:

The name of the tag constitutes the annotation's type, all the tag's attributes materialise as the annotation's features, and the annotation spans the text covered by the tag. A few exceptions to this rule apply for the RTF, Email and Plain Text formats, which will be described later in the input sections of those formats.

The text between tags is extracted and appended to the GATE document’s content and all annotations created from tags will be placed into a GATE annotation set named “Original markups”.

Example:

If the markup is like this:

<aTagName attrib1="value1" attrib2="value2" attrib3="value3"> A  
piece of text</aTagName>

then the annotation created by GATE will look like:

annotation.type = "aTagName";  
annotation.fm={attrib1=value1;attrib2=value2;attrib3=value3};  
annotation.start=startNode;  
annotation.end = endNode;

The startNode and endNode are created from offsets referring to the beginning and the end of “A piece of text” in the document's content.

The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is the “UTF-8” encoding, which is also the most storage-efficient one for UNICODE. If, when loading a document in GATE, the encoding parameter is set to “” (the empty string), then the default encoding of the platform will be used.
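As a sketch, a document can be created through the Factory with an explicit encoding and MIME type (assuming GATE has been initialised; the file location below is purely illustrative):

import gate.Document;
import gate.Factory;
import gate.FeatureMap;

FeatureMap params = Factory.newFeatureMap();
params.put("sourceUrl", new java.net.URL("file:/tmp/example.txt"));  // illustrative
params.put("encoding", "UTF-8");          // "" means use the platform default
params.put("mimeType", "text/plain");     // omit to let GATE detect the type
params.put("markupAware", Boolean.TRUE);  // false disables markup unpacking

Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", params);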

6.5.1 Detecting the right reader [#]

In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:

The first is the file extension (xml, htm, html, txt, sgm, rtf, etc.), the second is the HTTP information sent by a web server regarding the content type of the document being sent (text/html, text/xml, etc.), and the third is certain sequences of characters, which are ultimately number sequences. GATE is capable of supporting multimedia documents, if the right reader is added to the framework. Sometimes, multimedia documents are identified by a signature consisting of a sequence of numbers; inside GATE these are called magic numbers. For textual documents, certain character sequences form such magic numbers. Examples of magic number sequences will be provided in the Input section of each format supported by GATE.

All these tests are applied to each document read, and a voting mechanism then decides which is the best reader to associate with the document. The tests have different priorities. The document's extension test has the highest priority: if the system is in doubt about which reader to choose, the one associated with the document's extension will be selected. The next highest priority is given to the web server's content type, and the third to magic number detection. However, any two tests that identify the same MIME type together take the highest priority in deciding the reader that will be used. The web server test is not always successful, as documents may be loaded from a local file system, and the magic number detection test is not always applicable. In the next paragraphs we will see how these tests are performed and what the general mechanism behind reader detection is.

The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps filled by the init() method of each reader. This method comes with three signatures:

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, URL url)

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, String fileSuffix)

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, MimeType mimeType)

The first two methods try to detect the right MimeType for the GATE document, and then call the third one to return the reader associated with that MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from “http://jigsaw.w3.org” for MIME types.

The magic numbers test is performed using the information from the magic2mimeTypeMap map. Each key from this map is searched for in the first bufferSize (default value 2048) characters of text. The method that does this is called runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class. More details about it can be found in the GATE API documentation.

In order to activate a reader to perform the unpacking, the CREOLE definition of a GATE document defines a parameter called “markupAware”, initialised with a default value of true. This parameter forces GATE to detect a proper reader for the document being read. If no reader is found, the document's content is loaded and presented to the user, just as in any other text editor (for textual documents).

The next subsections investigate the particularities of each format and describe the file extensions registered with each document format.

6.5.2 XML [#]

Input [#]

GATE permits the processing of any XML document and offers support for XML namespaces. It benefits from the power of Apache's Xerces parser and also makes use of Sun's JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property (”javax.xml.parsers.SAXParserFactory”).

GATE will accept any well-formed XML document as input. Although it could validate XML documents against DTDs, it does not do so, because validation is time-consuming and in many cases produces messages that are annoying for the user.

There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we said previously, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases such as “end.</P><P>Start”, the ending word of one annotation may be concatenated with the beginning word of the annotation currently being created, resulting in garbage input for GATE processing resources that operate at the text surface.

Let’s take another example in order to better understand the problem :

<title>This is a title</title><p>This is a paragraph</p><a  
href="#link">Here is an useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document’s content will be as follows:

This is a titleThis is a paragraphHere is an useful link

The annotations created will refer to the right parts of the text, but for GATE's processing resources (tokeniser, gazetteer, etc.) which work on this text, this would be a major disaster. Therefore, in order to prevent this problem from happening, GATE checks whether it is likely to join words and, if so, inserts a space between them. So the text will look like this after being loaded in GATE:

This is a title This is a paragraph Here is an useful link

There are cases when such words are meant to be joined, but they are few. This is why it remains an open problem.

The extensions associated with the XML reader are:

The web server content type associated with XML documents is: text/xml.

The magic numbers test searches inside the document for the XML signature (<?xml version="1.0"). It is also able to detect whether the XML document uses the semantics described in the GATE document format DTD (see section 6.5.2) or other semantics.

Output [#]

GATE is able to ensure persistence for its resources. There are various layers of persistence, extending up to database persistence. However, for some purposes a light and simple level of persistence is desirable. The types of persistent storage used for Language Resources are:

We describe the latter case here.

XML persistence doesn't necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be all kinds of objects, with various layers of nesting; for example, lists containing lists containing maps, etc. Serializing these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost (the types will be converted into Strings). GATE provides full serialization of certain types of features such as collections, strings and numbers. Only collections containing strings or numbers can be fully serialized; all other features are serialized using their string representation and, when read back, they will all be strings instead of the original objects. Consequences of this may be observed when performing evaluations (see the evaluation section).

When GATE outputs an XML document it may do so in one of two ways:

In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-specific one which can be read back by the system to recreate all the information that GATE held internally for the document.

In order to understand why there are two types of XML serialization, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, sometimes it is impossible to save a document as XML using tags that surround the text referred by the annotation, because tags crossover situations could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.

The problem of crossover tags appears with GATE's second option (the preserve format one), which is implemented at the cost of losing certain annotations. The way it is applied in GATE is to restore the original markup and, where possible, add in the same manner the annotations produced by GATE.

How to access and make use of the two ways of XML serialization

Save As XML option

This option is available in GATE's GUI in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling Save As XML on each document of the corpus. This option saves all the annotations of a document together with their features (applying the restrictions previously discussed), using the GateDocument.dtd:

 <!ELEMENT GateDocument (GateDocumentFeatures,
           TextWithNodes, (AnnotationSet+))>
 <!ELEMENT GateDocumentFeatures (Feature+)>
 <!ELEMENT Feature (Name, Value)>
 <!ELEMENT Name (#PCDATA)>
 <!ELEMENT Value (#PCDATA)>
 <!ELEMENT TextWithNodes (#PCDATA | Node)*>
 <!ELEMENT AnnotationSet (Annotation*)>
 <!ATTLIST AnnotationSet  Name CDATA #IMPLIED>
 <!ELEMENT Annotation (Feature*)>
 <!ATTLIST Annotation  Type      CDATA #REQUIRED
                       StartNode CDATA #REQUIRED
                       EndNode   CDATA #REQUIRED>
 <!ELEMENT Node EMPTY>
 <!ATTLIST Node id CDATA #REQUIRED>

The document is saved under a name chosen by the user and it may have any extension. However, the recommended extension would be “xml”.

Using GATE’s API, this option is available by calling gate.Document’s toXml() method. This method returns a string which is the XML representation of the document on which the method was called.

Note: It is recommended that the string representation be saved on the file system using the UTF-8 encoding, as the first line of the string is: <?xml version="1.0" encoding="UTF-8"?>
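For instance, a minimal sketch (the file name is illustrative) that writes the result to disk in UTF-8:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// the full GATE-format XML serialization of the document
String xml = doc.toXml();

// write it out in UTF-8, matching the encoding declared on its first line
Writer out = new OutputStreamWriter(new FileOutputStream("/tmp/example.xml"), "UTF-8");
try {
  out.write(xml);
} finally {
  out.close();
}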

Example of such a GATE format document:

<?xml version="1.0" encoding="UTF-8" ?>  
<GateDocument>  
 
<!-- The document’s features -->  
 
<GateDocumentFeatures>  
<Feature>  
  <Name className="java.lang.String">MimeType</Name>  
  <Value className="java.lang.String">text/plain</Value>  
</Feature>  
<Feature>  
  <Name className="java.lang.String">gate.SourceURL</Name>  
  <Value className="java.lang.String">file:/G:/tmp/example.txt</Value>  
</Feature>  
</GateDocumentFeatures>  
 
<!-- The document content area with serialized nodes -->  
 
<TextWithNodes>  
<Node id="0"/>A TEENAGER <Node  
id="11"/>yesterday<Node id="20"/> accused his parents of cruelty  
by feeding him a daily diet of chips which sent his weight  
ballooning to 22st at the age of l2<Node id="146"/>.<Node  
id="147"/>  
</TextWithNodes>  
 
<!-- The default annotation set -->  
 
<AnnotationSet>  
<Annotation Type="Date" StartNode="11"  
EndNode="20">  
<Feature>  
  <Name className="java.lang.String">rule2</Name>  
  <Value className="java.lang.String">DateOnlyFinal</Value>  
</Feature> <Feature>  
  <Name className="java.lang.String">rule1</Name>  
  <Value className="java.lang.String">GazDateWords</Value>  
</Feature> <Feature>  
  <Name className="java.lang.String">kind</Name>  
  <Value className="java.lang.String">date</Value>  
</Feature> </Annotation> <Annotation Type="Sentence" StartNode="0"  
EndNode="147"> </Annotation> <Annotation Type="Split"  
StartNode="146" EndNode="147"> <Feature>  
  <Name className="java.lang.String">kind</Name>  
  <Value className="java.lang.String">internal</Value>  
</Feature> </Annotation> <Annotation Type="Lookup" StartNode="11"  
EndNode="20"> <Feature>  
  <Name className="java.lang.String">majorType</Name>  
  <Value className="java.lang.String">date_key</Value>  
</Feature> </Annotation>  
</AnnotationSet>  
 
<!-- Named annotation set -->  
 
<AnnotationSet Name="Original markups" >  
 <Annotation  
Type="paragraph" StartNode="0" EndNode="147"> </Annotation>  
</AnnotationSet>  
</GateDocument>

Note: all features that are not numbers, strings, or collections containing numbers or strings are discarded. With this option, GATE does not preserve features it cannot restore.

The preserve format option

This option is available in the GATE GUI from the popup menu of the annotations table. If no annotation in this table is selected, then the option will restore the document's original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossover condition, that annotation is discarded and a message is issued by GATE.

This option makes it possible to generate an XML document with tags surrounding the text to which each annotation refers, and features saved as attributes. All features which are collections, strings or numbers are saved; the others are discarded. However, when read back, only the attributes under the GATE namespace (see below) are reconstructed differently from the others. That is because GATE does not store in the XML document the information about the feature's class, nor, for collections, the class of the items. So when read back, all features will become strings, except those under the GATE namespace.

One will notice that all generated tags have an attribute called “gateId” under the namespace “http://www.gate.ac.uk”. The attribute is used when the document is read back into GATE, in order to restore the annotation's old ID. This feature is needed because it works in close cooperation with another attribute under the same namespace, called “matches”. This attribute indicates annotations/tags that refer to the same entity. They are under this namespace because GATE is sensitive to them and treats them differently from all other elements and attributes, which fall under the general reading algorithm described at the beginning of this section.

The “gateId” attribute under the GATE namespace is used to create an annotation which has as its ID the value indicated by this attribute. The “matches” attribute is used to create an ArrayList whose items are Integers representing the IDs of the annotations that the current one matches.

Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and <Person  
gate:gateId="25" gate:matches="23;25;30">John Major</Person> are  
the same person.

What GATE does when it parses this text is create two annotations:

a1.type = "Person"  
a1.ID = Integer(23)  
a1.start = <the start offset of John>  
a1.end = <the end offset of John>  
a1.featureMap = {}  
 
a2.type = "Person"  
a2.ID = Integer(25)  
a2.start = <the start offset of John Major>  
a2.end = <the end offset of John Major>  
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}  

Under GATE’s API, this option is available by calling gate.Document’s toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction, and for those found to violate it, a warning will be issued and they will be discarded.
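For example, a sketch (the annotation type is illustrative) that restores the original markup plus all Person annotations from the default annotation set:

import gate.AnnotationSet;

// the annotations to be written as inline tags
AnnotationSet persons = doc.getAnnotations().get("Person");

// original markup plus the selected annotations; annotations that cross
// other tags are discarded with a warning, as described above
String xml = doc.toXml(persons);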

In the next subsections we will show how these options apply to the other formats supported by GATE.

6.5.3 HTML

Input

HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE the same way as the XML documents.

The extensions associated with the HTML reader are:

The web server content type associated with HTML documents is: text/html.

The magic numbers test searches inside the document for the HTML signature (<html). There are certain HTML documents that do not contain the HTML tag, so the magic numbers test might not hold.

There is a certain degree of customisation for HTML documents, in that GATE introduces new lines into the document's text content in order to obtain a readable form. The annotations will refer to the pieces of text as described in the original document, but there will be a few extra new line characters inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new line (NL) character into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE will introduce one NL at the beginning of the paragraph and one at the end. The newly added NLs are not considered to be part of the text contained by the tag.

Output

The Save as XML option works exactly the same for all GATE documents, so there is no particular observation to be made for the HTML format.

When attempting to preserve the original markup formatting, GATE will generate the document in XHTML. The HTML document will look the same in any browser after being processed by GATE, but it will be in a different syntax.

6.5.4 SGML

Input

The SGML support in GATE is fairly light, as there is no freely available Java SGML parser. GATE uses a light converter that attempts to transform the input SGML file into a well-formed XML document. Because it does not make use of a DTD, the conversion might not always be accurate. It is advisable to perform an SGML-to-XML conversion outside the system (using other specialised tools) before using an SGML document inside GATE.

The extensions associated with the SGML reader are:

The web server content type associated with SGML documents is: text/sgml.

There is no magic numbers test for SGML.

Output

When attempting to preserve the original markup formatting, GATE will generate the document as XML, because the real input of an SGML document inside GATE is an XML one.

6.5.5 Plain text

Input

When reading a plain text document, GATE attempts to detect its paragraphs and adds “paragraph” annotations to the document's “Original markups” annotation set. It does this by detecting two consecutive NLs. The procedure works for both UNIX-like and DOS-like text files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.  
 
Paragraph 2. This text belongs to the second paragraph

then two “paragraph” type annotations will be created in the “Original markups” annotation set (referring to the first and second paragraphs) with an empty feature map.

The extensions associated with the plain text reader are:

The web server content type associated with plain text documents is: text/plain.

There is no magic numbers test for plain text.

Output

When attempting to preserve the original markup formatting, GATE will dump XML markup that surrounds the text referred to.

The procedure described above applies both for plain text and RTF documents.

6.5.6 RTF

Input

Accessing RTF documents is performed using Java's RTF editor kit. It only extracts the document's text content from the RTF document.

The extension associated with the RTF reader is “rtf”.

The web server content type associated with RTF documents is: text/rtf.

The magic numbers test searches for {\rtf1.

Output

Same as the plain text output.

6.5.7 Email

Input

GATE is able to read email messages packed in one document (UNIX mailbox format). It detects multiple messages inside such documents and, for each message, creates annotations for all the fields composing an e-mail, such as date, from, to, subject, etc. The message's body is analysed and paragraph detection is performed (just as in the plain text case). All annotations created have as their type the name of the e-mail field, and they are placed in the “Original markups” annotation set.

Example:

From someone@zzz.zzz.zzz Wed Sep  6 10:35:50 2000  
 
Date: Wed, 6 Sep 2000 10:35:49 +0100 (BST)  
 
From: forename1 surname2 <someone1@yyy.yyy.xxx>  
 
To: forename2 surname2 <someone2@ddd.dddd.dd.dd>  
 
Subject: A subject  
 
Message-ID: <Pine.SOL.3.91.1000906103251.26010A-100000@servername>  
MIME-Version: 1.0  
Content-Type: TEXT/PLAIN; charset=US-ASCII  
 
This text belongs to the e-mail body....  
 
This is a paragraph in the body of the e-mail  
 
This is another paragraph.

GATE attempts to detect lines such as “From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000” in the e-mail text. Such lines separate the e-mail messages contained in one file. After that, for each field in an e-mail message, annotations are created as follows:

The annotation type will be the name of the field, the feature map will be empty, and the annotation will span from the end of the field until the end of the line containing it.

Example:

 
a1.type = "date"; a1 spans between the two ^ ^.  
Date:^ Wed, 6 Sep 2000 10:35:49 +0100 (BST)^  
 
a2.type = "from"; a2 spans between the two ^ ^.  
From:^ forename1 surname2 <someone1@yyy.yyy.xxx>^

The extensions associated with the email reader are:

The web server content type associated with e-mail documents is: text/email.

The magic numbers test searches for keywords like Subject:, etc.

Output

Same as plain text output.

6.6 XML Input/Output [#]

Support for input from and output to XML is described in section 6.5.2. In short:

When using the GATE framework, object representations of XML documents such as DOM or jDOM, or query and transformation languages such as X-Path or XSLT, may be used in parallel with GATE’s own Document representation (gate.Document) without conflicts.
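As a sketch of such parallel use, the GATE-format XML produced by toXml() can be handed straight to a standard DOM parser (assuming doc is an existing gate.Document; variable names are illustrative):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// serialize the GATE document and re-parse it into a DOM tree
String gateXml = doc.toXml();
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
org.w3c.dom.Document dom = builder.parse(
    new ByteArrayInputStream(gateXml.getBytes("UTF-8")));

// the DOM tree and the gate.Document can now be used side by side;
// the fully-qualified org.w3c.dom.Document avoids any clash with gate.Document
System.out.println(dom.getDocumentElement().getTagName());  // "GateDocument"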

Chapter 7
JAPE: Regular Expressions Over Annotations [#]

If Osama bin Laden did not exist, it would be necessary to invent him. For the past four years, his name has been invoked whenever a US president has sought to increase the defence budget or wriggle out of arms control treaties. He has been used to justify even President Bush’s missile defence programme, though neither he nor his associates are known to possess anything approaching ballistic missile technology. Now he has become the personification of evil required to launch a crusade for good: the face behind the faceless terror.

The closer you look, the weaker the case against Bin Laden becomes. While the terrorists who inflicted Tuesday’s dreadful wound may have been inspired by him, there is, as yet, no evidence that they were instructed by him. Bin Laden’s presumed guilt appears to rest on the supposition that he is the sort of man who would have done it. But his culpability is irrelevant: his usefulness to western governments lies in his power to terrify. When billions of pounds of military spending are at stake, rogue states and terrorist warlords become assets precisely because they are liabilities.

The need for dissent, George Monbiot, The Guardian, Tuesday September 18, 2001.

This chapter describes JAPE – a Java Annotation Patterns Engine. JAPE provides finite state transduction over annotations based on regular expressions. JAPE is a version of CPSL – Common Pattern Specification Language.

JAPE allows you to recognise regular expressions in annotations on documents. Hang on, there's something wrong here: a regular language can only describe sets of strings, not graphs, and GATE's model of annotations is based on graphs. Hmmm. Another way of saying this: typically, regular expressions are applied to character strings, a simple linear sequence of items, but here we are applying them to a much more complex data structure. The result is that in certain cases the matching process is non-deterministic (i.e. the results are dependent on random factors like the addresses at which data is stored in the virtual machine): when there is structure in the graph being matched that requires more than the power of a regular automaton to recognise, JAPE chooses an alternative arbitrarily. However, this is not the bad news that it seems to be, as it turns out that in many useful cases the data stored in annotation graphs in GATE (and other language processing systems) can be regarded as simple sequences, and matched deterministically with regular expressions.

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand-side (LHS) of the rules consist of an annotation pattern that may contain regular expression operators (e.g. *, ?, +). The right-hand-side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements.

At the beginning of each grammar, several options can be set:

Input annotations must also be defined at the start of each grammar. If no annotations are defined, all annotations will be matched.

There are 3 main ways in which the pattern can be specified:

Macros can also be used in the LHS of rules. This means that instead of expressing the information in the rule, it is specified in a macro, which can then be called in the rule. The reason for this is simply to avoid having to repeat the same information in several rules. Macros can themselves be used inside other macros.

New as of September 2008, in addition to referencing annotation features, JAPE allows access to other “meta-properties” of an annotation. This is done by using an “@” symbol rather than a “.” symbol after the annotation type name. The three meta-properties that are built in are:

At this time, you cannot access the value of a meta-property from a non-Java RHS of a rule (e.g. you can't write {X@length > "5"}:label --> :label.New = {somefeat = :label.X@length}). We hope to add this at some point.

The same union and Kleene operators can be used as for the tokeniser rules, i.e.

|  
*  
?  
+

New as of late-September 2008, a range notation can also be added. e.g.

({Token})[1,3]

matches one to three Tokens in a row.

({Token.kind == number})[3]

matches exactly 3 number Tokens in a row.

The pattern description is followed by a label for the annotation. A label is denoted by a preceding colon; in the example below, the label is :location.

The RHS of the rule contains information about the annotation. Information about the annotation is transferred from the LHS of the rule using the label just described, and annotated with the entity type (which follows it). Finally, attributes and their corresponding values are added to the annotation. Alternatively, the RHS of the rule can contain Java code to create or manipulate annotations.

In the simple example below, the pattern described will be awarded an annotation of type “Enamex” (because it is an entity name). This annotation will have the attribute “kind”, with value “location”, and the attribute “rule”, with value “GazLocation”. (The purpose of the “rule” attribute is simply to ease the process of manual rule validation).

Rule: GazLocation  
(  
{Lookup.majorType == location}  
)  
:location -->  
 :location.Enamex = {kind="location", rule=GazLocation}

It is also possible to have more than one pattern and corresponding action, as shown in the rule below. On the LHS, each pattern is enclosed in a set of round brackets and has a unique label; on the RHS, each label is associated with an action. In this example, the Lookup annotation is labelled “jobtitle” and is given the new annotation JobTitle; the TempPerson annotation is labelled “person” and is given the new annotation Person.

Rule: PersonJobTitle  
Priority: 20  
 
(  
 {Lookup.majorType == jobtitle}  
):jobtitle  
(  
 {TempPerson}  
):person  
-->  
    :jobtitle.JobTitle = {rule = "PersonJobTitle"},  
    :person.Person = {kind = "personName", rule = "PersonJobTitle"}

Similarly, labelled patterns can be nested, as in the example below, where the whole pattern is annotated as Person, but within the pattern, the jobtitle is annotated as JobTitle.

Rule: PersonJobTitle2  
Priority: 20  
 
(  
(  
 {Lookup.majorType == jobtitle}  
):jobtitle  
 {TempPerson}  
):person  
-->  
    :jobtitle.JobTitle = {rule = "PersonJobTitle"},  
    :person.Person = {kind = "personName", rule = "PersonJobTitle"}

JAPE provides limited support for copying annotation feature values from the left to the right hand side of a rule, for example:

Rule: LocationType  
 
(  
 {Lookup.majorType == location}  
):loc  
-->  
    :loc.Location = {rule = "LocationType", type = :loc.Lookup.minorType}

This will set the “type” feature of the generated location to the value of the “minorType” feature from the “Lookup” annotation bound to the loc label. If the Lookup has no minorType, the Location will have no “type” feature. The behaviour of newFeat = :bind.Type.oldFeat is:

Notice that the behaviour is deliberately underspecified if there is more than one Type annotation in bind. If you need more control, or if you want to copy several feature values from the same left hand side annotation, you should consider using Java code on the right hand side of your rule (see section 7.7).
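
As a sketch of that approach (the rule name and the output feature names are illustrative, not taken from ANNIE), the following Java RHS copies both the minorType and the majorType of the matched Lookup onto the new annotation:

Rule: LocationTypeJava
(
 {Lookup.majorType == location}
):loc
-->
{
  gate.AnnotationSet locSet = (gate.AnnotationSet)bindings.get("loc");
  // take one (arbitrary) annotation from the binding; here only Lookups can match
  gate.Annotation lookupAnn = (gate.Annotation)locSet.iterator().next();
  gate.FeatureMap features = Factory.newFeatureMap();
  // copy several feature values from the matched Lookup in one block
  features.put("type", lookupAnn.getFeatures().get("minorType"));
  features.put("majorType", lookupAnn.getFeatures().get("majorType"));
  features.put("rule", "LocationTypeJava");
  outputAS.add(locSet.firstNode(), locSet.lastNode(), "Location", features);
}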

Grammar rules can essentially be of two types. The first type of rule involves no gazetteer lookup, but can be defined using a small set of possible formats. In general, these are fairly straightforward and offer little potential for ambiguity.

The second type of rule relies more heavily on the gazetteer lists, and covers a much wider range of possibilities. This not only means that many rules may be needed to describe all the situations, but also that there is a much greater potential for ambiguity. This leads to the necessity for rule ordering and prioritisation, as discussed below.

For example, a single rule is sufficient to identify an IP address, because there is only one basic format - a series of numbers, each set connected by a dot. The rule for this is given below:

Rule: IPAddress  
(  
 {Token.kind == number}  
 {Token.string == "."}  
 {Token.kind == number}  
 {Token.string == "."}  
 {Token.kind == number}  
 {Token.string == "."}  
 {Token.kind == number}  
)  
:ipAddress -->  
 :ipAddress.Address = {kind = "ipAddress"}

To identify a date or time, there are many possible variations, and so many rules are needed. For example, the same date information can appear in the following formats (amongst others):

 Wed, 10/7/00  
 Wed, 10/July/00  
 Wed, 10 July, 2000  
 Wed 10th of July, 2000  
 Wed. July 10th, 2000  
 Wed 10 July 2000
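
For illustration, a single simplified rule covering only the first of these formats might look like the following sketch (the rule name is invented, and it assumes the gazetteer supplies Lookup annotations with majorType “date” and minorType “day” for abbreviated day names); a full date grammar needs many such rules:

Rule: DateNumericSlash
// e.g. "Wed, 10/7/00"
(
 {Lookup.majorType == date, Lookup.minorType == day}
 {Token.string == ","}
 ({SpaceToken})?
 {Token.kind == number}
 {Token.string == "/"}
 {Token.kind == number}
 {Token.string == "/"}
 {Token.kind == number}
):date
-->
 :date.Date = {kind = "date", rule = "DateNumericSlash"}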

Different types of date can also be expressed. For example, the following would also be classified as date entities:

 the late ’80s  
 Monday  
 St. Andrew’s Day  
 99 BC  
 mid-November  
 1980-81  
 from March to April

This also means there is a much greater potential for ambiguity. For example, many of the months of the year can also be girls’ Christian names (e.g. May, June). This means that contextual information may be needed to disambiguate them, or we may have to guess which is more likely, based on frequency. For example, while “Friday” could be a person’s name (as in “Man Friday”), it is much more likely to be a day of the week.

Finally, macros can also be used on the RHS of rules. In this case, the label (which matches the label on the LHS of the rule) should be included in the macro. Below we give an example of using a macro on the RHS.

Macro: UNDERSCORES_OKAY          // separate  
:match                                              // lines  
{  
    gate.AnnotationSet matchedAnns = (gate.AnnotationSet)bindings.get("match");  
 
    int begOffset = matchedAnns.firstNode().getOffset().intValue();  
    int endOffset = matchedAnns.lastNode().getOffset().intValue();  
    String mydocContent = doc.getContent().toString();  
    String matchedString = mydocContent.substring(begOffset, endOffset);  
 
    gate.FeatureMap newFeatures = Factory.newFeatureMap();  
 
    if(matchedString.equals("Spanish"))     {  
     newFeatures.put("myrule",  "Lower");  
    }  
    else    {  
     newFeatures.put("myrule",  "Upper");  
    }  
 
    newFeatures.put("quality",  "1");  
    annotations.add(matchedAnns.firstNode(), matchedAnns.lastNode(),  
                              "Spanish_mark", newFeatures);  
}  
 
Rule: Lower  
(  
    ({Token.string == "Spanish"})  
:match)-->UNDERSCORES_OKAY   // no label here, only macro name  
 
Rule: Upper  
(  
    ({Token.string == "SPANISH"})  
:match)-->UNDERSCORES_OKAY   // no label here, only macro name  
 
 

7.1 Matching operators in detail [#]

This section gives more detail on the behaviour of the matching operators used on the left-hand side of JAPE rules.

7.1.1 Equality operators (“==” and “!=”)

The basic operator in JAPE is equality. {Lookup.majorType == "person"} matches a Lookup annotation whose majorType feature has the value “person”. Similarly {Lookup.majorType != "person"} would match any Lookup whose majorType feature does not have the value “person”. If a feature is missing it is treated as if it had an empty string as its value, so this would also match a Lookup annotation that did not have a majorType feature at all.

Certain type coercions are performed:

The != operator matches exactly when == doesn’t.

7.1.2 Comparison operators (“<”, “<=”, “>=” and “>”)

Comparison operators have their expected meanings, for example {Token.length > 3} matches a Token annotation whose length attribute is an integer greater than 3. The behaviour of the operators depends on the type of the constraint’s attribute:

7.1.3 Regular expression operators (“=~”, “==~”, “!~” and “!=~”) [#]

These operators match regular expressions. {Token.string =~ "[Dd]ogs"} matches a Token annotation whose string feature contains a substring that matches the regular expression [Dd]ogs; using !~ would match if the feature value does not contain a substring that matches the regular expression. The ==~ and !=~ operators are like =~ and !~ respectively, but require that the whole value match (or not match) the regular expression. As with ==, missing features are treated as if they had the empty string as their value, so the constraint {Identifier.name ==~ "(?i)[aeiou]*"} would match an Identifier annotation which does not have a name feature, as well as any whose name contains only vowels.

The matching uses the standard Java regular expression library, so full details of the pattern syntax can be found in the JavaDoc documentation for java.util.regex.Pattern. There are a few specific points to note:

7.1.4 Contextual operators (“contains” and “within”) [#]

These operators match annotations within the context of other annotations. {X contains Y} matches an annotation of type X that completely contains an annotation of type Y, while {X within Y} matches an annotation of type X that falls entirely within the span of an annotation of type Y.

For either operator, the right-hand value (Y in the above examples) can be a full constraint itself. For example {X contains {Y.foo=bar}} is also accepted. The operators can be used in a multi-constraint statement just like any of the traditional ones, so {X.f1 != "something", X contains {Y.foo=bar}} is valid.
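
The following sketch (the rule name is invented, and it assumes Organization annotations have been created by an earlier phase) shows a typical use of contains:

Rule: OrgWithLocation
(
 {Organization contains {Lookup.majorType == location}}
):org
-->
 :org.OrgWithLocation = {rule = "OrgWithLocation"}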

It is possible to add additional custom operators without modifying the JAPE language. There are new init-time parameters to the Transducer so that additional annotation “meta-property” accessors and custom operators can be referenced at runtime. To add a custom operator, write a class that implements gate.jape.constraint.ConstraintPredicate, and then list that class name for the Transducer’s “operators” property. Similarly, to add a custom “meta-property” accessor, write a class that implements gate.jape.constraint.AnnotationAccessor, and then list that class name in the Transducer’s “annotationAccessors” property.

7.2 Use of Context

Context can be dealt with in the grammar rules in the following way. The pattern to be annotated is always enclosed by a set of round brackets. If preceding context is to be included in the rule, this is placed before this set of brackets. This context is described in exactly the same way as the pattern to be matched. If context following the pattern needs to be included, it is placed after the label given to the annotation. Context is used where a pattern should only be recognised if it occurs in a certain situation, but the context itself does not form part of the pattern to be annotated.

For example, the following rule for Time (assuming an appropriate macro for “year”) would mean that a year would only be recognised if it occurs preceded by the words “in” or “by”:

Rule: YearContext1  
 
({Token.string == "in"}|  
 {Token.string == "by"}  
)  
(YEAR)  
:date -->  
 :date.Timex = {kind = "date", rule = "YearContext1"}

Similarly, the following rule (assuming an appropriate macro for “email”) would mean that an email address would only be recognised if it occurred inside angled brackets (which would not themselves form part of the entity):

Rule: Emailaddress1  
({Token.string == "<"})  
(  
 (EMAIL)  
)  
:email  
({Token.string == ">"})  
-->  
 :email.Address= {kind = "email", rule = "Emailaddress1"}

Also, it is possible to specify the constraint that one annotation must start at the same place as another. For example:

Rule: SurnameStartingWithDe  
(  
  {Token.string == "de",  
   Lookup.majorType == "name",  
   Lookup.minorType == "surname"}  
):de  
-->  
 :de.Surname = {prefix = "de"}

This rule would match anywhere where a Token with string “de” and a Lookup with majorType “name” and minorType “surname” start at the same offset in the text. Both the Lookup and Token annotations would be included in the :de binding, so the Surname annotation generated would span the longer of the two. Constraints on the same annotation type must be satisfied by a single annotation, so in this example there must be a single Lookup matching both the major and minor types – the rule would not match if there were two different lookups at the same location, one of them satisfying each constraint.

7.3 Use of Priority [#]

Each grammar has one of 5 possible control styles: “brill”, “all”, “first”, “once” and “appelt”. This is specified at the beginning of the grammar.

The Brill style means that when more than one rule matches the same region of the document, they are all fired. The result of this is that a segment of text could be allocated more than one entity type, and that no priority ordering is necessary. Brill will execute all matching rules starting from a given position and will advance and continue matching from the position in the document where the longest match finishes.

The “all” style is similar to Brill, in that it will also execute all matching rules, but the matching will continue from the offset immediately following the current one, rather than from the end of the longest match.

For example, where [] are annotations of type Ann

[aaa[bbb]] [ccc[ddd]]

then a rule matching {Ann} and creating {Ann-2} for the same spans will generate:

BRILL: [aaabbb] [cccddd]  
ALL: [aaa[bbb]] [ccc[ddd]]

With the “first” style, a rule fires for the first match that is found. This makes it inappropriate for rules that end in “+”, “?” or “*”. Once a match is found the rule is fired; it does not attempt to get a longer match (as the other styles do).

With the “once” style, once a rule has fired, the whole JAPE phase exits after the first match.

With the appelt style, only one rule can be fired for the same region of text, according to a set of priority rules. Priority operates in the following way.

  1. From all the rules that match a region of the document starting at some point X, the one which matches the longest region is fired.
  2. If more than one rule matches the same region, the one with the highest priority is fired
  3. If there is more than one rule with the same priority, the one defined earlier in the grammar is fired.

An optional priority declaration is associated with each rule, which should be a positive integer. The higher the number, the greater the priority. By default (if the priority declaration is missing) all rules have the priority -1 (i.e. the lowest priority).

For example, the following two rules for location could potentially match the same text.

Rule:   Location1  
Priority: 25  
 
(  
 ({Lookup.majorType == loc_key, Lookup.minorType == pre}  
  {SpaceToken})?  
 {Lookup.majorType == location}  
 ({SpaceToken}  
  {Lookup.majorType == loc_key, Lookup.minorType == post})?  
)  
:locName -->  
  :locName.Location = {kind = "location", rule = "Location1"}  
 
 
Rule: GazLocation  
Priority: 20  
  (  
  ({Lookup.majorType == location}):location  
  )  
  -->   :location.Name = {kind = "location", rule=GazLocation}

Assume we have the text “China sea”, that “China” is defined in the gazetteer as “location”, and that “sea” is defined as a “loc_key” of type “post”. In this case, rule Location1 would apply, because it matches a longer region of text starting at the same point (“China sea”, as opposed to just “China”). Now assume we just have the text “China”. In this case, both rules could be fired, but the priority for Location1 is higher, so it takes precedence. In this case, since both rules produce the same annotation, it is not so important which rule is fired, but this is not always the case.

One important point of which to be aware is that prioritisation only operates within a single grammar. Although we could make priority global by having all the rules in a single grammar, this is not ideal due to other considerations. Instead, we currently combine all the rules for each entity type in a single grammar. An index file (main.jape) is used to define which grammars should be used, and in which order they should be fired.
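
A hypothetical index file illustrating this might look as follows; the phase names are invented, and each one refers to a .jape grammar file in the same directory:

MultiPhase: NE
Phases:
 firstname
 name
 date
 location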

7.4 Use of negation [#]

All the examples in the preceding sections involve constraints that require the presence of certain annotations to match. JAPE also supports “negative” constraints which specify the absence of annotations. A negative constraint is signalled in the grammar by a “!” character.

Negative constraints are generally used in combination with positive ones to constrain the locations at which the positive constraint can match. For example:

Rule: PossibleName  
(  
 {Token.orth == "upperInitial", !Lookup}  
):name  
-->  
 :name.PossibleName = {}

This rule would match any uppercase-initial Token, but only where there is no Lookup annotation starting at the same location. The general rule is that a negative constraint matches at any location where the corresponding positive constraint would not match. Negative constraints do not contribute any annotations to the bindings - in the example above, the :name binding would contain only the Token annotation. The exception to this is when a negative constraint is used alone, without any positive constraints in the combination. In this case it binds all the annotations at the match position that do not match the constraint. Thus, {!Lookup} would bind all the annotations starting at this location except Lookups. In most cases, negative constraints should only be used in combination with positive ones.

Any constraint can be negated, for example:

Rule: SurnameNotStartingWithDe  
(  
 {Surname, !Token.string ==~ "[Dd]e"}  
):name  
-->  
 :name.NotDe = {}

This would match any Surname annotation that does not start at the same place as a Token with the string “de” or “De”. Note that this is subtly different from {Surname, Token.string !=~ "[Dd]e"}, as the second form requires a Token annotation to be present, whereas the first form (!Token...) will match if there is no Token annotation at all at this location.

7.5 Useful tricks [#]

Although the JAPE language has some limitations as to how rules and patterns can be expressed, there are some useful tricks to overcome these problems.

7.6 Ontology aware grammar transduction [#]

GATE supports two different methods for ontology aware grammar transduction. Firstly it is possible to use the ontology feature both in grammars and annotations, while using the default transducer. Secondly it is possible to use an ontology aware transducer by passing an ontology language resource to one of the subsumes methods in SimpleFeatureMapImpl. This second strategy does not check for ontology features, which will make the writing of grammars easier, as there is no need to specify ontology when writing them. More information about the ontology-aware transducer can be found in Section 10.6.

7.7 Using Java code in JAPE rules [#]

The RHS of a JAPE rule can consist of any Java code. This is useful for removing temporary annotations and for percolating and manipulating features from previous annotations, as the examples below illustrate.

The first rule below shows a rule which matches a first person name, e.g. “Fred”, and adds a gender feature depending on the value of the minorType from the gazetteer list in which the name was found. We first get the bindings associated with the person label (i.e. the Lookup annotation). We then create a new annotation called “personAnn” which contains this annotation, and create a new FeatureMap to enable us to add features. Then we get the minorType features (and its value) from the personAnn annotation (in this case, the feature will be “gender” and the value will be “male”), and add this value to a new feature called “gender”. We create another feature “rule” with value “FirstName”. Finally, we add all the features to a new annotation “FirstPerson” which attaches to the same nodes as the original “person” binding.

Note that inputAS and outputAS represent the input and output annotation sets. Normally, these would be the same (by default when using ANNIE, these will be the “Default” annotation set). Since the user is at liberty to change the input and output annotation sets in the parameters of the JAPE transducer at runtime, it cannot be guaranteed that the input and output annotation sets will be the same, and therefore we must specify the annotation set we are referring to.

Rule: FirstName  
 
(  
 {Lookup.majorType == person_first}  
):person  
-->  
{  
gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");  
gate.Annotation personAnn = (gate.Annotation)person.iterator().next();  
gate.FeatureMap features = Factory.newFeatureMap();  
features.put("gender", personAnn.getFeatures().get("minorType"));  
features.put("rule", "FirstName");  
outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson",  
features);  
}

The second rule (contained in a subsequent grammar phase) makes use of annotations produced by the first rule described above. Instead of percolating the minorType from the annotation produced by the gazetteer lookup, this time it percolates the feature from the annotation produced by the previous grammar rule. So here it gets the “gender” feature value from the “FirstPerson” annotation, and adds it to a new feature (again called “gender” for convenience), which is added to the new annotation (in outputAS) “TempPerson”. At the end of this rule, the existing input annotations (from inputAS) are removed because they are no longer needed. Note that in the previous rule, the existing annotations were not removed, because it is possible they might be needed later on in another grammar phase.

Rule: GazPersonFirst  
(  
 {FirstPerson}  
)  
:person  
-->  
{  
gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");  
gate.Annotation personAnn = (gate.Annotation)person.iterator().next();  
gate.FeatureMap features = Factory.newFeatureMap();  
 
features.put("gender", personAnn.getFeatures().get("gender"));  
features.put("rule", "GazPersonFirst");  
outputAS.add(person.firstNode(), person.lastNode(), "TempPerson",  
features);  
inputAS.removeAll(person);  
}

7.7.1 Adding a feature to the document

The following example code shows how to add the feature “genre” with value “email” to the document, using Java code on the RHS of a rule:

Rule: Email  
Priority: 150  
 
(  
 {message}  
)  
-->  
{  
doc.getFeatures().put("genre", "email");  
}

7.7.2 Using named blocks [#]

For the common case where a Java block refers just to the annotations from a single left-hand-side binding, JAPE provides a shorthand notation:

Rule: RemoveDoneFlag  
 
(  
  {Instance.flag == "done"}  
):inst  
-->  
:inst{  
  Annotation theInstance = (Annotation)instAnnots.iterator().next();  
  theInstance.getFeatures().remove("flag");  
}

This rule is equivalent to the following:

Rule: RemoveDoneFlag  
 
(  
  {Instance.flag == "done"}  
):inst  
-->  
{  
  AnnotationSet instAnnots = (AnnotationSet)bindings.get("inst");  
  if(instAnnots != null && instAnnots.size() != 0) {  
    Annotation theInstance = (Annotation)instAnnots.iterator().next();  
    theInstance.getFeatures().remove("flag");  
  }  
}

A label :<label> on a Java block creates a local variable <label>Annots within the Java block which is the AnnotationSet bound to the <label> label. Also, the Java code in the block is only executed if there is at least one annotation bound to the label, so you do not need to check this condition in your own code. Of course, if you need more flexibility, e.g. to perform some action in the case where the label is not bound, you will need to use an unlabelled block and perform the bindings.get() yourself.

7.7.3 Java RHS overview [#]

When a JAPE grammar is parsed, the JAPE parser creates an action class for each Java RHS in the grammar (one action class per RHS). The RHS Java code is embedded as the body of the method doIt and works in the context of that method. When a particular rule is fired, the method doIt is executed.

Method doIt is specified by the interface gate.jape.RhsAction. Each action class implements this interface and is generated with the following template:

import java.io.*;  
import java.util.*;  
import gate.*;  
import gate.jape.*;  
import gate.creole.ontology.Ontology;  
import gate.annotation.*;  
import gate.util.*;  
class <AutogeneratedActionClassName>  
         implements java.io.Serializable, RhsAction {  
    public void doIt(Document doc,  
                     java.util.Map bindings,  
                     AnnotationSet annotations,  
                     AnnotationSet inputAS,  
                     AnnotationSet outputAS,  
                     Ontology ontology) {  
        // your RHS Java code will be embedded here  
...  
    }  
}

Method doIt has the following parameters that can be used in RHS Java code:

In your Java RHS you can use short names for all Java classes that are imported by the action class (plus Java classes from the packages that are imported by default according to JVM specification: java.lang.*, java.math.*). But you need to use fully qualified Java class names for all other classes. For example:

-->  
{  
  // VALID line examples  
  AnnotationSet as = ...  
  InputStream is = ...  
  java.util.logging.Logger myLogger =  
          java.util.logging.Logger.getLogger("JAPELogger");  
  java.sql.Statement stmt = ...  
 
  // INVALID line examples  
  Logger myLogger = Logger.getLogger("JapePhaseLogger");  
  Statement stmt = ...  
}

7.8 Optimising for speed [#]

The way in which grammars are designed can have a huge impact on the processing speed. Some simple tricks to keep the processing as fast as possible are:

7.9 Serializing JAPE Transducer [#]

JAPE grammars are written as files with the extension “.jape”, which are parsed and compiled at run-time to execute them over the GATE document(s). Serialization of the JAPE Transducer adds the capability to serialize such grammar files and use them later to bootstrap new JAPE transducers, which then do not need the original JAPE grammar file. This allows people to distribute the serialized version of their grammars without disclosing the actual contents of their jape files. This is implemented as part of the JAPE Transducer PR. The following sections describe how to serialize and deserialize them.

7.9.1 How to serialize?

Once an instance of a JAPE transducer is created, the option to serialize it appears in the option menu of that instance. The option menu can be activated by right clicking on the respective PR. Having done so, it asks for the file name where the serialized version of the respective JAPE grammar is stored.

7.9.2 How to use the serialized grammar file?

The JAPE Transducer now also has an init-time parameter binaryGrammarURL, which is an optional alternative to the grammarURL parameter. The user can use this parameter (i.e. binaryGrammarURL) to specify the serialized grammar file.

7.10 The JAPE Debugger [#]

The Jape debugger helps to find errors in Jape programs, enabling the user to see in detail how a Jape rule works when applied to a particular range of text. It was written by Ontos, who also provided the original version of this documentation. The debugger allows the user to select a particular part of the text, and then look at the detailed history of processing. This enables them to see which rules were matched and which were not, and also why particular rules were or were not matched. It is also possible to set breakpoints for particular rules, enabling the user to see how the rule was matched, and what annotations were created.

The Jape debugger could be useful in situations where the old simple DEBUG OUTPUT method does not help. For example when:

7.10.1 Debugger GUI

The layout of the JAPE-debugger user interface is shown in Figure 7.1.


PIC


Figure 7.1: The JAPE Debugger User Interface


The debugger’s main frame consists of the following primary components:

7.10.2 Using the Debugger

In most situations you will use the debugger in trace mode using the following steps:

After these steps the following information becomes available. In the Resources tree some of the rules become highlighted in different colors:

Trace history is the main debugging tab in Debugging panel. It contains the source of the JAPE rule currently selected, and the selected text in the document panel. All the inputs are shown, and matched inputs are highlighted in green. Annotations, which made the rule fail, are highlighted in red. If a rule tried to match more than one time on the selected text interval, buttons on the top of the panel (Previous and Next) become enabled, and allow one to observe all the matching attempts of the rule. Clicking on any of the inputs shows an annotation window, and the tool tip of the matched words gives the template in the rule.

Step by Step Example

To give an idea of how to use the debugger for fixing bugs, let us consider the following example. For instance, there is a rule named PersonFullExt, which should find person names: A. B. Dick, J. F. Kennedy and so on, and create an annotation Person. To test the rule, we run GATE on a text fragment containing the following words: the J.L. Kellogg Graduate School, so we would expect that the part of the text J. L. Kellogg should get an annotation Person. Unfortunately, we encounter a problem (because only L. Kellogg was matched), so we decide to use the debugger to find the reason for this unexpected behavior. With the JAPE debugger, it is possible to observe everything needed for finding and fixing the error.

The appropriate screenshot is shown in Figure 7.2.


PIC


Figure 7.2: Finding Errors


As you can see, the rule NotPersonFull matched the text ‘the J’, so the rule PersonFullExt could start matching only after the pointer has moved to the token ‘.’. Without the debugger, it wouldn’t be so easy to find the reason for this error, because the rule NotPersonFull doesn’t create any annotations.

An additional feature of the debugger is the availability of debugging with breakpoints (Jape Rule Tab). After setting a breakpoint on a given rule (in our case it is the rule named TheOrgXBase), the GATE transducer will be interrupted at the breakpoint, and in the document panel the text that is currently matched by the rule (it is highlighted in cyan) will be displayed. In the tab, a special table representation of the rule (with what it matches on the left side), and the history of annotations created by this rule, will be displayed, as in Figure 7.3.


PIC


Figure 7.3: The Interface of the JAPE Debugger while Running in Breakpoint Mode


7.10.3 Known Bugs

1. Debugger doesn’t see processing resource reinitialization. A possible workaround is to close and open the resource again.

7.11 Notes for Montreal Transducer users [#]

In June 2008, the standard JAPE transducer implementation gained a number of features inspired by Luc Plamondon’s “Montreal Transducer”, which has been available as a GATE plugin for several years. If you have existing Montreal Transducer grammars and want to update them to work with the standard JAPE implementation you should be aware of the following differences in behaviour:

Chapter 8
ANNIE: a Nearly-New Information Extraction System [#]

And so the time had passed predictably and soberly enough in work and routine chores, and the events of the previous night from first to last had faded; and only now that both their days’ work was over, the child asleep and no further disturbance anticipated, did the shadowy figures from the masked ball, the melancholy stranger and the dominoes in red, revive; and those trivial encounters became magically and painfully interfused with the treacherous illusion of missed opportunities. Innocent yet ominous questions and vague ambiguous answers passed to and fro between them; and, as neither of them doubted the other’s absolute candour, both felt the need for mild revenge. They exaggerated the extent to which their masked partners had attracted them, made fun of the jealous stirrings the other revealed, and lied dismissively about their own. Yet this light banter about the trivial adventures of the previous night led to more serious discussion of those hidden, scarcely admitted desires which are apt to raise dark and perilous storms even in the purest, most transparent soul; and they talked about those secret regions for which they felt hardly any longing, yet towards which the irrational wings of fate might one day drive them, if only in their dreams. For however much they might belong to one another heart and soul, they knew last night was not the first time they had been stirred by a whiff of freedom, danger and adventure.

Dream Story, Arthur Schnitzler, 1926 (pp. 4-5).

GATE was originally developed in the context of Information Extraction (IE) R&D, and IE systems in many languages and shapes and sizes have been created using GATE with the IE components that have been distributed with it (see [Maynard et al. 00] for descriptions of some of these projects).1

GATE is distributed with an IE system called ANNIE, A Nearly-New IE system (developed by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin Dimitrov and others). ANNIE relies on finite state algorithms and the JAPE language (see chapter 7).

ANNIE components form a pipeline which appears in figure 8.1.


PIC


Figure 8.1: ANNIE and LaSIE


ANNIE components are included with GATE (though the linguistic resources they rely on are generally more simple than the ones we use in-house). The rest of this chapter describes these components.

8.1 Tokeniser [#]

The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable.

8.1.1 Tokeniser Rules

A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the AnnotationSet. The LHS is separated from the RHS by ’>’. The following operators can be used on the LHS:

| (or)  
* (0 or more occurrences)  
? (0 or 1 occurrences)  
+ (1 or more occurrences)

The RHS uses ’;’ as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}

Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* >  
  Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.

8.1.2 Token Types

In the default set of rules, the following kinds of Token and SpaceToken are possible:

Word

A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute “orth”, for which four values are defined:

Number

A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.

Symbol

Two types of symbol are defined: currency symbol (e.g. ‘$’, ‘£’) and symbol (e.g. ‘&’, ‘^’). These are represented by any number of consecutive currency or other symbols (respectively).

Punctuation

Three types of punctuation are defined: start_punctuation (e.g. ‘(’), end_punctuation (e.g. ‘)’), and other punctuation (e.g. ‘:’). Each punctuation symbol is a separate token.

SpaceToken

White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogeneous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.

8.1.3 English Tokeniser [#]

The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer (see chapter 7). The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like “ ’30s”, “ ’Cause”, “ ’em”, “ ’N”, “ ’S”, “ ’s”, “ ’T”, “ ’d”, “ ’ll”, “ ’m”, “ ’re”, “ ’til”, “ ’ve”, etc. Another task of the JAPE transducer is to convert negative constructs like “don’t” from three tokens (“don”, “ ’ ” and “t”) into two tokens (“do” and “n’t”).

The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.

8.2 Gazetteer [#]

The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organisations, days of the week, etc.

Below is a small section of the list for units of currency:

Ecu  
European Currency Units  
FFr  
Fr  
German mark  
German marks  
New Taiwan dollar  
New Taiwan dollars  
NT dollar  
NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type. These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.

currency_prefix.lst:currency_unit:pre_amount  
currency_unit.lst:currency_unit:post_amount  
date.lst:date:specific  
day.lst:date:day

So, for example, if a specific day needs to be identified, the minor type “day” should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified, the major type “date” should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.
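
For example, a rule such as the following sketch (not an actual ANNIE rule) would match only gazetteer entries whose minor type is “day”; dropping the minorType constraint would instead match any entry whose major type is “date”:

Rule: SpecificDay
(
 {Lookup.majorType == date, Lookup.minorType == day}
):day
-->
 :day.Date = {kind = "day", rule = "SpecificDay"}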

In addition, the gazetteer allows arbitrary feature values to be associated with particular entries in a single list. ANNIE does not use this capability, but to enable it for your own gazetteers, set the optional gazetteerFeatureSeparator parameter to a single character (or an escape sequence such as \t or \uNNNN) when creating a gazetteer. In this mode, each line in a .lst file can have feature values specified, for example, with the following entry in the index file:

software_company.lst:company:software

the following software_company.lst:

Red Hat&stockSymbol=RHAT  
Apple Computer&abbrev=Apple&stockSymbol=AAPL  
Microsoft&abbrev=MS&stockSymbol=MSFT

and gazetteerFeatureSeparator set to &, the gazetteer will annotate Red Hat as a Lookup with features majorType=company, minorType=software and stockSymbol=RHAT. Note that you do not have to provide the same features for every line in the file, in particular it is possible to provide extra features for some lines in the list but not others.

Here is a full list of the parameters used by the Default Gazetteer:

Init-time parameters

listsURL
A URL pointing to the index file (usually lists.def) that contains the list of pattern lists.
encoding
The character encoding to be used while reading the pattern lists.
gazetteerFeatureSeparator
The character used to add arbitrary features to gazetteer entries. See above for an example.
caseSensitive
Should the gazetteer be case sensitive during matching.

Run-time parameters

document
The document to be processed.
annotationSetName
The name of the annotation set where the resulting Lookup annotations will be created.
wholeWordsOnly
Should the gazetteer only match whole words? If set to true, a string segment in the input document will only be matched if it is bordered by characters that are not letters, non spacing marks, or combining spacing marks (as identified by the Unicode standard).
longestMatchOnly
Should the gazetteer only match the longest possible string starting from any position. This parameter is only relevant when the list of lookups contains proper prefixes of other entries (e.g when both “Dell” and “Dell Europe” are in the lists). The default behaviour (when this parameter is set to true) is to only match the longest entry, “Dell Europe” in this example. This is the default GATE gazetteer behaviour since version 2.0. Setting this parameter to false will cause the gazetteer to match all possible prefixes.

8.3 Sentence Splitter [#]

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

Each sentence is annotated with the type Sentence. Each sentence break (such as a full stop) is also given a “Split” annotation. This has several possible types: “.”, “punctuation”, “CR” (a line break) or “multi” (a series of punctuation marks such as “?!?!”).

The sentence splitter is domain and application-independent.

There is an alternative ruleset for the Sentence Splitter which considers newlines and carriage returns differently. In general this version should be used when a new line on the page indicates a new sentence. To use this alternative version, simply load main-single-nl.jape from the default location instead of main.jape (the default file) when asked to select the location of the grammar file to be used.

8.4 RegEx Sentence Splitter [#]

The RegEx sentence splitter is an alternative to the standard ANNIE Sentence Splitter. Its main aim is to address some performance issues identified in the JAPE-based splitter, mainly to do with improving the execution time and robustness, especially when faced with irregular input.

As its name suggests, the RegEx splitter is based on regular expressions, using the default Java implementation.

The new splitter is configured by three files containing (Java style, see http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html) regular expressions, one regex per line. The three different files encode patterns for:

internal splits
sentence splits that are part of the sentence, such as sentence ending punctuation;
external splits
sentence splits that are NOT part of the sentence, such as 2 consecutive new lines;
non splits
text fragments that might be seen as splits but they should be ignored (such as full stops occurring inside abbreviations).

The new splitter comes with an initial set of patterns that try to emulate the behaviour of the original splitter (apart from the situations where the original one was obviously wrong, like not allowing sentences to start with a number).

Here is a full list of the parameters used by the RegEx Sentence Splitter:

Init-time parameters

encoding
The character encoding to be used while reading the pattern lists.
externalSplitListURL
URL for the file containing the list of external split patterns;
internalSplitListURL
URL for the file containing the list of internal split patterns;
nonSplitListURL
URL for the file containing the list of non split patterns;

Run-time parameters

document
The document to be processed.
outputASName
The name of the annotation set where the resulting Split and Sentence annotations will be created.

8.5 Part of Speech Tagger [#]

The tagger [Hepple 00] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in Appendix E. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon_cap), and one for texts in all lowercase (lexicon_lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.

The ANNIE Part-of-Speech tagger requires the following parameters.

If - (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType)

then - New features are added on existing annotations of type “baseTokenAnnotationType”.

otherwise - The tagger searches the “outputASName” annotation set for an annotation of type “outputAnnotationType” with the same offsets as the annotation of type “baseTokenAnnotationType”. If it finds one, it adds the new features to that annotation; otherwise, it creates a new annotation of type “outputAnnotationType” in the “outputASName” annotation set.

8.6 Semantic Tagger [#]

ANNIE’s semantic tagger is based on the JAPE language – see chapter 7. It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.

8.7 Orthographic Coreference (OrthoMatcher) [#]

(Note: this component was previously known as a “NameMatcher”.)

The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform coreference. It does not find new named entities as such, but it may assign a type to an unclassified proper name, using the type of a matching name.

The matching rules are only invoked if the names being compared are both of the same type, i.e. both already tagged as (say) organisations, or if one of them is classified as ‘unknown’. This prevents a previously classified name from being recategorised.

8.7.1 GATE Interface

Input – entity annotations, with an id attribute.

Output – matches attributes added to the existing entity annotations.

8.7.2 Resources

A lookup table of aliases is used to record non-matching strings which represent the same entity, e.g. “IBM” and “Big Blue”, “Coca-Cola” and “Coke”. There is also a table of spurious matches, i.e. matching strings which do not represent the same entity, e.g. “BT Wireless” and “BT Cellnet” (which are two different organizations). The list of tables to be used is a load time parameter of the orthomatcher: a default list is set but can be changed as necessary.

8.7.3 Processing

The wrapper builds an array of the strings, types and IDs of all name annotations, which is then passed to a string comparison function for pairwise comparisons of all entries.

8.8 Pronominal Coreference [#]

The pronominal coreference module performs anaphora resolution using the JAPE grammar formalism. Note that this module is not automatically loaded with the other ANNIE modules, but can be loaded separately as a Processing Resource. The main module consists of three submodules: the quoted speech submodule, the pleonastic it submodule, and the pronominal resolution submodule.

The first two modules are helper submodules for the pronominal one: they do not perform coreference resolution themselves, but locate quoted fragments and pleonastic it occurrences in the text. They generate temporary annotations which are used by the pronominal submodule (such temporary annotations are removed later).

The main coreference module can operate successfully only if all ANNIE modules were already executed. The module depends on the following annotations created from the respective ANNIE modules:

For each pronoun (anaphor) the coreference module generates an annotation of type “Coreference” containing two features:

8.8.1 Quoted Speech Submodule

The quoted speech submodule identifies quoted fragments in the text being analysed. The identified fragments are used by the pronominal coreference submodule for the proper resolution of pronouns such as I, me, my, etc. which appear in quoted speech fragments. The module produces “Quoted Text” annotations.

The submodule itself is a JAPE transducer which loads a JAPE grammar and builds an FSM over it. The FSM is intended to match the quoted fragments and generate appropriate annotations that will be used later by the pronominal module.

The JAPE grammar consists of only four rules, which create temporary annotations for all punctuation marks that may enclose quoted speech, such as ”, ’, “, etc. These rules then try to identify fragments enclosed by such punctuation. Finally all temporary annotations generated during the processing, except the ones of type “Quoted Text”, are removed (because no other module will need them later).

8.8.2 Pleonastic It submodule

The pleonastic it submodule matches pleonastic occurrences of “it”. Similar to the quoted speech submodule, it is a JAPE transducer operating with a grammar containing patterns that match the most commonly observed pleonastic it constructs.

8.8.3 Pronominal Resolution Submodule

The main functionality of the coreference resolution module is in the pronominal resolution submodule. This uses the result from the execution of the quoted speech and pleonastic it submodules. The module works according to the following algorithm:

8.8.4 Detailed description of the algorithm

Full details of the pronominal coreference algorithm are as follows.

Preprocessing

The preprocessing task includes the following subtasks:

Pronoun resolution

This task includes the following subtasks:

Retrieving all the pronouns in the document. Pronouns are represented as annotations of type “Token” with the feature “category” having value “PRP” or “PRP$”. The former classifies personal and reflexive pronouns, and the latter classifies possessive adjectives such as my, your, etc. The two types of pronouns are combined in one list and sorted according to their offset in the text.

For each pronoun in the list the following actions are performed:

Coreference chain generation

This step is actually performed by the main module. After executing each of the submodules on the current document, the coreference module follows the steps:

The resolution for she, her, her$, he, him, his, herself and himself is similar because the analysis of the corpus showed that these pronouns are related to their antecedents in a similar manner. The characteristics of the resolution process are:

The resolution process performs the following steps:

Resolution of it, its, itself

This set of pronouns also shares many common characteristics. The resolution process differs in certain respects from the one for the previous set of pronouns. Successful resolution for it, its, itself is more difficult because of the following factors:

Resolution of I, me, my, myself

Resolution of these pronouns is dependent on the work of the quoted speech submodule. One important difference from the resolution process of other pronouns is that the context is not measured in sentences but depends solely on the quote span. Another difference is that the context is not contiguous - the quoted fragment itself is excluded from the context, because it is unlikely that an antecedent for I, me, etc. appears there. The context itself consists of:

It is worth noting that contrary to other pronouns, the antecedent for I, me, my and myself is most often cataphoric or if anaphoric it is not in the same sentence with the quoted fragment.

The resolution algorithm consists of the following steps:

8.9 A Walk-Through Example [#]

Let us take an example of a 3-stage procedure using the tokeniser, gazetteer and named-entity grammar. Suppose we wish to recognise the phrase “800,000 US dollars” as an entity of type “Number”, with the feature “money”.

First of all, we give an example of a grammar rule (and corresponding macros) for money, which would recognise this type of pattern.

 
Macro: MILLION_BILLION  
({Token.string == "m"}|  
{Token.string == "million"}|  
{Token.string == "b"}|  
{Token.string == "billion"}  
)  
 
Macro: AMOUNT_NUMBER  
({Token.kind == number}  
(({Token.string == ","}|  
  {Token.string == "."})  
{Token.kind == number})*  
(({SpaceToken.kind == space})?  
 (MILLION_BILLION)?)  
)  
 
Rule: Money1  
// e.g. 30 pounds  
  (  
      (AMOUNT_NUMBER)  
      ({SpaceToken.kind == space})?  
      ({Lookup.majorType == currency_unit})  
  )  
 :money  -->  
  :money.Number = {kind = "money", rule = "Money1"}

8.9.1 Step 1 - Tokenisation

The tokeniser separates this phrase into the following tokens. In general, a word is comprised of any number of letters of either case, including a hyphen, but nothing else; a number is composed of any sequence of digits; punctuation is recognised individually (each character is a separate token), and any number of consecutive spaces and/or control characters are recognised as a single spacetoken.

Token, string = "800", kind = number, length = 3  
Token, string = ",", kind = punctuation, length = 1  
Token, string = "000", kind = number, length = 3  
SpaceToken, string = " ", kind = space, length = 1  
Token, string = "US", kind = word, length = 2, orth = allCaps  
SpaceToken, string = " ", kind = space, length = 1  
Token, string = "dollars", kind = word, length = 7, orth = lowercase

8.9.2 Step 2 - List Lookup

The gazetteer lists are then searched to find all occurrences of matching words in the text. It finds the following match for the string “US dollars”:

Lookup, minorType = post_amount, majorType = currency_unit

8.9.3 Step 3 - Grammar Rules

The grammar rule for money is then invoked. The macro MILLION_BILLION recognises any of the strings “m”, “million”, “b”, “billion”. Since none of these exist in the text, it passes on to the next macro. The AMOUNT_NUMBER macro recognises a number, optionally followed by any number of sequences of the form “dot or comma plus number”, followed by an optional space and an optional MILLION_BILLION. In this case, “800,000” will be recognised. Finally, the rule Money1 is invoked. This recognises the string identified by the AMOUNT_NUMBER macro, followed by an optional space, followed by a unit of currency (as determined by the gazetteer). In this case, “US dollars” has been identified as a currency unit, so the rule Money1 recognises the entire string “800,000 US dollars”. Following the rule, it will be annotated as a Number entity of type Money:

 Number, kind = money, rule = Money1 

Chapter 9
(More CREOLE) Plugins [#]

For the previous reader was none other than myself. I had already read this book long ago.

The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf.

But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, because as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.

Three Stories and a Reflection, Patrick Suskind, 1995 (pp. 82, 86).

This chapter describes additional CREOLE resources which do not form part of ANNIE.

9.1 Document Reset [#]

The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the one containing the document format analysis (Original Markups). An optional parameter, keepOriginalMarkupsAS, allows users to decide whether to keep the Original Markups AS or not while resetting the document. This resource is normally added to the beginning of an application, so that a document is reset before an application is rerun on that document.

9.2 Verb Group Chunker [#]

The rule-based verb chunker is based on a number of grammars of English [Cobuild 99, Azar 89]. We have developed 68 rules for the identification of non-recursive verb groups. The rules cover finite (’is investigating’), non-finite (’to investigate’), participles (’investigated’), and special verb constructs (’is going to investigate’). All the forms may include adverbials and negatives. The rules have been implemented in JAPE. The finite state analyser produces an annotation of type ’VG’ with features and values that encode syntactic information (’type’, ’tense’, ’voice’, ’neg’, etc.). The rules use the output of the POS tagger as well as information about the identity of the tokens (e.g. the token ’might’ is used to identify modals).

The grammar for verb group identification can be loaded as a Jape grammar into the GATE architecture and can be used in any application: the module is domain independent.

9.3 Noun Phrase Chunker [#]

The NP Chunker application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution) which attempts to insert brackets marking noun phrases in text which has been marked with POS tags in the same format as the output of Eric Brill’s transformational tagger. The output from this version should be identical to the output of the original C++/Perl version released by Ramshaw and Marcus.

For more information about baseNP structures and the use of transformation-based learning to derive them, see [Ramshaw & Marcus 95].

9.3.1 Differences from the Original

The major difference is that, if a POS tag is not in the mapping file, it is assumed to be tagged as ’I’. The original version simply failed if an unknown POS tag was encountered. When using the GATE wrapper the chunk tag can be changed from ’I’ to any other legal tag (B or O) by setting the unknownTag parameter.

9.3.2 Using the Chunker

The Chunker requires the Creole plugin “NP_Chunking” to be loaded. The two load-time parameters are simply URLs pointing at the POS tag dictionary and the rules file, which should be set automatically. There are five runtime parameters which should be set prior to executing the chunker.

The chunker requires the following PRs to have been run first: tokeniser, sentence splitter, POS tagger.

9.4 OntoText Gazetteer [#]

The OntoText Gazetteer is a Natural Gazetteer, implemented by the OntoText Lab (http://www.ontotext.com/). Its implementation is based on simple lookup in several java.util.HashMaps, and is inspired by the strange idea of Atanas Kiryakov that searching in HashMaps will be faster than a search in a Finite State Machine (FSM).

Here follows a description of the algorithm that lies behind this implementation:

Every phrase, i.e. every list entry, is separated into several parts, determined by the whitespace between them; e.g. the phrase “form is emptiness” has three parts: form, is and emptiness. There is also a list of HashMaps, mapsList, which has as many elements as the longest phrase in the lists (in terms of “count of parts”). The first part of a phrase is placed in the first map; the first part + space + second part is placed in the second map, and so on. The full phrase is placed in the appropriate map, and a reference to a Lookup object is attached to it.
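
The following is a minimal Java sketch of that data structure, not the actual OntoText implementation; the class and method names are invented for illustration:

import java.util.*;

class PhraseIndexSketch {
    // mapsList.get(i) holds every prefix made of i+1 whitespace-separated parts;
    // only complete phrases carry a reference to their lookup information.
    private final List<Map<String, Object>> mapsList = new ArrayList<Map<String, Object>>();

    void addPhrase(String phrase, Object lookupInfo) {
        String[] parts = phrase.trim().split("\\s+");
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) prefix.append(' ');
            prefix.append(parts[i]);
            if (mapsList.size() <= i) mapsList.add(new HashMap<String, Object>());
            Map<String, Object> map = mapsList.get(i);
            boolean complete = (i == parts.length - 1);
            // keep lookup data for complete phrases, null for intermediate prefixes
            if (complete || !map.containsKey(prefix.toString())) {
                map.put(prefix.toString(), complete ? lookupInfo : null);
            }
        }
    }

    // Returns the lookup information for a candidate phrase of numParts parts,
    // or null if the candidate is not a complete phrase from the lists.
    Object lookup(String candidate, int numParts) {
        if (numParts < 1 || numParts > mapsList.size()) return null;
        return mapsList.get(numParts - 1).get(candidate);
    }
}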

At first sight this algorithm seems much more memory-consuming than a finite state machine (FSM) with the parts of the phrases as transitions, but this matters little in practice since the average length of the phrases in the lists is 1.1 parts. Moreover, although unconventional, on average the algorithm takes four times less memory and works three times faster than an optimised FSM implementation.

The lookup part is implemented in execute(), so a lot of tokenisation takes place there. After identifying the candidate phrase parts, we build a candidate phrase and try to look it up in the maps (which map is used again depends on the number of parts in the current candidate phrase).
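
The following standalone Java sketch illustrates the idea described above; it is an illustration only, not the plugin's actual code. Phrase entries are split on whitespace, prefixes of increasing length go into a list of HashMaps, and lookup of a candidate phrase consults the map whose index corresponds to the candidate's number of parts.

// illustrative sketch of the multi-HashMap lookup idea (not the plugin's actual code)
import java.util.*;

public class MapGazetteerSketch {
  private final List<Map<String, String>> mapsList = new ArrayList<Map<String, String>>();

  // add one list entry, e.g. "form is emptiness" with lookup info "majorType=concept"
  public void addPhrase(String phrase, String lookupInfo) {
    String[] parts = phrase.split("\\s+");
    StringBuilder prefix = new StringBuilder();
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) prefix.append(' ');
      prefix.append(parts[i]);
      while (mapsList.size() <= i) mapsList.add(new HashMap<String, String>());
      // only the full phrase carries the lookup info; shorter prefixes just mark a path
      mapsList.get(i).put(prefix.toString(), i == parts.length - 1 ? lookupInfo : null);
    }
  }

  // look up a candidate phrase in the map chosen by its number of parts
  public String lookup(String candidate) {
    int parts = candidate.split("\\s+").length;
    if (parts > mapsList.size()) return null;
    return mapsList.get(parts - 1).get(candidate);
  }
}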

9.4.1 Prerequisites

The phrases to be recognised should be listed in a set of files, one for each type of occurrence (as for the standard gazetteer).

The gazetteer is built with the information from a file that contains the set of lists (which are files as well) and the associated type for each list. The file defining the set of lists should have the following syntax: each list definition should be written on its own line and should contain the file name of the list, its major type, and optionally a minor type and a language.

The elements of each definition are separated by ”:”. The following is an example of a valid definition:

personmale.lst:person:male:english

Each file named in the lists definition file is just a list containing one entry per line.

When this gazetteer is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes specified in the definition file.

9.4.2 Setup

In order to use this gazetteer from within GATE the following should reside in the creole setup file (creole.xml):

    <RESOURCE>
      <NAME>OntoText Gazetteer</NAME>
      <CLASS>com.ontotext.gate.gazetteer.NaturalGazetteer</CLASS>
      <COMMENT>A list lookup component. For documentation please refer to
        (www.ontotext.com/gate/gazetteer/documentation/index.html). For licence
        information please refer to
        (www.ontotext.com/gate/gazetteer/documentation/licence.ontotext.html)
        or to licence.ontotext.html in the lib folder of GATE</COMMENT>
      <PARAMETER NAME="document" RUNTIME="true"
        COMMENT="The document to be processed">gate.Document</PARAMETER>
      <PARAMETER NAME="annotationSetName" RUNTIME="true" OPTIONAL="true"
        COMMENT="The annotation set to be used for the generated
        annotations">java.lang.String</PARAMETER>
      <PARAMETER NAME="listsURL" DEFAULT="gate:/creole/gazeteer/default/lists.def"
        SUFFIXES="def"
        COMMENT="The URL to the file with list of lists">java.net.URL</PARAMETER>
      <PARAMETER NAME="encoding" DEFAULT="UTF-8"
        COMMENT="The encoding used for reading the definitions">java.lang.String</PARAMETER>
      <PARAMETER NAME="caseSensitive" DEFAULT="true"
        COMMENT="Should this gazetteer differentiate on case. Currently the
        Gazetteer works only in case sensitive mode.">java.lang.Boolean</PARAMETER>
      <ICON>shefGazetteer.gif</ICON>
    </RESOURCE>

9.5 Flexible Gazetteer [#]

The Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer. For example, the user might want to replace words in the text with their base forms (which is an output of the Morphological Analyser) or to segment a Chinese text (using the Chinese Tokeniser) before running the Gazetteer on the Chinese text.

The Flexible Gazetteer performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. It is important to use an external gazetteer as this allows the use of any type of gazetteer (e.g. an Ontological gazetteer).

Input to the Flexible Gazetteer:

Runtime parameters:

Once the external gazetteer has annotated text with Lookup annotations, Lookup annotations on the temporary document are converted to Lookup annotations on the original document. Finally the temporary document is deleted.

9.6 Gazetteer List Collector [#]

The gazetteer list collector collects occurrences of entities directly from a set of annotated training texts, and populates gazetteer lists with the entities. The entity types and structure of the gazetteer lists are defined as necessary by the user. Once the lists have been collected, a semantic grammar can be used to find the same entities in new texts.

An empty list must be created first for each annotation type, if no list exists already. The set of lists must be loaded into GATE before the PR can be run. If a list already exists, the list will simply be augmented with any new entries. The list collector will only collect one occurrence of each entry: it first checks that the entry is not present already before adding a new one.

There are 4 runtime parameters:

Figure 9.1 shows a screenshot of a set of lists collected automatically for the Hindi language. It contains 4 lists: Person, Organisation, Location and a list of stopwords. Each list has a majorType whose value is the type of list, a minorType ”inferred” (since the lists have been inferred from the text), and the language ”Hindi”.




Figure 9.1: Lists collected automatically for Hindi

The list collector also has a facility to split the Person names that it collects into their individual tokens, so that it adds both the entire name and each of its tokens (i.e. the first names and the surname) to the list as separate entries. When the grammar annotates Persons, it can require them to be at least 2 tokens or 2 consecutive Person Lookups. In this way, new Person names can be recognised by combining a known first name with a known surname, even if they were not in the training corpus. Where only a single matching token is found, an Unknown entity is generated, which can later be matched with an existing longer name via the orthomatcher component, which performs orthographic coreference between named entities. This same procedure can also be used for other entity types. For example, parts of Organisation names can be combined together in different ways. The facility for splitting Person names is hardcoded in the file gate/src/gate/creole/GazetteerListsCollector.java and is commented.

9.7 Tree Tagger [#]

The TreeTagger is a language-independent part-of-speech tagger, which currently supports English, French, German, Spanish, Italian and Bulgarian (although the latter two are not available in GATE). It is integrated with GATE using a GATE CREOLE wrapper, originally designed by the CLaC lab (Computational Linguistics at Concordia), Concordia University, Montreal (http://www.cs.concordia.ca/research/researchgroups/clac.php).

The GATE wrapper calls TreeTagger as an external program, passing GATE Tokens as input and adding two new features to them, as described below:

The TreeTagger plugin works on any platform that supports the tree tagger tool, including Linux, Mac OS X and Windows, but the GATE-specific scripts require a POSIX-style Bourne shell with the gawk, tr and grep commands, plus Perl for the Spanish tagger. For Windows this means that you will need to install the appropriate parts of the Cygwin environment from http://www.cygwin.com and set the system property treetagger.sh.path to contain the path to your sh.exe (typically C:\cygwin\bin\sh.exe). If this property is set, the TreeTagger plugin runs the shell given in the property and passes the tagger script as its first argument; without the property, the plugin will attempt to run the shell script directly, which fails on Windows with a cryptic “error=193”. For the GATE GUI, put the following line in build.properties (see section 3.3, and note the extra backslash before each backslash and colon in the path):

run.treetagger.sh.path: C\:\\cygwin\\bin\\sh.exe

Figure 9.2 shows a screenshot of a French document processed with the TreeTagger.



Figure 9.2: a TreeTagger processed document


9.7.1 POS tags

For English the POS tagset is a slightly modified version of the Penn Treebank tagset, where the second letter of the tags for verbs distinguishes between ”be” verbs (B), ”have” verbs (H) and other verbs (V).

The tagsets for French, German, Italian, Spanish and Bulgarian can be found in the original TreeTagger documentation at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html.

9.8 Stemmer [#]

The stemmer plugin consists of a set of stemmer PRs for the following 12 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. These take the form of wrappers for the Snowball stemmers freely available from http://snowball.tartarus.org. Each Token is annotated with a new feature ”stem”, with the stem for that word as its value. The stemmers should be run as other PRs, on a document that has been tokenised.

There are three runtime parameters which should be set prior to executing the stemmer on a document.

9.8.1 Algorithms

The stemmers are based on the Porter stemmer for English [Porter 80], with rules implemented in Snowball, e.g.:

define Step_1a as (
    [substring] among (
        ’sses’ (<-’ss’)
        ’ies’  (<-’i’)
        ’ss’   ()
        ’s’    (delete)
    )
)
 

9.9 GATE Morphological Analyzer [#]

The Morphological Analyser PR can be found in the Tools plugin. It takes as input a tokenised GATE document. Considering one token and its part of speech tag at a time, it identifies its lemma and an affix. These values are then added as features on the Token annotation. Morpher is based on certain regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE1 in a programming language called Flex. Morpher has the capability to interpret these rules, with an extension allowing users to add new rules or modify the existing ones based on their requirements. In order to allow these operations with as little effort as possible, we changed the way these rules are written. More information on how to write these rules is explained later in Section 9.9.1.
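
The output can be read back programmatically, as in the small sketch below. It assumes doc has already been processed by the Morphological Analyser and that the lemma and affix are stored in Token features named ”root” and ”affix”; these feature names are assumptions, so check the Token features in the GUI if your version differs.

// minimal sketch: read the lemma and affix added by the Morphological Analyser
for (Annotation token : doc.getAnnotations().get("Token")) {
  FeatureMap f = token.getFeatures();
  System.out.println(f.get("string") + " -> lemma=" + f.get("root")
      + ", affix=" + f.get("affix"));   // "root" and "affix" are assumed feature names
}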

Two types of parameters, Init-time and run-time, are required to instantiate and execute the PR.

9.9.1 Rule File [#]

GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections.

  1. Variables
  2. Rules

Variables

The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:

  1. Range With this type of variable, the user can specify a range of characters, e.g. A ==> [-a-z0-9]
  2. Set With this type of variable, the user can specify a set of characters, where one character at a time from this set is used as a value for the given variable. When this variable is used in any regular expression, all values are tried one by one to generate the string which is compared with the contents of the document, e.g. A ==> [abcdqurs09123]
  3. Strings Whereas in the two types explained above a variable can hold only one character from the given set or range at a time, this type allows specifying strings as possibilities for the variable, e.g. A ==> ”bb” OR ”cc” OR ”dd”

Rules

All rules are declared under the section defineRules. Every rule has two parts, LHS and RHS. The LHS specifies the regular expression and the RHS the function to be called when the LHS matches the given word. ”==>” is used as the delimiter between the LHS and RHS.

The LHS has the following syntax:

< ”*” | ”verb” | ”noun” >< regular expression >.

The user can specify which rule should be considered when the word is identified as a ”verb” or a ”noun”. ”*” indicates that the rule should be considered for all part-of-speech tags. Whether the part-of-speech tag is used to decide if a rule should be considered can be enabled or disabled by setting the value of the considerPOSTags option. Any string can be combined with any of the variables declared under the defineVars section, as well as the Kleene operators ”+” and ”*”, to generate the regular expressions. Below we give a few examples of LHS expressions.

On the RHS of the rule, the user has to specify one of the functions from those listed below. These functions are hard-coded in the Morph PR in GATE and are invoked if the regular expression on the LHS matches any particular word.

9.10 MiniPar Parser [#]

MiniPar is a shallow parser. In its shipped version, it takes one sentence as input and determines the dependency relationships between the words of the sentence. It parses the sentence and extracts information such as:

In the version of MiniPar integrated in GATE, it generates annotations of type “DepTreeNode” and annotations of type “[relation]” that exist between the head and the child node. The document is required to have annotations of type “Sentence”, where each annotation consists of a string of the sentence.

MiniPar takes one sentence at a time as input and generates tokens of type “DepTreeNode”. Later it assigns relations between these tokens. Each DepTreeNode has a feature called “word”: this is the actual text of the word.
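
For example, the dependency tree nodes can be listed with a few lines of GATE Embedded code; the sketch below assumes doc is a gate.Document already processed by the MiniPar wrapper.

// minimal sketch: list the dependency tree nodes produced by MiniPar
for (Annotation node : doc.getAnnotations().get("DepTreeNode")) {
  System.out.println(node.getId() + ": " + node.getFeatures().get("word"));
}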

MiniPar also generates annotations of type “[Rel]”, where ‘Rel’ is obj, pred, etc.: the name of the dependency relationship between the child word and the head word (see Section 9.10.5). Every “[Rel]” annotation is assigned four features:

Figure 9.3 shows a MiniPar annotated document in GATE.



Figure 9.3: a MiniPar annotated document


9.10.1 Platform Supported

MiniPar in GATE is supported on the Linux and Windows operating systems. Trying to instantiate this PR on any other OS will generate a ResourceInstantiationException.

9.10.2 Resources

MiniPar in GATE is shipped with four basic resources:

9.10.3 Parameters

The MiniPar wrapper takes six parameters:

9.10.4 Prerequisites

The MiniPar wrapper requires the MiniPar library to be available on the underlying Linux/Windows machine. It can be downloaded from the MiniPar homepage.

9.10.5 Grammatical Relationships [#]

appo    "ACME president, --appo-> P.W. Buckman"  
aux "should <-aux-- resign"  
be  "is <-be-- sleeping"  
c   "that <-c-- John loves Mary"  
comp1   first complement  
det "the <-det-- hat"  
gen "Jane’s <-gen-- uncle"  
i   the relationship between a C clause and its I clause  
inv-aux     inverted auxiliary: "Will <-inv-aux-- you stop it?"  
inv-be      inverted be: "Is <-inv-be-- she sleeping"  
inv-have    inverted have: "Have <-inv-have-- you slept"  
mod the relationship between a word and its adjunct modifier  
pnmod       post nominal modifier  
p-spec      specifier of prepositional phrases  
pcomp-c     clausal complement of prepositions  
pcomp-n     nominal complement of prepositions  
post        post determiner  
pre         pre determiner  
pred        predicate of a  clause  
rel         relative clause  
vrel        passive verb modifier of nouns  
wha, whn, whp:  wh-elements at C-spec positions  
obj         object of verbs  
obj2    second object of ditransitive verbs  
subj    subject of verbs  
s   surface subject

9.11 RASP Parser [#]

RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English, developed by the Natural Language and Computational Linguistics group at the University of Sussex.

This plugin, developed by DigitalPebble, provides four wrapper PRs that call the RASP modules as external programs, as well as a JAPE component that translates the output of the ANNIE POS Tagger (section 8.5).

RASP2 Tokenizer
This PR requires Sentence annotations and creates Token annotations with a string feature. Note that sentence-splitting must be carried out before tokenization; the RegEx Sentence Splitter (see section 8.4) is suitable for this. (Alternatively, you can use the ANNIE Tokenizer (section 8.1) and then the ANNIE Sentence Splitter (section 8.3); their output is compatible with the other PRs in this plugin).
RASP2 POS Tagger
This requires Token annotations and creates WordForm annotations with pos, probability, and string features.
RASP2 Morphological Analyser
This requires WordForm annotations (from the POS Tagger) and adds lemma and suffix features.
RASP2 Parser
This requires the preceding annotation types and creates multiple Dependency annotations to represent a parse of each sentence.
RASP POS Converter
This PR requires Token annotations with a category feature as produced by the ANNIE POS Tagger (see section 8.5) and creates WordForm annotations in the RASP format. The ANNIE POS Tagger and this Converter can together be used as a substitute for the RASP2 POS Tagger.

Here are some examples of corpus pipelines that can be correctly constructed with these PRs.

  1. RegEx Sentence Splitter
  2. RASP2 Tokenizer
  3. RASP2 POS Tagger
  4. RASP2 Morphological Analyser
  5. RASP2 Parser

  1. RegEx Sentence Splitter
  2. RASP2 Tokenizer
  3. ANNIE POS Tagger
  4. RASP POS Converter
  5. RASP2 Morphological Analyser
  6. RASP2 Parser

  1. ANNIE Tokenizer
  2. ANNIE Sentence Splitter
  3. RASP2 POS Tagger
  4. RASP2 Morphological Analyser
  5. RASP2 Parser

  1. ANNIE Tokenizer
  2. ANNIE Sentence Splitter
  3. ANNIE POS Tagger
  4. RASP POS Converter
  5. RASP2 Morphological Analyser
  6. RASP2 Parser

Further documentation is included in the directory gate/plugins/rasp/doc/.

The RASP package, which provides the external programs, is available from the RASP web page.

RASP is only supported on Linux. Trying to run it on any other operating system will generate an exception with the message: “The RASP cannot be run on any other operating systems except Linux.”

It must be correctly installed on the same machine as GATE, and must be installed in a directory whose path does not contain any spaces (this is a requirement of the RASP scripts as well as the wrapper). Before trying to run the scripts for the first time, edit rasp.sh and rasp_parse.sh to set the correct value for the shell variable RASP, which should be the file system pathname where you have installed the RASP tools (for example, RASP=/opt/RASP or RASP=/usr/local/RASP). You will need to enter the same path for the initialization parameter raspHome for the POS Tagger, Morphological Analyser, and Parser PRs.

(On some systems the arch command used in the scripts is not available; a work-around is to comment that line out and add arch=’ix86_linux’, for example.)

(The previous version of the RASP plugin can now be found in plugins/Obsolete/rasp.)

9.12 SUPPLE Parser (formerly BuChart) [#]

The BuChart parser has been removed and replaced by SUPPLE: The Sheffield University Prolog Parser for Language Engineering. If you have an application which uses BuChart and wish to upgrade to a later version of GATE than 3.1 you must upgrade your application to use SUPPLE.

SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the ’best’ parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g. lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE Wrapper stores syntactic information produced by the parser in the GATE document in the form of parse annotations containing a bracketed representation of the parse, and semantics annotations that contain the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see section 9.12.4).

9.12.1 Requirements

The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licenced SICStus prolog (http://www.sics.se/sicstus, SUPPLE supports versions 3 and 4), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.

9.12.2 Building SUPPLE

The SUPPLE plugin must be compiled before it can be used, so you will require a suitable Java SDK (GATE itself requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the SUPPLE directory under plugins, and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the SUPPLE directory and running:

../../bin/ant swi

(on Windows, use ..\..\bin\ant). For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.

9.12.3 Running the parser in GATE

In order to parse a document you will need to construct an application that has:

Note that prior to GATE 3.1, the parser file parameter was of type java.io.File. From 3.1 it is of type java.net.URL. If you have a saved application (.gapp file) from before GATE 3.1 which includes SUPPLE it will need to be updated to work with the new version. Instructions on how to do this can be found in the README file in the SUPPLE plugin directory.

9.12.4 Viewing the parse tree [#]

GATE provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded, then open a document that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select “Edit with syntax tree viewer”. This viewer can also be used with the constituency output of the Stanford Parser PR (section 9.13).

9.12.5 System properties [#]

The SICStusProlog (3 and 4) and SWIProlog implementations work by calling the native prolog executable, passing data back and forth in temporary files. The location of the prolog executable is specified by a system property:

If your prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI Prolog is typically installed as pl; most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.

For details of how to pass system properties to the GATE GUI, see the end of section 3.3.

9.12.6 Configuration files [#]

Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file.

Mapping file

The mapping file specifies how annotations produced by GATE are to be passed to the parser. The file is composed of a number of pairs of lines: the first line in a pair specifies a GATE annotation to be passed to the parser, including the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType; the second line of the pair specifies how to encode the GATE annotation as a SUPPLE syntactic category, and also includes a number of features and values. As an example consider the mapping:

Gate;AnnotationType=Token;category=DT;string=&S  
SUPPLE;category=dt;m_root=&S;s_form=&S

It specifies how a determiner (’DT’) will be translated into a category ’dt’ for the parser. The construct ’&S’ is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically, a token like ’The’, recognised as a DT by the POS tagger, will be mapped into the following category:

dt(s_form:’The’,m_root:’The’,m_affix:’_’,text:’_’).

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S  
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

It specifies that an annotation of type ’Lookup’ in GATE is mapped into a category ’list_np’ with specific features and values. More specifically, a token like ’Mary’ identified in GATE as a Lookup will be mapped into the following SUPPLE category:

list_np(s_form:’Mary’,m_root:’_’,m_affix:’_’,  
text:’_’,ne_tag:’person’,ne_type:’person_first’,gender:’female’).

Feature table [#]

The feature table file specifies SUPPLE ’lexical’ categories and their features. As an example, an entry in this file is:

n;s_form;m_root;m_affix;text;person;number

which specifies which features, and in which order, a noun category should be written with. In this case:

n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:....).

9.12.7 Parser and Grammar [#]

The parser builds a semantic representation compositionally, and a ‘best parse’ algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature valued grammar. Each Category entry has the form:

Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type (see Section  6.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.

Syntactic rules are specified in Prolog with the predicate rule(LHS,RHS), where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD ⇒ N (“a basic noun phrase head is composed of a noun”) is written as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N),  
[n(m_root:R,number:N)]).

where the feature ’sem’ is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.

The full grammar of this distribution can be found in the prolog/grammar directory; the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built and the compiled version is used for parsing.

9.12.8 Mapping Named Entities

SUPPLE has a Prolog grammar which deals with named entities; the only information required is the Lookup annotations produced by GATE, which are specified in the mapping file. However, you may want to pass named entities identified with your own JAPE grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between GATE named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:

Gate;AnnotationType=Date;string=&S  
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity ’Date’ into a syntactic category ’sem_cat’. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule for example:

rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),[  
sem_cat(s_form:F,text:TEXT,type:’Date’,kind:KIND,name:NAME)]).

is used to parse a ’Date’ into a named entity in SUPPLE which in turn will be parsed into a noun phrase.

9.12.9 Upgrading from BuChart to SUPPLE

In theory upgrading from BuChart to SUPPLE should be relatively straightforward. Basically any instance of BuChart needs to be replaced by SUPPLE. Specific changes which must be made are:

Making these changes to existing code should be trivial and will allow applications to benefit from future improvements to SUPPLE.

9.13 Stanford Parser [#]

The Stanford Parser is a probabilistic parsing system implemented in Java by Stanford University’s Natural Language Processing Group. Data files are available from Stanford for parsing Arabic, Chinese, English, and German.

This plugin, developed by the GATE team, provides a PR (gate.stanford.Parser) that acts as a wrapper around the Stanford Parser (version 1.6.1) and translates GATE annotations to and from the data structures of the parser itself. The plugin is supplied with the unmodified jar file and one English data file obtained from Stanford. Stanford’s software itself is subject to the full GPL.

The parser itself can be trained on other corpora and languages, as documented on the website, but this plugin does not provide a means of doing so. Trained data files are not compatible between different versions of the parser; in particular, note that you need version 1.6.1 data files for GATE builds numbered above 3120 (when we upgraded the plugin to Stanford version 1.6.1 on 22 January 2009) but version 1.6 files for earlier versions, including Release 5.0 beta 1.

Creating multiple instances of this PR in the same JVM with different trained data files does not work—the PRs can be instantiated, but runtime errors will almost certainly occur.

9.13.1 Input requirements

Documents to be processed by the Parser PR must already have Sentence and Token annotations, such as those produced by either of the sentence splitters (sections 8.3 and 8.4) and the ANNIE English Tokeniser (section 8.1).

If the reusePosTags parameter is true, then the Token annotations must have category features with compatible POS tags. The tags produced by the ANNIE POS Tagger are compatible with Stanford’s parser data files for English (which also use the Penn treebank tagset).

9.13.2 Initialization parameters

parserFile
the path to the trained data file; the default value points to the English data file included with the GATE distribution. You can also use other files downloaded from the Stanford Parser website or produced by training the parser.
mappingFile
the optional path to a mapping file: a flat, two-column file which the wrapper can use to “translate” tags. A sample file is included. By default this value is null and mapping is ignored.
tlppClass
an implementation of TreebankLangParserParams, used by the parser itself to extract the dependency relations from the constituency structures. The default value is compatible with the English data file supplied. Please refer to the Stanford NLP Group’s documentation and the parser’s javadoc for a further explanation.

9.13.3 Runtime parameters

annotationSetName
the name of the annotationSet used for input (Token and Sentence annotations) and output (SyntaxTreeNode and Dependency annotations, and category and dependencies features added to Tokens).
debug
a boolean value which controls the verbosity of the wrapper’s output.
reusePosTags
if true, the wrapper will read category features (produced by an earlier POS-tagging PR) from the Token annotations and force the parser to use them.
useMapping
if this is true and a mapping file was loaded when the PR was initialized, the POS and syntactic tags produced by the parser will be translated using that file. If no mapping file was loaded, this parameter is ignored.

The following boolean parameters switch on and off the various types of output that the parser can produce. Any or all of them can be true, but if all are false the PR will simply print a warning instead of running the parser (to save time).

addPosTags
if this is true, the wrapper will add category features to the Token annotations.
addConstituentAnnotations
if true, the wrapper will mark the syntactic constituents with SyntaxTreeNode annotations that are compatible with the Syntax Tree Viewer (see section 9.12.4).
addDependencyAnnotations
if true, the wrapper will add Dependency annotations to indicate the dependency relations in the sentence.
addDependencyFeatures
if true, the wrapper will add dependencies features to the Token annotations to indicate the dependency relations in the sentence.

The parser will derive the dependency structures only if either or both of the dependency output options is enabled, so if you do not need the dependency analysis, you can disable both of them and the PR will run faster.
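
A minimal GATE Embedded sketch of creating and configuring this PR is shown below. The class name gate.stanford.Parser and the parameter names are taken from this section; the plugin-loading and Factory calls are the usual embedding idiom, shown here under the assumption that the plugin lives in the standard plugins/Stanford directory.

// minimal sketch: load the Stanford plugin and create the parser PR
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Stanford").toURI().toURL());
FeatureMap params = Factory.newFeatureMap();   // init params; defaults use the English data file
ProcessingResource parser = (ProcessingResource) Factory.createResource(
    "gate.stanford.Parser", params);
// runtime parameters
parser.setParameterValue("reusePosTags", Boolean.TRUE);              // re-use existing category features
parser.setParameterValue("addConstituentAnnotations", Boolean.TRUE); // SyntaxTreeNode output
parser.setParameterValue("addDependencyAnnotations", Boolean.TRUE);  // Dependency output
// add the PR to a corpus pipeline after the tokeniser, sentence splitter and POS tagger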

Two sample GATE applications for English are included in the plugins/Stanford directory: sample_parser_en.gapp runs the Regex Sentence Splitter and ANNIE Tokenizer and then this PR to annotate constituency and dependency structures, whereas sample_pos+parser_en.gapp also runs the ANNIE POS Tagger and makes the parser re-use its POS tags.

9.14 Montreal Transducer [#]

Many of the key features introduced in the Montreal Transducer (MT) have now been ported in some form into the standard JAPE transducer. If you are considering using the MT, you should first check the documentation for the standard transducer in chapter 7 to see if that is suitable for your needs. Being such a core part of GATE, the standard JAPE transducer is likely to be more stable and bugs will be fixed more rapidly than with the MT.

The Montreal Transducer is an improved Jape Transducer, developed by Luc Plamondon, Université de Montréal. It is intended to make grammar authoring easier by providing a more flexible version of the JAPE language, and it also fixes a few bugs. Full details of the transducer can be found at http://www.iro.umontreal.ca/~plamondl/mtltransducer/. We summarise the main features below.

9.14.1 Main Improvements

9.14.2 Main Bug fixes

({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})*:names

9.15 Language Plugins [#]

There are plugins available for processing the following languages: French, German, Spanish, Italian, Chinese, Arabic, Romanian, Hindi and Cebuano. Some of the applications are quite basic and just contain some useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are more like toy systems built as part of an exercise in language portability.

Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager, and the relevant resources will be available in the Processing Resources set.

Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.

In most cases you will also find a directory in the relevant plugin directory called data which contains some sample texts (in some cases, these are annotated with NEs).

9.15.1 French Plugin [#]

The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp), and one which does not (french.gapp). Simply load the required application from the plugins/french directory. You do not need to load the plugin itself from the plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 9.7 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. Both applications contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results; they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/french/data directory.

9.15.2 German Plugin [#]

The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp), and one which does not (german.gapp). Simply load the required application from the plugins/german/resources directory. You do not need to load the plugin itself from the plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 9.7 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. Both applications contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/german/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.

9.15.3 Romanian Plugin [#]

The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/romanian/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.

9.15.4 Arabic Plugin [#]

The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/arabic/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that there are two types of gazetteer used in this application: one which was derived automatically from training data (Arabic inferred gazetteer), and one which was created manually. Note that there are some other applications included which perform quite specific tasks (but can generally be ignored). For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 9.6.

9.15.5 Chinese Plugin [#]

The Chinese plugin contains a simple application for Chinese NE recognition (chinese.gapp). Simply load the application from the plugins/chinese/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other applications to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).

9.15.6 Hindi Plugin [#]

The Hindi plugin contains a set of resources for basic Hindi NE recognition which mirror the ANNIE resources but are customised to the Hindi language. You need to have the ANNIE plugin loaded first in order to load any of these PRs. With the Hindi plugin, you can create an application similar to ANNIE but replacing the ANNIE PRs with the default PRs from the plugin.

9.16 Chemistry Tagger [#]

This GATE module is designed to tag a number of chemistry items in running text. Currently the tagger tags compound formulas (e.g. SO2, H2O, H2SO4 ...), ions (e.g. Fe3+, Cl-), and element names and symbols (e.g. Sodium and Na). Limited support for compound names is also provided (e.g. sulphur dioxide), but only when followed by a compound formula (in parentheses or commas).

9.16.1 Using the tagger

The Tagger requires the Creole plugin ”Chemistry_Tagger” to be loaded. It requires the following PRs to have been run first: tokeniser and sentence splitter. There are four init parameters giving the locations of the two gazetteer list definitions, the element mapping file and the JAPE grammar used by the tagger (in previous versions of the tagger these files were fixed and loaded from inside the ChemTagger.jar file). Unless you know what you are doing you should accept the default values.

The annotations added to documents are ”ChemicalCompound”, ”ChemicalIon” and ”ChemicalElement” (currently they are always placed in the default annotation set). By default ”ChemicalElement” annotations are removed if they make up part of a larger compound or ion annotation. This behaviour can be changed by setting the removeElements parameter to false so that all recognised chemical elements are annotated.

9.17 Flexible Exporter [#]

The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be found, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.

At load time, the following parameters can be set for the flexible exporter:

The following runtime parameters can also be set (after the file has been selected for the application):

9.18 Annotation Set Transfer [#]

The Annotation Set Transfer allows copying or moving annotations to a new annotation set if they lie between the beginning and the end of an annotation of a particular type (the covering annotation). For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.

The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:

For example, suppose we wish to perform named entity recognition on only the text covered by the BODY annotation from the Original Markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document, because these resources do not depend on any other annotations and we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokeniser and gazetteer are initially placed in the Default annotation set).

9.19 Information Retrieval in GATE [#]

GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for ”Bush” will return documents with higher relevance, compared to a search in the content for the string ”bush”. The current implementation is based on the most popular open source full-text search engine - Lucene (available at http://jakarta.apache.org/lucene/) but other implementations may be added in the future.

An Information Retrieval system is most often considered a system that accepts as input a set of documents (corpus) and a query (a combination of search terms) and returns as output only those documents from the corpus which are considered relevant according to the query. Usually, in addition to the documents, a relevance measure (score) is returned for each document. There exist many relevance metrics, but usually documents which are considered more relevant according to the query are scored higher.

Figure 9.4 shows the results from running a query against an indexed corpus in GATE.




Figure 9.4: Documents with scores, returned from a search over a corpus


Information Retrieval systems usually perform some preprocessing on the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as:








       Term1   Term2   ...   Termk
Doc1   w1,1    w1,2    ...   w1,k
Doc2   w2,1    w2,2    ...   w2,k
...    ...     ...     ...   ...
Docn   wn,1    wn,2    ...   wn,k

Table 9.1: A document-term matrix

where doci is a document from the corpus, termj is a word that is considered important and representative for the document, and wi,j is the weight assigned to the term in the document. There are many ways to define the term weight function, but most often it depends on the term frequency in the document and in the whole corpus (i.e. the local and the global frequency). Note that the machine learning plugin described in Chapter 11 can produce such a document-term matrix (for a detailed description of the matrix produced, see Section 11.5.4).
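
For reference, one widely used weighting scheme that combines the local and global frequencies is tf·idf; the formula below is given as general background, not as a description of the Lucene-based implementation used here:

w_{i,j} = tf_{i,j} \cdot \log ( N / df_j )

where tf_{i,j} is the frequency of term j in document i, df_j is the number of documents in the corpus that contain term j, and N is the total number of documents in the corpus.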

Note that not all of the words appearing in the document are considered terms. There are many words (called ”stop words”) which are ignored, since they occur too often and are not representative enough. Such words are articles, conjunctions, etc. During the preprocessing phase that identifies such words, a form of stemming is usually also performed in order to minimize the number of terms and to improve the retrieval recall: various forms of the same word (e.g. ”play”, ”playing” and ”played”) are then considered identical and counted as occurrences of the same term (probably ”play”).

It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.

IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 13.4 for more details).

9.19.1 Using the IR functionality in GATE

In order to run queries against a corpus, the latter should be ”indexed”. The indexing process first processes the documents in order to identify the terms and their weights (stemming is performed too) and then creates the proper structures on the local filesystem. These file structures contain indexes that will be used by Lucene (the underlying IR engine) for the retrieval.

Once the corpus is indexed, queries may be run against it. The index may subsequently be removed, in which case the structures on the local filesystem are removed too. Once the index is removed, queries can no longer be run against the corpus.

Indexing the corpus

In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in an RDBMS datastore (though support for the latter may be added in the future).

To index the corpus, follow these steps:




Figure 9.5: Indexing a corpus by specifying the index location and indexed features (and content)


Querying the corpus

To query the corpus, follow these steps:

Removing the index

An index for a corpus may be removed at any time from the ”Remove Index” option of the context menu for the indexed corpus (right button click).

9.19.2 Using the IR API

The IR API within GATE makes it possible for corpora to be indexed, queried and results returned from any Java application, without using the GATE GUI. The following sample indexes a corpus, runs a query against it and then removes the index.

 
// the GATE IR classes used below
import java.net.URL;
import java.util.Iterator;
import gate.*;
import gate.creole.ir.*;
import gate.creole.ir.lucene.*;
import gate.persist.SerialDataStore;

// open a serial data store
SerialDataStore sds = (SerialDataStore)
    Factory.openDataStore("gate.persist.SerialDataStore", "/tmp/datastore1");
sds.open();

//set an AUTHOR feature for the test document
Document doc0 = Factory.newDocument(new URL("file:/tmp/documents/doc0.html"));
doc0.getFeatures().put("author", "John Smith");

Corpus corp0 = Factory.newCorpus("TestCorpus");
corp0.add(doc0);

//store the corpus in the serial datastore
Corpus serialCorpus = (Corpus) sds.adopt(corp0, null);
sds.sync(serialCorpus);

//index the corpus - the content and the AUTHOR feature
IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;

DefaultIndexDefinition did = new DefaultIndexDefinition();
did.setIrEngineClassName(gate.creole.ir.lucene.LuceneIREngine.class.getName());
did.setIndexLocation("/tmp/index1");
did.addIndexField(new IndexField("content", new DocumentContentReader(), false));
did.addIndexField(new IndexField("author", null, false));
indexedCorpus.setIndexDefinition(did);

indexedCorpus.getIndexManager().createIndex();
//the corpus is now indexed

//search the corpus
Search search = new LuceneSearch();
search.setCorpus(indexedCorpus);

QueryResultList res = search.search("+content:government +author:John");

//get the results
Iterator it = res.getQueryResults();
while (it.hasNext()) {
  QueryResult qr = (QueryResult) it.next();
  System.out.println("DOCUMENT_ID=" + qr.getDocumentID() + ", score=" + qr.getScore());
}
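
To finish the example by removing the index, as mentioned above, something along the following lines should work; deleteIndex() is assumed here to be the relevant method of the index manager, so check the gate.creole.ir JavaDoc for the exact name in your version.

//remove the index once it is no longer needed
indexedCorpus.getIndexManager().deleteIndex();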

9.20 Crawler [#]

The crawler plugin enables GATE to build a corpus from a web crawl. The crawler itself is Websphinx, a Java-based multi-threaded web crawler that can be customised for any application. In order to use this plugin you may need to add the websphinx.jar file to the required libraries (e.g. in JBuilder).

The basic idea is to be able to specify a source URL and a depth to build the initial corpus upon which further processing could be done. The PR itself provides a number of helpful features to set various parameters of the crawl.

9.20.1 Using the Crawler PR

In order to use the processing resource you first need to load the plugin using the plugin manager, then load the crawler from the list of processing resources. You also need to create a corpus in which to store the crawled documents. In order to use the crawler, create a simple pipeline (note: do not create a corpus pipeline) and add the crawl PR to the pipeline.

Once the crawl PR is created, there are a number of parameters that can be set to configure the crawl (see also Figure 9.6).




Figure 9.6: Crawler parameters


Once the parameters are set, the crawl can be run and the documents fetched are added to the specified corpus. Figure 9.7 shows the crawled pages added to the corpus.




Figure 9.7: Crawled pages added to the corpus


9.21 Google Plugin [#]

The Google API is now integrated with GATE, and can be used as a PR-based plugin. This plugin allows the user to query Google and build the document corpus that contains the search results returned by Google for the query. There is a limit of 1000 queries per day as set by Google. For more information about the Google API please refer to http://www.google.com/apis/. In order to use the Google PR, you need to register with Google to obtain a license key.

The Google PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find out which named entities can be associated with a particular individual. In this example, the user could build the collection of documents by querying Google and then running ANNIE over the collection. This would annotate the results and show which Organization, Location and other entities can be associated with the query.

9.21.1 Using the GooglePR

In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating a new instance of the PR. Here you need to specify the Google API License key; please use the license key assigned to you when registering with Google.

Once the Google PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Google PR just initialized as above. There are a number of parameters to be set at runtime:

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus as shown in Figure 9.8.




Figure 9.8: URLs added to the corpus


9.22 Yahoo Plugin [#]

The Yahoo API is now integrated with GATE, and can be used as a PR-based plugin. This plugin allows the user to query Yahoo and build the document corpus that contains the search results returned by Yahoo for the query. For more information about the Yahoo API please refer to http://developer.yahoo.com/search/. In order to use the Yahoo PR, you need to obtain an application ID.

The Yahoo PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find out which named entities can be associated with a particular individual. In this example, the user could build the collection of documents by querying Yahoo and then running ANNIE over the collection. This would annotate the results and show which Organization, Location and other entities can be associated with the query.

9.22.1 Using the YahooPR

In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating a new instance of the PR. Here you need to specify the Yahoo Application ID; please use the application ID assigned to you when registering with Yahoo.

Once the Yahoo PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Yahoo PR just initialized as above. There are a number of parameters to be set at runtime:

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus.

9.23 WordNet in GATE [#]




Figure 9.9: WordNet in GATE – results for “bank”





Figure 9.10: WordNet in GATE


At present GATE supports only WordNet 1.6, so in order to use WordNet in GATE, you must first install WordNet 1.6 on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing one special xml file that is used internally by JWNL. This file describes the location of your local copy of the WordNet 1.6 index files. An example of this wn-config.xml file is shown below:

 
<?xml version="1.0" encoding="UTF-8"?>  
 
<jwnl_properties language="en">  
        <version publisher="Princeton" number="1.6" language="en"/>  
        <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">  
                <param name="morphological_processor" value="net.didion.jwnl.dictionary.DefaultMorphologicalProcessor"/>  
                <param name="file_manager" value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">  
                        <param name="file_type" value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>  
                        <param name="dictionary_path" value="e:\wn16\dict"/>  
                </param>  
        </dictionary>  
        <dictionary_element_factory class="net.didion.jwnl.princeton.data.PrincetonWN16DictionaryElementFactory"/>  
        <resource class="PrincetonResource"/>  
</jwnl_properties>  

All you have to do is to replace the value of the dictionary_path parameter to point to your local installation of WordNet 1.6.
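
For example, on a Linux machine where WordNet 1.6 happens to be installed under /usr/local/wordnet1.6 (a purely illustrative path), the line would read:

<param name="dictionary_path" value="/usr/local/wordnet1.6/dict"/>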

After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE, load the WordNet plugin via the plugins menu. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).

Once WordNet is loaded in GATE, the well-known WordNet interface will appear. You can search WordNet by typing a word in the box next to the label “SearchWord” and then pressing “Search”. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word “play”, the buttons “Noun”, “Verb” and “Adjective” are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms and meronyms for nouns, or verb groups, cause and so on for verbs. Selecting an item from the menu will display the results in the window below.

More information about WordNet can be found at http://www.cogsci.princeton.edu/wn/index.shtml

More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet

An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/GateExamples/doc/index.html

9.23.1 The WordNet API

GATE offers a set of classes that can be used to access the WordNet 1.6 Lexical Base. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 9.11. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:


PIC

Figure 9.11: The Wordnet API
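
For embedded use, the sketch below shows one way the WordNet language resource might be created from code, pointing it at the JWNL configuration file shown earlier. This is a sketch only: the implementation class name (gate.wordnet.IndexFileWordNetImpl), the plugin directory name (WordNet) and the parameter name (propertyUrl) are assumptions to be verified against the plugin’s creole.xml and the JavaDoc; the lookup methods themselves are documented in the JavaDoc and are not reproduced here.

import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.LanguageResource;

public class WordNetExample {
  public static void main(String[] args) throws Exception {
    Gate.init();
    // load the WordNet plugin (directory name assumed)
    Gate.getCreoleRegister().registerDirectories(
        new java.io.File(Gate.getPluginsHome(), "WordNet").toURI().toURL());

    // point the LR at the JWNL configuration file (wn-config.xml)
    FeatureMap params = Factory.newFeatureMap();
    params.put("propertyUrl",
        new java.io.File("wn-config.xml").toURI().toURL());

    // class and parameter names assumed; check the plugin's creole.xml
    LanguageResource wordnet = (LanguageResource) Factory.createResource(
        "gate.wordnet.IndexFileWordNetImpl", params);

    // sense lookups are performed through the gate.wordnet.* interfaces
    // described in the JavaDoc
    System.out.println("WordNet LR loaded: " + wordnet.getName());
  }
}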


9.24 Machine Learning in GATE [#]

Note: A brand new machine learning layer specifically targeted at NLP tasks, including text classification, chunk learning (e.g. for named entity recognition) and relation learning, has been added to GATE. See chapter 11 for more details.

9.24.1 ML Generalities

This section describes the use of Machine Learning (ML) algorithms in GATE.

An ML algorithm “learns” about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (and unforeseen) examples of the phenomenon.

Classification is a particular example of machine learning in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute the new examples into the existing classes.

This is the type of ML that is used in GATE and all further references to ML actually refer to classification.

Some definitions

GATE-specific interpretation of the above definitions

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. a statistical model, a decision tree, a rule set) from a dataset of already classified instances. During application, the model built during training is used to classify new instances.

There are ML algorithms which permit the incremental building of the model (e.g. the Updateable Classifiers in the WEKA library). These classifiers do not require the entire training dataset to build a model; the model improves with each new training instance that the algorithm is provided with.

9.24.2 The Machine Learning PR in GATE

Access to ML implementations is provided in GATE by the “Machine Learning PR”, which handles both the training and the application of an ML model on GATE documents. This PR is a Language Analyser, so it can be used in all default types of GATE controllers.

In order to allow for more flexibility, all the configuration parameters for the ML PR are set through an external XML file and not through the normal PR parameterisation. The root element of the file needs to be called “ML-CONFIG” and it contains two elements: “DATASET” and “ENGINE”. An example XML configuration file is given in Appendix F.

The DATASET element

The DATASET element defines the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An “INSTANCE-TYPE” element is used to select the annotation type to be used for instances, and the attributes are defined by a sequence of “ATTRIBUTE” elements.

For example, if an “INSTANCE-TYPE” element has “Token” as its value, there will be one instance in the dataset per “Token” annotation. This also means that the positions (see below) are defined in relation to Tokens. The “INSTANCE-TYPE” can be seen as the smallest unit to be taken into account for the machine learning.

An ATTRIBUTE element has the following sub-elements:

Since the VALUES are defined within XML, the characters <, > and & must be replaced by &lt;, &gt; and &amp;. It is recommended to write the XML configuration file in UTF-8 so that uncommon characters are parsed correctly.

Semantically, there are three types of attributes:

Figure 9.12 gives some examples of what the values of specified attributes would be in a situation when “Token” annotations are used as instances.


PIC


Figure 9.12: Sample attributes and their values


An ATTRIBUTELIST element is similar to ATTRIBUTE except that it has no POSITION sub-element but a RANGE element instead. It is expanded into several ATTRIBUTE elements with positions ranging from the value of the “from” attribute to the value of the “to” attribute. This can be used to avoid duplicating ATTRIBUTE elements.

The ENGINE element

The ENGINE element defines which particular ML implementation will be used, and allows the setting of options for that particular implementation.

The ENGINE element has three sub-elements:

9.24.3 The WEKA Wrapper

GATE provides a wrapper for the WEKA ML Library (http://www.cs.waikato.ac.nz/ml/weka/) in the form of the gate.creole.ml.weka.Wrapper class.

Options for the WEKA wrapper

The WEKA wrapper accepts the following options:

9.24.4 Training an ML model with the ML PR and WEKA wrapper

The ML PR has a Boolean runtime parameter named “training”. When the value of this parameter is set to true, the PR will collect a dataset of instances from the documents on which it is run. If the classifier used is an updatable classifier then the ML model will be built while collecting the dataset. If the selected classifier is not updatable, then the model will be built the first time a classification is attempted.

Training a model consists of designing a definition file for the ML PR, and creating an application containing an ML PR. When the application is run over a corpus, the dataset (and the model if possible) is built.

9.24.5 Applying a learnt model

Using the same ML PR, set the “training” parameter to false and run your application.

Depending on the type of the attribute that is marked as class, different actions will be performed when a classification occurs:

Once a model is learnt, it can be saved and reloaded at a later time. The WEKA wrapper also provides an operation for saving only the dataset in the ARFF format, which can be used for experiments in the WEKA interface. This could be useful for determining the best algorithm to be used and the optimal options for the selected algorithm.
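
The sketch below illustrates this training/application cycle from embedded code, by toggling the “training” runtime parameter of the ML PR inside a corpus pipeline. Only the “training” parameter is documented above; the PR class name (gate.creole.ml.MachineLearningPR), the configuration file parameter name (configFileURL) and the plugin directory name (Machine_Learning) are assumptions to be checked against the plugin’s creole.xml.

import gate.Corpus;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class MLTrainAndApply {
  public static void main(String[] args) throws Exception {
    Gate.init();
    Gate.getCreoleRegister().registerDirectories(
        new java.io.File(Gate.getPluginsHome(), "Machine_Learning").toURI().toURL());

    // create the ML PR, pointing it at the XML configuration file
    FeatureMap params = Factory.newFeatureMap();
    params.put("configFileURL",
        new java.io.File("ml-config.xml").toURI().toURL());
    ProcessingResource mlPR = (ProcessingResource)
        Factory.createResource("gate.creole.ml.MachineLearningPR", params);

    // a corpus pipeline that runs the PR over every document in a corpus
    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
    pipeline.add(mlPR);

    // phase 1: collect the training dataset (the model is built now if the
    // chosen classifier is updatable)
    Corpus trainingCorpus = Factory.newCorpus("training");
    // ... add pre-annotated documents to trainingCorpus here ...
    mlPR.setParameterValue("training", Boolean.TRUE);
    pipeline.setCorpus(trainingCorpus);
    pipeline.execute();

    // phase 2: apply the learnt model to new documents
    Corpus unseenCorpus = Factory.newCorpus("unseen");
    // ... add the documents to be classified here ...
    mlPR.setParameterValue("training", Boolean.FALSE);
    pipeline.setCorpus(unseenCorpus);
    pipeline.execute();
  }
}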

9.24.6 The MAXENT Wrapper [#]

GATE also provides a wrapper for the Open NLP MAXENT library (http://maxent.sourceforge.net/about.html). The MAXENT library provides an implementation of the maximum entropy learning algorithm, and can be accessed using the gate.creole.ml.maxent.MaxentWrapper class.

The MAXENT library requires all attributes except the class attribute to be boolean, and the class attribute to be boolean or nominal. (It should be noted that, within maximum entropy terminology, the class attribute is called the ‘outcome’.) Because the MAXENT library does not provide a specific format for data sets, there is no facility to save or load data sets separately from the model; should there be a need to do this, the WEKA wrapper can be used to collect the data.

Training a MAXENT model follows the same general procedure as for WEKA models, but the following difference should be noted. MAXENT models are not updateable, so the model will always be created and trained the first time a classification is attempted. The training of the model might take a considerable amount of time, depending on the amount of training data and the parameters of the model.

Options for the MAXENT Wrapper

9.24.7 The SVM Light Wrapper [#]

From version 3.0, GATE provides a wrapper for the SVM Light ML system (http://svmlight.joachims.org). SVM Light is a support vector machine implementation, written in C, which is provided as a set of command line programs. The GATE wrapper takes care of the mundane work of converting the data structures between GATE and SVM Light formats, and calls the command line programs in the right sequence, passing the data back and forth in temporary files. The <WRAPPER> value for this engine is gate.creole.ml.svmlight.SVMLightWrapper.

The SVM Light binaries themselves are not distributed with GATE – you should download the version for your platform from http://svmlight.joachims.org and place svm_learn and svm_classify on your path.

Classifying documents using the SVM Light wrapper is a two-phase procedure. In the first phase, the wrapper collects training data from pre-annotated documents and builds an SVM model; in the second phase, that model is used to classify unseen documents. Below we briefly describe an example of identifying the start time of a seminar in a corpus of emails announcing seminars, and provide more details later in the section.

Figure 9.13 explains step by step the process of collecting training data for the SVM classifier. GATE documents pre-annotated with annotations of type Class and feature type=’stime’ are used as the training data. In order to build the SVM model, we require start and end annotations for each stime annotation, so a pre-processing JAPE transduction script is used to mark sTimeStart and sTimeEnd annotations on the stime annotations. Following this step, the Machine Learning PR (the SVM Light wrapper) with its training mode set to true collects the training data from all the training documents. A GATE corpus pipeline, given a set of documents and PRs to execute on them, runs all the PRs over one document at a time; unless a separate pipeline is used, this makes it impossible to send the training data collected from all the documents to the wrapper within the same pipeline in order to build the SVM model. As a result, the model is not built at the time the training data is collected. The state of the wrapper can be saved to an external file once the training data has been collected.


PIC

Figure 9.13: Flow diagram explaining the SVM training data collection


Before classifying any unseen document, an SVM model must be available. In the absence of an up-to-date model, the wrapper builds a new one using the command line svm_learn utility and the training data collected from the training corpus. In other words, the first SVM model is built when the user tries to classify the first document. At this point the user has the option of saving the model to external storage, so that it can be reloaded before classifying other documents and does not have to be rebuilt every time a new set of documents is classified. Once the model is available, the wrapper classifies the unseen documents, creating new sTimeStart and sTimeEnd annotations over the text. Finally, a post-processing JAPE transduction script is used to combine these into a single stime annotation. Figure 9.14 explains this process.


PIC

Figure 9.14: Flow diagram explaining document classifying process


The wrapper allows support vector machines to be created which either do boolean classification or regression (estimation of numeric parameters), and so the class attribute can be boolean or numeric. Additionally, when learning a classifier, SVM Light supports transduction, whereby additional examples can be presented during training which do not have the value of the class attribute marked. Presenting such examples can, in some circumstances, greatly improve the performance of the classifier. To make use of this within GATE, the class attribute can be a three value nominal, in which case the first value specified for that nominal in the configuration file will be interpreted as true, the second as false and the third as unknown. Transduction will be used with any instances for which this attribute is set to the unknown value. It is also possible to use a two value nominal as the class attribute, in which case it will simply be interpreted as true or false.

The other attributes can be boolean, numeric or nominal, or any combination of these. If an attribute is nominal, each value of that attribute maps to a separate SVM Light feature. Each of these SVM Light features will be given the value 1 when the nominal attribute has the corresponding value, and will be omitted otherwise. If the value of the nominal is not specified in the configuration file or there is no value for an instance, then no feature will be added.

An extension to the basic functionality of SVM Light is that each attribute can receive a weighting. These weightings can be specified in the configuration file by adding <WEIGHTING> tags to the parts of the XML file specifying each attribute. The weighting for the attribute must be specified as a numeric value, placed between an opening <WEIGHTING> tag and a closing </WEIGHTING> one. Giving an attribute a greater weighting will cause it to play a greater role in learning the model and classifying data. This is achieved by multiplying the value of the attribute by the weighting before creating the training or test data that is passed to SVM Light. Any attribute left without an explicitly specified weighting is given a default weighting of one. Support for these weightings is contained in the Machine Learning PR itself, and so is available to other wrappers, though at the time of writing only the SVM Light wrapper makes use of them.

As with the MAXENT wrapper, SVM Light models are not updateable, so the model will be trained at the first classification attempt. The SVM Light wrapper supports <BATCH-MODE-CLASSIFICATION />, which should be used unless you have a very good reason not to.

The SVM Light wrapper allows both data sets and models to be loaded and saved to files in the same formats as those used by SVM Light when it is run from the command line. When a model is saved, a file will be created which contains information about the state of the SVM Light Wrapper, and which is needed to restore it when the model is loaded again. This file does not, however, contain any information about the SVM Light model itself. If an SVM Light model exists at the time of saving, and that model is up to date with respect to the current state of the training data, then it will be saved as a separate file, with the same name as the file containing information about the state of the wrapper, but with .NativePart appended to the filename. These files are in the standard SVM Light model format, and can be used with SVM Light when it is run from the command line. When a model is reloaded by GATE, both of these files must be available, and in the same directory, otherwise an error will result. However, if an up to date trained model does not exist at the time the model is saved, then only one file will be created upon saving, and only that file is required when the model is reloaded. So long as at least one training instance exists, it is possible to bring the model up to date at any point simply by classifying one or more instances (i.e. running the model with the training parameter set to false).

Options for the SVM Light engine

Only one <OPTIONS> subelement is currently supported:

9.25 MinorThird [#]

MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It was written primarily by William W. Cohen, a professor at Carnegie Mellon University in the Center for Automated Learning and Discovery, though contributions have been made by many other colleagues and students.

Minorthird’s toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. In Minorthird, a collection of documents is stored in a database called a “TextBase”. Logical assertions about documents in a TextBase can be made and stored in a special “TextLabels” object. “TextLabels” are a type of “stand-off annotation”: unlike XML markup (for instance), the annotations are completely independent of the text. This means that the text can be stored in its original form, and that many different types of (perhaps incompatible) annotations can be associated with the same TextBase.

Each TextLabels annotation asserts a category or property for a word, a document, or a subsequence of words. (In Minorthird, a sequence of adjacent words is called a “span”.) These annotations might be produced by human labelers, by a hand-written program, or by a learned program. TextLabels might encode syntactic properties (like shallow parses or part of speech tags) or semantic properties (like the functional role that entities play in a sentence). TextLabels can be nested, much like variable-binding environments can be nested in a programming language, which enables sets of hypothetical or temporary labels to be added in a local scope and then discarded.

Annotated TextBases are accessed in a single uniform way; however, they can be stored in one of several schemes. A Minorthird “repository” can be configured to hold a collection of TextLabels and their associated TextBases.

Moderately complex hand-coded annotation programs can be implemented with a special-purpose annotation language called Mixup, which is part of Minorthird. Mixup is based on the widely used notion of cascaded finite state transducers, but includes some powerful features, including a GUI debugging environment, escape to Java, and a kind of subroutine call mechanism. Mixup can also be used to generate features for learning algorithms, and all the text-based learning tools in Minorthird are closely integrated with Mixup. For instance, feature extractors used in a learned named-entity recognition package might call a Mixup program to perform initial preprocessing of the text.

Minorthird contains a number of methods for learning to extract and label spans from a document, or learning to classify spans (based on their content or context within a document). A special case of classifying spans is classifying entire documents. Minorthird includes a number of state-of-the-art sequential learning methods (like conditional random fields, and discriminative training methods for training hidden Markov models).

One practical difficulty in using learning techniques to solve NLP problems is that the input to learners is the result of a complex chain of transformations, which begin with text and end with very low-level representations. Verifying the correctness of this chain of derivations can be difficult. To address this problem, Minorthird also includes a number of tools for visualizing transformed data and relating it to the text from which it was derived.

More information about MinorThird can be found at http://minorthird.sourceforge.net/.

9.26 MIAKT NLG Lexicon [#]


PIC

Figure 9.15: Example NLG Lexicon

In order to lower the overhead of NLG lexicon development, we have created graphical tools for editing, storage, and maintenance of NLG lexicons, combined with a model which connects lexical entries to concepts and instances in the ontology. GATE also provides access to existing general-purpose lexicons such as WordNet and thus enables their use in NLG applications.

The structure of the NLG lexicons is similar to that of WordNet. Each lexical entry has a lemma, sense number, and syntactic information associated with it (e.g., part of speech, plural form). Each lexical entry also belongs to a synonym set or synset, which groups together all word senses which are synonymous. For example, as shown in Figure 9.15, the lemma “Magnetic Resonance Imaging scan” has one sense, its part of speech is noun, and it belongs to the synset containing also the first sense of the “MRI scan” lemma. Each synset also has a definition, which is shown in order to help the user when choosing the relevant synset for new word senses.

When the user adds a new lemma to the lexicon, it needs to be assigned to an existing synset. The editor also provides functionality for creating a new synset with its part of speech and definition (see Figure 9.16).


PIC

Figure 9.16: Editing synset information

The advantage of a synset-based lexicon is that while there can be a one-to-one mapping between concepts and instances in the ontology and synsets, the generator can still use different lexicalisations by choosing them among those listed in the synset (e.g., MRI or Magnetic Resonance Imaging). In other words, synsets effectively correspond to concepts or instances in the ontology and their entries specify possible lexicalisations of these concepts/instances in natural language.

At present, the NLG lexicon encodes only synonymy, while other non-lexical relations present in WordNet like hypernymy and hyponymy (i.e., superclass and subclass relations) are instead derived from the ontology, using the mapping between the synsets and concepts/instances. The reason behind this architectural choice comes from the fact that ontology-based generators ultimately need to use the ontology as the knowledge source. In this framework, the role of the lexicon is to provide lexicalisations for the ontology classes and instances.

9.26.1 Complexity and Generality [#]

The lexicon model was kept as generic as possible by making it incorporate only minimal lexical information. Additional, generator-specific information can be stored in a hash table, where values can be retrieved by their name. Since these are generator specific, the current lexicon user interface does not support editing of this information, although it can be accessed and modified programmatically.

On the other hand, the NLG lexicon is based on synonym sets, so generators which subscribe to a different model of synonymy might be able to access GATE-based NLG lexicons only via a wrapper mapping between the two models.

Given that the lexicon structure follows the WordNet synset model, such a lexicon can potentially be used for language analysis, if the application only requires synonymy. Our NLG lexicon model does not yet support the richer set of relations in WordNet such as hypernymy, although it is possible to extend the current model with richer relations. Since we used the lexicon in conjunction with the ontology, such non-linguistic relations were instead taken from the ontology.

The NLG lexicon itself is also independent from the generator’s input knowledge and its format.

9.27 Kea - Automatic Keyphrase Detection [#]

Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.

This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.

In order to use Kea in GATE, the “Kea” plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the “KEA Keyphrase Extractor” (a processing resource) and the “KEA Corpus Importer” (a visual resource associated with the PR).

9.27.1 Using the “KEA Keyphrase Extractor” PR

Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the “KEA Corpus Importer” tool. The usage of this tool is presented in a sub-section below.

Once an annotated corpus is obtained, the “KEA Keyphrase Extractor” PR can be used to build a model:

  1. load a “KEA Keyphrase Extractor”
  2. create a new “Corpus Pipeline” controller.
  3. set the corpus for the controller
  4. set the ‘trainingMode’ parameter for the PR to ‘true’
  5. run the application.

After these steps, the Kea PR contains a trained model. This can be used immediately by switching the ‘trainingMode’ parameter to ‘false’ and running the PR over the documents that need to be annotated with keyphrases. Another possibility is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the “Save model” option.

When a previously built model is available, the training procedure does not need to be repeated; the existing model can be loaded into memory by selecting the “Load model” option in the PR’s pop-up menu.


PIC

Figure 9.17: Parameters used by the Kea PR


The Kea PR uses several parameters as seen in Figure 9.17:

document
The document to be processed.
inputAS
The input annotation set. This parameter is only relevant when the PR is running in training mode and it specifies the annotation set containing the keyphrase annotations.
outputAS
The output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the ‘trainingMode’ parameter is set to false) and it specifies the annotation set where the generated keyphrase annotations will be saved.
minPhraseLength
the minimum length (in number of words) for a keyphrase.
minNumOccur
the minimum number of occurrences of a phrase for it to be a keyphrase.
maxPhraseLength
the maximum length of a keyphrase.
phrasesToExtract
how many different keyphrases should be generated.
keyphraseAnnotationType
the type of annotations used for keyphrases.
dissallowInternalPeriods
should internal periods be disallowed.
trainingMode
if ‘true’ the PR is running in training mode; otherwise it is running in application mode.
useKFrequency
should the K-frequency be used.

9.27.2 Using Kea corpora

The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the “KEA Corpus Importer” tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR’s name in the resources tree and then selecting the “KEA Corpus Importer” tab, see Figure 9.18.


PIC

Figure 9.18: Options for the “KEA Corpus Importer”


The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.

The user needs to specify a few values:

Source Directory
the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.
Extension for text files
the extension used for text files (by default .txt).
Extension for keyphrase files
the extension for the files listing keyphrases.
Encoding for input files
the encoding to be used when reading the files.
Corpus name
the name for the GATE corpus that will be created.
Output annotation set
the name for the annotation set that will contain the keyphrases read from the input files.
Keyphrase annotation type
the type for the generated annotations.

9.28 Ontotext JapeC Compiler [#]

Note: the JapeC compiler does not currently support the new JAPE language features introduced in July–September 2008. If you need to use negation, the @length and @string accessors, the contextual operators within and contains, or any comparison operators other than ==, then you will need to use the standard JAPE transducer instead of JapeC.

Japec is an alternative implementation of the JAPE language which works by compiling JAPE grammars into Java code. Compared to the standard implementation, these compiled grammars can be several times faster to run. At Ontotext, a modified version of the ANNIE sentence splitter using compiled grammars has been found to run up to five times as fast as the standard version. The compiler can be invoked manually from the command line, or used through the “Ontotext Japec Compiler” PR in the Jape_Compiler plugin.

The “Ontotext Japec Transducer” (com.ontotext.gate.japec.JapecTransducer) is a processing resource that is designed to be an alternative to the original Jape Transducer. You can simply replace gate.creole.Transducer with com.ontotext.gate.japec.JapecTransducer in your GATE application and it should work as expected. A short embedding sketch is given after the parameter lists below.

The Japec transducer takes the same parameters as the standard JAPE transducer:

grammarURL
the URL from which the grammar is to be loaded. Note that the Japec Transducer will only work on file: URLs. Also, the alternative binaryGrammarURL parameter of the standard transducer is not supported.
encoding
the character encoding used to load the grammars.
ontology
the ontology used for ontology-aware transduction.

Its runtime parameters are likewise the same as those of the standard transducer:

document
the document to process.
inputASName
name of the AnnotationSet from which input annotations to the transducer are read.
outputASName
name of the AnnotationSet to which output annotations from the transducer are written.
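
The sketch below shows the Japec transducer being created and run from embedded code with the parameters listed above. The class and parameter names are the ones given in this section and the plugin directory name (Jape_Compiler) is the one mentioned below; the grammar path and sample text are placeholders.

import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;

public class JapecExample {
  public static void main(String[] args) throws Exception {
    Gate.init();
    Gate.getCreoleRegister().registerDirectories(
        new java.io.File(Gate.getPluginsHome(), "Jape_Compiler").toURI().toURL());

    // init-time parameters: grammarURL must be a file: URL
    FeatureMap params = Factory.newFeatureMap();
    params.put("grammarURL",
        new java.io.File("grammars/main.jape").toURI().toURL());
    params.put("encoding", "UTF-8");

    ProcessingResource japec = (ProcessingResource) Factory.createResource(
        "com.ontotext.gate.japec.JapecTransducer", params);

    // runtime parameters, set per document
    Document doc = Factory.newDocument("A short text for the grammar to run over.");
    japec.setParameterValue("document", doc);
    japec.setParameterValue("inputASName", null);   // default annotation set
    japec.setParameterValue("outputASName", null);  // default annotation set
    japec.execute();
  }
}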

The Japec compiler itself is written in Haskell. Compiled binaries are provided for Windows, Linux (x86) and Mac OS X (PowerPC), so no Haskell interpreter is required to run Japec on these platforms. For other platforms, or if you make changes to the compiler source code, you can build the compiler yourself using the Ant build file in the Jape_Compiler plugin directory. You will need to install the latest version of the Glasgow Haskell Compiler and associated libraries. The Japec compiler can then be built by running:

../../bin/ant japec.clean japec

from the Jape_Compiler plugin directory.

9.29 ANNIC [#]

ANNIC (ANNotations-In-Context) is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Data-stores, called Searchable Serial Data-store (SSD).

ANNIC can index documents in any format supported by the GATE system (e.g. XML, HTML, RTF, e-mail, plain text). Compared with other such query systems, it has additional features addressing issues such as the extensive indexing of linguistic information associated with the document content, independently of the document format. It also allows the indexing and extraction of information from overlapping annotations and features. Its advanced graphical user interface provides a graphical view of annotation markups over the text, along with an ability to build new queries interactively. In addition, ANNIC can be used as a first step in rule development for NLP systems as it enables the discovery and testing of patterns in corpora.

ANNIC is built on top of Apache Lucene – a high performance, full-featured search engine implemented in Java, which supports indexing and searching of large document collections. Our choice of IR engine is due to the customisability of Lucene. For more details on how Lucene was modified to meet the requirements of indexing and querying annotations, please refer to [Aswani et al. 05].

As explained earlier, SSD is an extension of the serial data-store. In addition to the persist location, SSD asks the user to provide some more information (explained later) that it uses to index the documents. Once the SSD has been initiated, the user can add or remove documents/corpora to the SSD in the same way as with other data-stores. When documents are added to the SSD, it automatically tries to index them. It updates the index whenever there is a change in any of the documents stored in the SSD and removes a document from the index if it is deleted from the SSD. Be warned that only the annotation sets, types and features initially indexed will be updated when adding/removing documents to the datastore. This means, for example, that if you add a new annotation type to one of the indexed documents, it will not appear in the results when searching for it.

SSD has an advanced graphical interface that allows users to issue queries over the SSD. Below we explain the parameters required by SSD and how to instantiate it, how to use its graphical interface and how to use SSD programmatically.

9.29.1 Instantiating SSD

Steps:

  1. Right click on “Data Stores” and select “Create datastore”.
  2. From a drop-down list select “Lucene Based Searchable DataStore”.
  3. Here, you will see an input window. Please provide these parameters:
    1. DataStore URL: Select an empty folder where the DS is created.
    2. Index Location: Select an empty folder. This is where the index will be created.
    3. Annotation Sets: Here, you can provide one or more annotation sets that you wish to index or exclude from being indexed. In order to be able to index the default annotation set, you must click on the edit list icon and add an empty field to the list. If there are no annotation sets provided, all the annotation sets in all documents are indexed. In addition to all annotation sets a new combined annotation set is created in memory which is a union of all annotations from all the annotation sets of the document being indexed. This set is also indexed in order to allow users to issue queries across various annotation sets.
    4. Base-Token Type: (e.g. Token or Key.Token) These are the basic tokens of any document. Your documents must have annotations of the Base-Token type in order to be indexed. These basic tokens are used for displaying contextual information while searching patterns in the corpus. When indexing more than one annotation set, the user can specify the annotation set from which the tokens should be taken (e.g. Key.Token means annotations of type Token from the annotation set called Key). If the user does not provide an annotation set name (e.g. just Token), the system searches all the annotation sets to be indexed and the base tokens are taken from the first annotation set that contains base-token annotations. Please note that documents with no base tokens are not indexed. However, if the “create tokens automatically” option is selected, the SSD creates base tokens automatically; in this case, each string delimited by white space is considered a token.
    5. Index Unit Type: (e.g. Sentence, Key.Sentence) This specifies the unit of indexing. In other words, annotations lying within the boundaries of these annotations are indexed (e.g. in the case of “Sentence”, no annotations that span across the boundaries of two sentences are considered for indexing). The user can specify from which annotation set the index unit annotations should be taken. If the user does not provide an annotation set, the SSD searches among all annotation sets for index units. If this field is left empty or the SSD fails to locate index units, the entire document is considered as a single unit.
    6. Features: Finally, users can specify the annotation types and features that should be indexed or excluded from being indexed (e.g. SpaceToken and Split). If the user wants to exclude only a specific feature of a specific annotation type, it can be specified using a ‘.’ separator between the annotation type and its feature (e.g. Person.matches).
  4. Click OK. If all parameters are OK, a new empty DS will be created.
  5. Create an empty corpus and save it to the SSD.
  6. Populate it with some documents. Each document added to the corpus and eventually to the SSD is indexed automatically. If the document does not have the required annotations, that document is skipped and not indexed.

9.29.2 Search GUI


PIC

Figure 9.19: Searchable Serial Datastore Viewer.


Overview

Figure 9.19 gives a snapshot of the GUI. The top section contains a text area for writing a query, options to select the input data and the output format, and two icons to execute and delete a query. The central section shows a graphical visualisation of the annotations and values of the result selected in the results table at the bottom. You can also see the annotation rows manager window, where you define which annotation types and features to display in the central section. The bottom section contains the results table of the query, i.e. the texts that match the query together with their left and right contexts, the annotation set and the document they come from. The bottom section also contains a tabbed pane of statistics.

Syntax of queries

SSD enables you to formulate versatile queries using JAPE patterns. JAPE patterns support various query formats. Below we give a few examples of JAPE pattern clauses which can be used as SSD queries. Actual queries can also be a combination of one or more of the following pattern clauses:

  1. String
  2. {AnnotationType}
  3. {AnnotationType == String}
  4. {AnnotationType.feature == feature value}
  5. {AnnotationType1, AnnotationType2.feature == featureValue}
  6. {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}

JAPE patterns also support the | (OR) operator. For instance, {A} ({B}|{C}) is a pattern of two annotations where the first is an annotation of type A followed by an annotation of type either B or C. ANNIC supports two operators, + and *, to specify the number of times a particular annotation or sub-pattern should appear in the main query pattern. Here, ({A})+n means between one and n occurrences of annotation {A} and ({A})*n means between zero and n occurrences of annotation {A}.

Below we explain the steps to search in the SSD.

  1. Double click on SSD. You will see an extra tab “Lucene DataStore Searcher”. Click on it to activate the searcher GUI.
  2. Here you can specify a query to search in your SSD. The query here is a L.H.S. part of the JAPE grammar. Please refer to the following example queries:
    1. {Person} – This will return annotations of type Person from the SSD
    2. {Token.string == “Microsoft”} – This will return all occurrences of “Microsoft” from the SSD.
    3. {Person}({Token})*2{Organization} – Person followed by zero to two tokens followed by Organization.
    4. {Token.orth==“upperInitial”, Organization} – Token with feature orth with value set to “upperInitial” and which is also annotated as Organization.


PIC

Figure 9.20: Searchable Serial Datastore Viewer - Auto-completion.


Top section

A text area located in the top left part of the GUI is used to input a query. You can copy/cut/paste with Control+C/X/V and undo/redo your changes with Control+Z/Y as usual. To add a new line, use the Control+Enter key combination.

Auto-completion, shown in Figure 9.20, is triggered for annotation types when typing ‘{’ and for features when typing ‘.’ after a valid annotation type. It shows only the annotation types and features related to the selected corpus and annotation set. If you right-click on an expression it will automatically select the shortest valid enclosing braced expression, and if you click on a selection it will offer to add quantifiers allowing the expression to appear zero, one or more times.

To execute the query, click on the magnifying glass icon, press Enter or use the Alt+Enter key combination. To delete the query, click on the trash icon or use the Alt+Backspace key combination.

It is possible to have more than one corpus, each containing a different set of documents, stored in a single data-store. ANNIC, by providing a drop down box with a list of stored corpora, also allows searching within a specific (selected) corpus. Similarly a document can have more than one annotation set indexed and therefore ANNIC also provides a drop down box with a list of indexed annotation sets for the selected corpus.

A large corpus can have many hits for a given query. This may take a long time to refresh the GUI and may create inconvenience while browsing through patterns. ANNIC therefore allows you to specify a number of patterns that you wish to retrieve at once and provides a way to iterate through next pages with the Next Page of Results button. Due to technical complexities, it is not possible to visit a previous page. It is however possible to tick a check-box for retrieving all the results at the same time.

Central section

Annotation types and features to show can be selected from the annotation rows manager, opened by clicking on the Modify Rows button in the central section. When you choose to show a feature of an annotation type (e.g. the feature category for the annotation type Token), the central section shows coloured rectangles wherever that annotation type occurs, containing the values of that feature. When you choose to show only an annotation type, leaving the feature column empty, all its features are represented by empty rectangles whose feature values are shown in a pop-up window when the mouse is over the rectangle.

Shortcuts are expressions that stand for an “AnnotationType.Feature” expression. For example, in Figure 9.19, the shortcut “POS” stands for the expression “Token.category”. The purpose is to make the query more readable.

When you left-click on any of the rectangles in the annotation rows, the respective query expression is placed at the caret position in the query text area, or replaces the selected expression, if any. You can also click on a word on the first line to add it to the query.

Bottom section

In the table of results, ANNIC shows on each row a pattern retrieved by the last executed query, and provides a tool tip that shows the query that the selected pattern refers to.

Along with its left and right context texts, it also lists the names of the document and the annotation set that each pattern comes from. When the focus changes from one row to another, the central section is updated accordingly. You can sort a table column by clicking on its header.

By right-clicking on a result in the results table, you can remove it from the table or open the document containing it.

ANNIC provides an Export button to export all the results, or only the selected ones, to an HTML file.

A statistics tabbed pane can be displayed at the bottom right by clicking on the Statistics button. There is always a global statistics pane that lists the number of occurrences of all annotation types for the selected corpus and annotation set.

Statistics can be obtained in 16 different ways: for the whole datastore or for the matched spans of the query in the results, with or without context, and for an annotation type, an annotation type + feature, or an annotation type + feature + value. A second pane contains the one-item statistics that you can add by right-clicking on a non-empty rectangle or on the header of a row in the central section. You can sort a table column by clicking on its header.

9.29.3 Using SSD from your code

 
//how to instantiate a searchable datastore
//==========================================

// create an instance of the datastore
// (dsLocation is the URL of an empty folder where the datastore is created,
// indexLocation is the folder where the index will be created)
LuceneDataStoreImpl ds = (LuceneDataStoreImpl)
Factory.createDataStore("gate.persist.LuceneDataStoreImpl", dsLocation);

// we need to set an Indexer
Indexer indexer = new LuceneIndexer(new URL(indexLocation));

// set the parameters
Map parameters = new HashMap();

// specify the index url
parameters.put(Constants.INDEX_LOCATION_URL, new URL(indexLocation));

// specify the base token type
// and specify that the tokens should be created automatically
// if not found in the document
parameters.put(Constants.BASE_TOKEN_ANNOTATION_TYPE, "Token");
parameters.put(Constants.CREATE_TOKENS_AUTOMATICALLY, Boolean.TRUE);

// specify the index unit type
parameters.put(Constants.INDEX_UNIT_ANNOTATION_TYPE, "Sentence");

// specifying the annotation sets "Key" and the default annotation set
// to be indexed
List<String> setsToInclude = new ArrayList<String>();
setsToInclude.add("Key");
setsToInclude.add("<null>");
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_INCLUDE, setsToInclude);
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_EXCLUDE, new ArrayList<String>());

// all features should be indexed
parameters.put(Constants.FEATURES_TO_INCLUDE, new ArrayList<String>());
parameters.put(Constants.FEATURES_TO_EXCLUDE, new ArrayList<String>());

// set the indexer
ds.setIndexer(indexer, parameters);

// set the searcher
ds.setSearcher(new LuceneSearcher());

//how to search in this datastore
//===============================

// obtain the searcher instance
Searcher searcher = ds.getSearcher();
Map searchParams = new HashMap();

// obtain the url of the index
String indexLocation =
new File(((URL) ds.getIndexer().getParameters().get(Constants.INDEX_LOCATION_URL))
.getFile()).getAbsolutePath();
ArrayList indexLocations = new ArrayList();
indexLocations.add(indexLocation);

// the corpus to search in: the name of a corpus that was indexed
// (placeholder value)
String corpus2SearchIn = "corpusName";

// the annotation set to search in
String annotationSet2SearchIn = "Key";

// the size of the context window and the number of results to retrieve
// (placeholder values)
Integer contextWindow = new Integer(5);
Integer noOfPatterns = new Integer(50);

// set the parameters
searchParams.put(Constants.INDEX_LOCATIONS, indexLocations);
searchParams.put(Constants.CORPUS_ID, corpus2SearchIn);
searchParams.put(Constants.ANNOTATION_SET_ID, annotationSet2SearchIn);
searchParams.put(Constants.CONTEXT_WINDOW, contextWindow);
searchParams.put(Constants.NO_OF_PATTERNS, noOfPatterns);

// search
String query = "{Person}";
Hit[] hits = searcher.search(query, searchParams);

9.30 Annotation Merging [#]

If we have annotations about the same subject on the same document from different annotators, we may need to merge those annotations to form a unified annotation set. The merging is applied to annotations of the same type in different annotation sets of the same document. We have implemented two approaches to annotation merging. The first method takes a parameter numMinK and selects those annotations on which at least numMinK annotators agree. If two or more merged annotations have the same span, then the annotation with the most supporters is kept and the other annotations with that span are discarded. The second method selects, from the annotations with the same span, the one which the majority of the annotators support. Note that if one annotator did not create an annotation with the particular span, we count this as one non-support for the annotation with that span. If it turns out that the majority of the annotators did not support the annotation with that span, then no annotation with that span is put into the merged annotations.

9.30.1 Two implemented methods

The two merging methods are implemented as static methods in the class gate.util.AnnotationMerging. The two methods have very similar input and output parameters. Each of the methods takes as input an array of annotation sets, which should contain annotations of the same type on the same document from different annotators. If there is an annotation feature indicating the annotation label, the name of that feature is another input; otherwise, the input parameter for the annotation feature should be set to null. The output is a map whose keys are the merged annotations and whose values represent the annotators (in terms of the indices of the array of annotation sets) who support each annotation. The methods also have a boolean input parameter indicating whether or not the annotations from the different annotators are based on the same set of instances, which can be determined using the static method public boolean isSameInstancesForAnnotators(AnnotationSet[] annsA) in the class gate.util.IaaCalculation. One instance corresponds to all the annotations with the same span. If the annotation sets are based on the same set of instances, the merging methods will ensure that the merged annotations are also based on the same set of instances.
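
The sketch below shows how these methods might be called. It is a sketch only: the name, signature and return type of the merging call (written here as mergeAnnotation) are assumptions based on the description above and must be checked against the JavaDoc of gate.util.AnnotationMerging; the isSameInstancesForAnnotators call is the one quoted above, and the annotation set and feature names are placeholders.

import java.util.Map;

import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;
import gate.util.AnnotationMerging;
import gate.util.IaaCalculation;

public class MergeExample {
  /** Merge "Mention" annotations produced by three annotators on one document. */
  public static void merge(Document doc) {
    AnnotationSet[] annotators = new AnnotationSet[] {
      doc.getAnnotations("annotator1").get("Mention"),
      doc.getAnnotations("annotator2").get("Mention"),
      doc.getAnnotations("annotator3").get("Mention")
    };

    // are the annotators working over the same set of instances?
    boolean sameInstances =
        IaaCalculation.isSameInstancesForAnnotators(annotators);

    // assumed call: keep annotations that at least numMinK = 2 annotators
    // agree on, using the "class" feature as the annotation label
    Map<Annotation, String> merged =
        AnnotationMerging.mergeAnnotation(annotators, "class", 2, sameInstances);

    System.out.println("Merged annotations: " + merged.size());
  }
}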

9.30.2 Annotation Merging Plugin

The annotation merging methods are wrapped in a plugin so that they can be used as a PR from the GATE GUI. The plugin can be used as a PR in a pipeline or corpus pipeline application. To use the PR, each document in the pipeline or corpus pipeline should have the annotation sets to be merged. The annotation merging PR has no initialisation parameters but has several run-time parameters specifying the annotation merging task, explained in the following.

9.31 OntoRoot Gazetteer [#]

OntoRoot Gazetteer is a dynamically created gazetteer which, in combination with a few other generic GATE resources, is capable of producing ontology-aware annotations over the given content with regard to a given ontology. This gazetteer is part of the Ontology_Based_Gazetteer plugin, which has been developed as part of the TAO project.

9.31.1 How does it work? [#]

To produce ontology-aware annotations, i.e. annotations that link to specific concepts or relations from the ontology, it is essential to pre-process the Ontology Resources (e.g., Classes, Instances, Properties) and extract their human-understandable lexicalisations.

As a precondition for extracting human-understandable content from the ontology, a list of the following is first created:

Each item from the list is further processed so that:

Each item from this list is analysed separately by the Onto Root Application (ORA) on execution (see Figure 9.21). The Onto Root Application first tokenises each linguistic term, then assigns part-of-speech and lemma information to each token.


PIC   

Figure 9.21: Building Ontology Resource Root (OntoRoot) Gazetteer from the Ontology


As a result of this pre-processing, each token in the terms will have an additional feature named ‘root’, which contains the lemma created by the morphological analyser. It is this lemma, or a set of lemmas, which is then added to the dynamic gazetteer list created from the ontology.

For instance, if there is a resource with a short name (i.e., fragment identifier) ProjectName, without any assigned properties, then before executing the OntoRoot gazetteer the created list will contain the following strings:

Each of the items from the list is then analysed separately, and the results would be the same as the input strings, as all the entries are nouns given in singular form.

9.31.2 Initialisation of OntoRoot Gazetteer [#]

To initialise the gazetteer there are a few mandatory parameters:

and a few optional ones:

9.32 Chinese Word Segmentation [#]

Unlike English, Chinese text does not have a symbol (or delimiter) such as a blank space to explicitly separate a word from the surrounding words. Therefore, for automatic Chinese text processing, we may need a system to recognise the words in Chinese text, a problem known as Chinese word segmentation. The plugin described in this section performs the task of Chinese word segmentation. It is based on our work using the Perceptron learning algorithm for the Chinese word segmentation task of Sighan 2005 [Li et al. 05c]. Our Perceptron-based system achieved very good performance in the Sighan-05 task.

The plugin is named ChineseSegmenter and is available in the GATE distribution. The corresponding processing resource is called Chinese Segmenter PR. Once you load the PR into GATE, you may put it into a Pipeline application. The plugin can be used to learn a model from segmented Chinese text used as training data. It can also use the learned model to segment Chinese text. The plugin can use different learning algorithms to learn different models and can deal with different encodings for Chinese text, such as UTF-8, GB2312 or BIG5. All these options can be selected by setting the run-time parameters of the plugin.

The plugin has five run-time parameters, which are described in the following.

The following PAUM models are available for the plugin and can be downloaded from http://www.dcs.shef.ac.uk/~yaoyong/. These models were learned using the PAUM learning algorithm from the corpora provided by the Sighan-05 bakeoff task.

As you can see, those models were learned using different training data, and different Chinese text encodings of the same training data. The PKU training data are news articles published in mainland China and use simplified Chinese, while the AS training data are news articles published in Taiwan and use traditional Chinese. Hence, if your text is in simplified Chinese, you should use the models trained on the PKU data, and if your text is in traditional Chinese, you should use the models trained on the AS data. If your data are in GB2312 or a compatible encoding, you should use the model trained on the corpus in GB2312.

Note that segmented Chinese text (either used as training data or produced by this plugin) uses blank spaces to separate a word from its surrounding words. Hence, if your data are in a Unicode encoding such as UTF-8, you can use the GATE Unicode Tokeniser to process the segmented text and add Token annotations to your text to represent the Chinese words. Once you have annotations for all the Chinese words, you can perform further processing such as POS tagging and named entity recognition.
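
The sketch below shows the tokenisation step just described: running the GATE Unicode Tokeniser over already-segmented text so that each space-delimited word becomes a Token annotation. It assumes the ANNIE plugin provides the tokeniser's default rules; the sample text is a placeholder.

import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.LanguageAnalyser;

public class TokeniseSegmentedText {
  public static void main(String[] args) throws Exception {
    Gate.init();
    Gate.getCreoleRegister().registerDirectories(
        new java.io.File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

    // output of the Chinese Segmenter PR: words separated by blank spaces
    Document doc = Factory.newDocument("\u4e2d\u56fd \u7ecf\u6d4e \u53d1\u5c55");

    // the GATE Unicode Tokeniser (default rules from the ANNIE plugin)
    LanguageAnalyser tokeniser = (LanguageAnalyser) Factory.createResource(
        "gate.creole.tokeniser.SimpleTokeniser");
    tokeniser.setDocument(doc);
    tokeniser.execute();

    System.out.println("Tokens: " + doc.getAnnotations().get("Token").size());
  }
}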

9.33 Copying Annotations Between Documents [#]

Sometimes a document has two copies, each of which has been annotated by a different annotator for the same task. We may then want to copy the annotations from one copy of the document to the other, in order to save them using less memory, or to process them with the annotation merging plugin or the IAA plugin. This plugin does exactly this task: it copies the specified annotations from one document to another document.

The plugin is named copyAS2AnoDoc and is available with the GATE distribution. When the plugin is loaded into GATE, it is represented by the processing resource Copy Anns to Another Doc PR. You need to put the PR into a Corpus Pipeline to use it. The plugin does not have any initialisation parameters. It has several run-time parameters, which specify the annotations to be copied, the source documents and the target documents. In detail, the run-time parameters are:

The Corpus parameter of the Corpus Pipeline application containing the plugin specifies a corpus which contains the target documents. Given one (target) document in the corpus, the plugin tries to find a source document in the source directory specified by the parameter sourceFilesURL, according to the similarity of the names of the source and target documents. The similarity of two file names is calculated by comparing the two names from the beginning of the strings: the more characters two names share at the start, the greater their similarity. For example, suppose two target documents have the names aabcc.xml and abcab.xml and the three source files have the names abacc.xml, abcbb.xml and aacc.xml, respectively. Then the target document aabcc.xml has the corresponding source document aacc.xml, and abcab.xml has the corresponding source document abcbb.xml. The plugin will copy the annotations within the same document if the source and target directories are the same.
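
The snippet below is an illustrative sketch (not the plugin's actual code) of the name-matching heuristic just described: the similarity of two file names is the number of characters they share at the start, and each target document is paired with the most similar source name.

public class NamePrefixSimilarity {
  /** Number of leading characters shared by the two names. */
  static int similarity(String a, String b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    String[] sources = { "abacc.xml", "abcbb.xml", "aacc.xml" };
    for (String target : new String[] { "aabcc.xml", "abcab.xml" }) {
      String best = sources[0];
      for (String source : sources) {
        if (similarity(target, source) > similarity(target, best)) {
          best = source;
        }
      }
      // prints aabcc.xml -> aacc.xml and abcab.xml -> abcbb.xml,
      // matching the example in the text
      System.out.println(target + " -> " + best);
    }
  }
}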

Chapter 10
Working with Ontologies [#]

An increasing number of NLP projects make use of taxonomic data structures and of ontologies. The use of NLP techniques for (semi-)automatically generating Semantic Web meta-data is also a growing trend. The advancements in the Semantic Web research area have led to a variety of standards for representing ontologies and an increasing number of tools and programming libraries for managing ontologies are becoming available. All this underlines the need for NLP systems to access ontological information and has led to the addition of support for ontologies in GATE.

The various ontology representation formalisms (such as RDF-Schema [Lassila & Swick 99], OWL and its variants [Dean et al. 04], DAML-OIL [Horrocks & vanHarmelen 01]) have their advantages and disadvantages as well as their idiosyncrasies. Rather than attempting to choose one of the formalisms based on what can only be subjective criteria, and running the risk of obsolescence when that particular formalism falls out of favour with the research community or is superseded by a newer one, the GATE ontology support aims at providing an abstraction layer between the actual representation mechanism and the NLP modules making use of it. It consists of an in-memory data model for ontologies, an API providing access to that representation, a visual resource displaying the information, and input/output capabilities for accessing files containing ontologies using various standards. This approach has well-proven benefits, because it enables each application to use this format-independent model when dealing with ontologies, thus making the application immune to changes in the underlying ontology formats. If a new format needs to be supported, the application can automatically start using ontologies in this format, simply by including the correct tool that converts the format into the common model. From a language engineer’s perspective the advantage is that they only need to learn one API and model, rather than having to deal with many different ontology formats. This approach is similar to the way we deal with document formats.

10.1 Data Model for Ontologies [#]

In order to work as an abstraction layer, the GATE ontology implementation supports only those features common to all formalisms, which are also the features most widely used. All the information that is specific to a given representation model and cannot be represented in GATE is ignored. Currently, the ontology data model has support for hierarchies of classes and restrictions, hierarchies of properties and instances (also known as individuals).

10.1.1 Hierarchies of classes and restrictions

The central role in the ontology data model is played by the class hierarchy (or taxonomy). This consists of a set of classes linked by subClassOf, superClassOf and equivalentClassAs relations. Each ontology class h