Log in Help
Print
Homereleasesgate-5.1-build3431-ALLdoc 〉 tao
 


Developing Language Processing Components with GATE
Version 5 (a User Guide)
  For GATE version 5.1
  (built December 8, 2009)

  Hamish Cunningham
  Diana Maynard
  Kalina Bontcheva
  Valentin Tablan
  Marin Dimitrov
  Mike Dowman
  Niraj Aswani
  Ian Roberts
  Yaoyong Li
  Adam Funk
  Genevieve Gorrell
  Johann Petrak
  Horacio Saggion
  Danica Damljanovic
  Angus Roberts

  The University of Sheffield 2001-2009

  http://gate.ac.uk/

PDF version

Single HTML page

Multiple HTML pages


Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Matrixware, the Information Retrieval Facility and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, MUSING, KnowledgeWeb, PrestoSpace, h-TechSight, enIRaF).

Contents

I  GATE Basics
1 Introduction
 1.1 How to Use this Text
 1.2 Context
 1.3 Overview
  1.3.1 Developing and Deploying Language Processing Facilities
  1.3.2 Built-In Components
  1.3.3 Additional Facilities
  1.3.4 An Example
 1.4 Some Evaluations
 1.5 Changes in this Version
  1.5.1 Version 5.1 (December 2009)
 1.6 Further Reading
2 Installing and Running GATE
 2.1 Downloading GATE
 2.2 Installing and Running GATE
  2.2.1 The Easy Way
  2.2.2 The Hard Way (1)
  2.2.3 The Hard Way (2): Subversion
 2.3 Using System Properties with GATE
 2.4 Configuring GATE
 2.5 Building GATE
  2.5.1 Using GATE with Maven or JPF
 2.6 Uninstalling GATE
 2.7 Troubleshooting
  2.7.1 I don’t see the Java console messages under Windows
  2.7.2 When I execute GATE, nothing happens
  2.7.3 On Ubuntu, GATE is very slow or doesn’t start
  2.7.4 How to use GATE on a 64 bit system?
  2.7.5 I got the error: Could not reserve enough space for object heap
  2.7.6 I got the error: log4j:WARN No appenders could be found for logger...
  2.7.7 Text is incorrectly refreshed after scrolling and become unreadable
3 Using GATE Developer
 3.1 The GATE Developer Main Window
 3.2 Loading and Viewing Documents
 3.3 Creating and Viewing Corpora
 3.4 Working with Annotations
  3.4.1 The Annotation Sets View
  3.4.2 The Annotations List View
  3.4.3 The Annotations Stack View
  3.4.4 The Co-reference Editor
  3.4.5 Creating and Editing Annotations
  3.4.6 Schema-Driven Editing
  3.4.7 Printing Text with Annotations
 3.5 Using CREOLE Plugins
 3.6 Loading and Using Processing Resources
 3.7 Creating and Running an Application
  3.7.1 Running PRs Conditionally on Document Features
  3.7.2 Doing Information Extraction with ANNIE
  3.7.3 Modifying ANNIE
 3.8 Saving Applications and Language Resources
  3.8.1 Saving Documents to File
  3.8.2 Saving and Restoring LRs in Datastores
  3.8.3 Saving Resource Parameter States to File
  3.8.4 Saving an Application with its Resources (e.g. GATE Teamware)
 3.9 Keyboard Shortcuts
 3.10 Miscellaneous
  3.10.1 Stopping GATE from Restoring Developer Sessions/Options
  3.10.2 Working with Unicode
4 CREOLE: the GATE Component Model
 4.1 The Web and CREOLE
 4.2 The GATE Framework
 4.3 The Lifecycle of a CREOLE Resource
 4.4 Processing Resources and Applications
 4.5 Language Resources and Datastores
 4.6 Built-in CREOLE Resources
 4.7 CREOLE Resource Configuration
  4.7.1 Configuration with XML
  4.7.2 Configuring Resources using Annotations
  4.7.3 Mixing the Configuration Styles
5 Language Resources: Corpora, Documents and Annotations
 5.1 Features: Simple Attribute/Value Data
 5.2 Corpora: Sets of Documents plus Features
 5.3 Documents: Content plus Annotations plus Features
 5.4 Annotations: Directed Acyclic Graphs
  5.4.1 Annotation Schemas
  5.4.2 Examples of Annotated Documents
  5.4.3 Creating, Viewing and Editing Diverse Annotation Types
 5.5 Document Formats
  5.5.1 Detecting the Right Reader
  5.5.2 XML
  5.5.3 HTML
  5.5.4 SGML
  5.5.5 Plain text
  5.5.6 RTF
  5.5.7 Email
 5.6 XML Input/Output
6 ANNIE: a Nearly-New Information Extraction System
 6.1 Document Reset
 6.2 Tokeniser
  6.2.1 Tokeniser Rules
  6.2.2 Token Types
  6.2.3 English Tokeniser
 6.3 Gazetteer
 6.4 Sentence Splitter
 6.5 RegEx Sentence Splitter
 6.6 Part of Speech Tagger
 6.7 Semantic Tagger
 6.8 Orthographic Coreference (OrthoMatcher)
  6.8.1 GATE Interface
  6.8.2 Resources
  6.8.3 Processing
 6.9 Pronominal Coreference
  6.9.1 Quoted Speech Submodule
  6.9.2 Pleonastic It Submodule
  6.9.3 Pronominal Resolution Submodule
  6.9.4 Detailed Description of the Algorithm
 6.10 A Walk-Through Example
  6.10.1 Step 1 - Tokenisation
  6.10.2 Step 2 - List Lookup
  6.10.3 Step 3 - Grammar Rules
II  GATE for Advanced Users
7 GATE Embedded
 7.1 Quick Start with GATE Embedded
 7.2 Resource Management in GATE Embedded
 7.3 Using CREOLE Plugins
 7.4 Language Resources
  7.4.1 GATE Documents
  7.4.2 Feature Maps
  7.4.3 Annotation Sets
  7.4.4 Annotations
  7.4.5 GATE Corpora
 7.5 Processing Resources
 7.6 Controllers
 7.7 Persistent Applications
 7.8 Ontologies
 7.9 Creating a New Annotation Schema
 7.10 Creating a New CREOLE Resource
 7.11 Adding Support for a New Document Format
 7.12 Using GATE Embedded in a Multithreaded Environment
 7.13 Using GATE Embedded within a Spring Application
 7.14 Using GATE Embedded within a Tomcat Web Application
  7.14.1 Recommended Directory Structure
  7.14.2 Configuration Files
  7.14.3 Initialization Code
 7.15 Groovy Scripting for GATE
 7.16 Saving Config Data to gate.xml
 7.17 Annotation merging through the API
8 JAPE: Regular Expressions over Annotations
 8.1 The Left-Hand Side
  8.1.1 Matching a Simple Text String
  8.1.2 Matching Entire Annotation Types
  8.1.3 Using Attributes and Values
  8.1.4 Using Meta-Properties
  8.1.5 Multiple Pattern/Action Pairs
  8.1.6 LHS Macros
  8.1.7 Using Context
  8.1.8 Multi-Constraint Statements
  8.1.9 Negation
  8.1.10 Escaping Special Characters
 8.2 LHS Operators in Detail
  8.2.1 Compositional Operators
  8.2.2 Matching Operators
 8.3 The Right-Hand Side
  8.3.1 A Simple Example
  8.3.2 Copying Feature Values from the LHS to the RHS
  8.3.3 RHS Macros
 8.4 Use of Priority
 8.5 Using Phases Sequentially
 8.6 Using Java Code on the RHS
  8.6.1 A More Complex Example
  8.6.2 Adding a Feature to the Document
  8.6.3 Finding the Tokens of a Matched Annotation
  8.6.4 Using Named Blocks
  8.6.5 Java RHS Overview
 8.7 Optimising for Speed
 8.8 Ontology Aware Grammar Transduction
 8.9 Serializing JAPE Transducer
  8.9.1 How to Serialize?
  8.9.2 How to Use the Serialized Grammar File?
 8.10 The JAPE Debugger
 8.11 Notes for Montreal Transducer Users
9 ANNIC: ANNotations-In-Context
 9.1 Instantiating SSD
 9.2 Search GUI
  9.2.1 Overview
  9.2.2 Syntax of Queries
  9.2.3 Top Section
  9.2.4 Central Section
  9.2.5 Bottom Section
 9.3 Using SSD from GATE Embedded
10 Performance Evaluation of Language Analysers
 10.1 Metrics for Evaluation in Information Extraction
  10.1.1 Annotation Relations
  10.1.2 Cohen’s Kappa
  10.1.3 Precision, Recall, F-Measure
  10.1.4 Macro and Micro Averaging
 10.2 The Annotation Diff Tool
  10.2.1 Performing Evaluation with the Annotation Diff Tool
 10.3 Corpus Quality Assurance
  10.3.1 Description of the interface
  10.3.2 Step by step usage
  10.3.3 Details of the Corpus statistics table
  10.3.4 Details of the Document statistics table
 10.4 Corpus Benchmark Tool
  10.4.1 Preparing the Corpora for Use
  10.4.2 Defining Properties
  10.4.3 Running the Tool
  10.4.4 The Results
 10.5 A Plugin Computing Inter-Annotator Agreement (IAA)
  10.5.1 IAA for Classification
  10.5.2 IAA For Named Entity Annotation
  10.5.3 The BDM-Based IAA Scores
 10.6 A Plugin Computing the BDM Scores for an Ontology
11 Profiling Processing Resources
 11.1 Overview
  11.1.1 Features
  11.1.2 Limitations
 11.2 Graphical User Interface
 11.3 Command Line Interface
 11.4 Application Programming Interface
  11.4.1 Log4j.properties
  11.4.2 Benchmark log format
  11.4.3 Enabling profiling
  11.4.4 Reporting tool
12 Developing GATE
 12.1 Reporting Bugs and Requesting Features
 12.2 Contributing Patches
 12.3 Creating New Plugins
  12.3.1 Where to Keep Plugins in the GATE Hierarchy
  12.3.2 What to Call your Plugin
  12.3.3 Writing a New PR
  12.3.4 Writing a New VR
  12.3.5 Adding Plugins to the Nightly Build
 12.4 Updating this User Guide
  12.4.1 Building the User Guide
  12.4.2 Making Changes to the User Guide
III  CREOLE Plugins
13 Gazetteers
 13.1 Introduction to Gazetteers
  13.1.1 Creating and Modifying Gazetteer Lists
 13.2 Gazetteer Visual Resource - GAZE
  13.2.1 Display Modes
  13.2.2 Linear Definition Pane
  13.2.3 Linear Definition Toolbar
  13.2.4 Operations on Linear Definition Nodes
  13.2.5 Gazetteer List Pane
  13.2.6 Mapping Definition Pane
 13.3 OntoGazetteer
 13.4 Gaze Ontology Gazetteer Editor
  13.4.1 The Gaze Gazetteer List and Mapping Editor
  13.4.2 The Gaze Ontology Editor
 13.5 Hash Gazetteer
  13.5.1 Prerequisites
  13.5.2 Parameters
 13.6 Flexible Gazetteer
 13.7 Gazetteer List Collector
 13.8 OntoRoot Gazetteer
  13.8.1 How Does it Work?
  13.8.2 Initialisation of OntoRoot Gazetteer
 13.9 Large KB Gazetteer
  13.9.1 Quick usage overview
  13.9.2 Dictionary setup
  13.9.3 Additional dictionary configuration
  13.9.4 Processing Resource Configuration
  13.9.5 Runtime configuration
  13.9.6 Semantic Enrichment PR
14 Working with Ontologies
 14.1 Data Model for Ontologies
  14.1.1 Hierarchies of Classes and Restrictions
  14.1.2 Instances
  14.1.3 Hierarchies of Properties
  14.1.4 URIs
 14.2 Ontology Event Model
  14.2.1 What Happens when a Resource is Deleted?
 14.3 The Ontology Plugin: Current Implementation
  14.3.1 The OWLIMOntology Language Resource
  14.3.2 The ConnectSesameOntology Language Resource
  14.3.3 The CreateSesameOntology Language Resource
  14.3.4 The OWLIM2 Backwards-Compatible Language Resource
 14.4 The Ontology_OWLIM2 plugin: backwards-compatible implementation
  14.4.1 The OWLIMOntologyLR Language Resource
 14.5 GATE Ontology Editor
 14.6 Ontology Annotation Tool
  14.6.1 Viewing Annotated Text
  14.6.2 Editing Existing Annotations
  14.6.3 Adding New Annotations
  14.6.4 Options
 14.7 Using the ontology API
 14.8 Using the ontology API (old version)
 14.9 Ontology-Aware JAPE Transducer
 14.10 Annotating Text with Ontological Information
 14.11 Populating Ontologies
 14.12 Ontology API and Implementation Changes
  14.12.1 Differences between the implementation plugins
  14.12.2 Changes in the Ontology API
15 Machine Learning
 15.1 ML Generalities
  15.1.1 Some Definitions
  15.1.2 GATE-Specific Interpretation of the Above Definitions
 15.2 Batch Learning PR
  15.2.1 Batch Learning PR Configuration File Settings
  15.2.2 Case Studies for the Three Learning Types
  15.2.3 How to Use the Batch Learning PR in GATE Developer
  15.2.4 Output of the Batch Learning PR
  15.2.5 Using the Batch Learning PR from the API
 15.3 Machine Learning PR
  15.3.1 The DATASET Element
  15.3.2 The ENGINE Element
  15.3.3 The WEKA Wrapper
  15.3.4 The MAXENT Wrapper
  15.3.5 The SVM Light Wrapper
  15.3.6 Example Configuration File
16 Tools for Alignment Tasks
 16.1 Introduction
 16.2 The Tools
  16.2.1 Compound Document
  16.2.2 Compound Document Editor
  16.2.3 Composite Document
  16.2.4 DeleteMembersPR
  16.2.5 SwitchMembersPR
  16.2.6 Saving as XML
  16.2.7 Alignment Editor
  16.2.8 Section-by-Section Processing
17 Parsers and Taggers
 17.1 Verb Group Chunker
 17.2 Noun Phrase Chunker
  17.2.1 Differences from the Original
  17.2.2 Using the Chunker
 17.3 Tree Tagger
  17.3.1 POS Tags
 17.4 TaggerFramework
 17.5 Chemistry Tagger
  17.5.1 Using the Tagger
 17.6 ABNER
 17.7 Stemmer
  17.7.1 Algorithms
 17.8 GATE Morphological Analyzer
  17.8.1 Rule File
 17.9 MiniPar Parser
  17.9.1 Platform Supported
  17.9.2 Resources
  17.9.3 Parameters
  17.9.4 Prerequisites
  17.9.5 Grammatical Relationships
 17.10 RASP Parser
 17.11 SUPPLE Parser
  17.11.1 Requirements
  17.11.2 Building SUPPLE
  17.11.3 Running the Parser in GATE
  17.11.4 Viewing the Parse Tree
  17.11.5 System Properties
  17.11.6 Configuration Files
  17.11.7 Parser and Grammar
  17.11.8 Mapping Named Entities
  17.11.9 Upgrading from BuChart to SUPPLE
 17.12 Stanford Parser
  17.12.1 Input Requirements
  17.12.2 Initialization Parameters
  17.12.3 Runtime Parameters
 17.13 OpenCalais, LingPipe and OpenNLP
18 Combining GATE and UIMA
 18.1 Embedding a UIMA AE in GATE
  18.1.1 Mapping File Format
  18.1.2 The UIMA Component Descriptor
  18.1.3 Using the AnalysisEnginePR
 18.2 Embedding a GATE CorpusController in UIMA
  18.2.1 Mapping File Format
  18.2.2 The GATE Application Definition
  18.2.3 Configuring the GATEApplicationAnnotator
19 More (CREOLE) Plugins
 19.1 Language Plugins
  19.1.1 French Plugin
  19.1.2 German Plugin
  19.1.3 Romanian Plugin
  19.1.4 Arabic Plugin
  19.1.5 Chinese Plugin
  19.1.6 Hindi Plugin
 19.2 Flexible Exporter
 19.3 Annotation Set Transfer
 19.4 Information Retrieval in GATE
  19.4.1 Using the IR Functionality in GATE
  19.4.2 Using the IR API
 19.5 Websphinx Web Crawler
  19.5.1 Using the Crawler PR
 19.6 Google Plugin
 19.7 Yahoo Plugin
  19.7.1 Using the YahooPR
 19.8 WordNet in GATE
  19.8.1 The WordNet API
 19.9 Kea - Automatic Keyphrase Detection
  19.9.1 Using the ‘KEA Keyphrase Extractor’ PR
  19.9.2 Using Kea Corpora
 19.10 Ontotext JapeC Compiler
 19.11 Annotation Merging Plugin
 19.12 Chinese Word Segmentation
 19.13 Copying Annotations between Documents
 19.14 OpenCalais Plugin
 19.15 LingPipe Plugin
  19.15.1 LingPipe Tokenizer PR
  19.15.2 LingPipe Sentence Splitter PR
  19.15.3 LingPipe POS Tagger PR
  19.15.4 LingPipe NER PR
  19.15.5 LingPipe Language Identifier PR
 19.16 OpenNLP Plugin
  19.16.1 Parameters common to all PRs
  19.16.2 OpenNLP PRs
  19.16.3 Training new models
 19.17 Inter Annotator Agreement
 19.18 Balanced Distance Metric Computation
 19.19 Schema Annotation Editor
Appendices
A Change Log
 A.1 Version 5.1 (December 2009)
  A.1.1 New Features
  A.1.2 JAPE improvements
  A.1.3 Other improvements and bug fixes
 A.2 Version 5.0 (May 2009)
  A.2.1 Major New Features
  A.2.2 Other New Features and Improvements
  A.2.3 Specific Bug Fixes
 A.3 Version 4.0 (July 2007)
  A.3.1 Major New Features
  A.3.2 Other New Features and Improvements
  A.3.3 Bug Fixes and Optimizations
 A.4 Version 3.1 (April 2006)
  A.4.1 Major New Features
  A.4.2 Other New Features and Improvements
  A.4.3 Bug Fixes
 A.5 January 2005
 A.6 December 2004
 A.7 September 2004
 A.8 Version 3 Beta 1 (August 2004)
 A.9 July 2004
 A.10 June 2004
 A.11 April 2004
 A.12 March 2004
 A.13 Version 2.2 – August 2003
 A.14 Version 2.1 – February 2003
 A.15 June 2002
B Version 5.1 Plugins Name Map
C Design Notes
 C.1 Patterns
  C.1.1 Components
  C.1.2 Model, view, controller
  C.1.3 Interfaces
 C.2 Exception Handling
D JAPE: Implementation
 D.1 Formal Description of the JAPE Grammar
 D.2 Relation to CPSL
 D.3 Initialisation of a JAPE Grammar
 D.4 Execution of JAPE Grammars
 D.5 Using a Different Java Compiler
E Ant Tasks for GATE
 E.1 Declaring the Tasks
 E.2 The packagegapp task - bundling an application with its dependencies
  E.2.1 Introduction
  E.2.2 Basic Usage
  E.2.3 Handling Non-Plugin Resources
  E.2.4 Streamlining your Plugins
  E.2.5 Bundling Extra Resources
 E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml
F Named-Entity State Machine Patterns
 F.1 Main.jape
 F.2 first.jape
 F.3 firstname.jape
 F.4 name.jape
  F.4.1 Person
  F.4.2 Location
  F.4.3 Organization
  F.4.4 Ambiguities
  F.4.5 Contextual information
 F.5 name_post.jape
 F.6 date_pre.jape
 F.7 date.jape
 F.8 reldate.jape
 F.9 number.jape
 F.10 address.jape
 F.11 url.jape
 F.12 identifier.jape
 F.13 jobtitle.jape
 F.14 final.jape
 F.15 unknown.jape
 F.16 name_context.jape
 F.17 org_context.jape
 F.18 loc_context.jape
 F.19 clean.jape
G Part-of-Speech Tags used in the Hepple Tagger
References

Part I
GATE Basics [#]

Chapter 1
Introduction [#]

Software documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing. (Anonymous.)

There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies; the other way is to make it so complicated that there are no obvious deficiencies. (C.A.R. Hoare)

A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. (The Structure and Interpretation of Computer Programs, H. Abelson, G. Sussman and J. Sussman, 1985.)

If you try to make something beautiful, it is often ugly. If you try to make something useful, it is often beautiful. (Oscar Wilde)1

GATE2 is an infrastructure for developing and deploying software components that process human language. It is nearly 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from multi-million research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents3.

GATE is open source free software; users can obtain free support from the user and developer community via GATE.ac.uk or on a commercial basis from our industrial partners. We are the biggest open source language processing project with a development team more than double the size of the largest comparable projects (many of which are integrated with GATE4). More than 5 million has been invested in GATE development5; our objective is to make sure that this continues to be money well spent for all GATE’s users.

GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE is:

We also develop:

For more information see the family pages.

One of our original motivations was to remove the necessity for solving common engineering problems before doing useful research, or re-engineering before deploying research results into applications. Core functions of GATE take care of the lion’s share of the engineering:

On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).

GATE version 1 was written in the mid-1990s; at the turn of the new millennium we completely rewrote the system in Java; version 5 was released in June 2009. We believe that GATE is the leading system of its type, but as scientists we have to advise you not to take our word for it; that’s why we’ve measured our software in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE, DUC and more; see Section 1.4 for details). We invite you to give it a try, to get involved with the GATE community, and to contribute to human language science, engineering and development.

This book describes how to use GATE to develop language processing components, test their performance and deploy them as parts of other applications. In the rest of this chapter:

Note: if you don’t see the component you need in this document, or if we mention a component that you can’t see in the software, contact gate-users@lists.sourceforge.net7 – various components are developed by our collaborators, who we will be happy to put you in contact with. (Often the process of getting a new component is as simple as typing the URL into GATE Developer; the system will do the rest.)

1.1 How to Use this Text [#]

The material presented in this book ranges from the conceptual (e.g. ‘what is software architecture?’) to practical instructions for programmers (e.g. how to deal with GATE exceptions) and linguists (e.g. how to write a pattern grammar). Furthermore, GATE’s highly extensible nature means that new functionality is constantly being added in the form of new plugins. Important functionality is as likely to be located in a plugin as it is to be integrated into the GATE core. This presents something of an organisational challenge. Our (no doubt imperfect) solution is to divide this book into three parts. Part I covers installation, using the GATE Developer GUI and using ANNIE, as well as providing some background and theory. We recommend the new user to begin with Part I. Part II covers the more advanced of the core GATE functionality; the GATE Embedded API and JAPE pattern language among other things. Part III provides a reference for the numerous plugins that have been created for GATE. Although ANNIE provides a good starting point, the user will soon wish to explore other resources, and so will need to consult this part of the text. We recommend that Part III be used as a reference, to be dipped into as necessary. In Part III, plugins are grouped into broad areas of functionality.

1.2 Context [#]

GATE can be thought of as a Software Architecture for Language Engineering [Cunningham 00].

‘Software Architecture’ is used rather loosely here to mean computer infrastructure for software development, including development environments and frameworks, as well as the more usual use of the term to denote a macro-level organisational structure for software systems [Shaw & Garlan 96].

Language Engineering (LE) may be defined as:

…the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice. [Cunningham 99a]

The relevant scientific results in this case are the outputs of Computational Linguistics, Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines, LE, as an engineering discipline, entails predictability, both of the process of constructing LE-based software and of the performance of that software after its completion and deployment in applications.

Some working definitions:

  1. Computational Linguistics (CL): science of language that uses computation as an investigative tool.
  2. Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.
  3. Language Engineering (LE): building NLP systems whose cost and outputs are measurable and predictable.
  4. Software Architecture: macro-level organisational principles for families of systems. In this context is also used as infrastructure.
  5. Software Architecture for Language Engineering (SALE): software infrastructure, architecture and development tools for applied CL, NLP and LE.

(Of course the practice of these fields is broader and more complex than these definitions.)

In the scientific endeavours of NLP and CL, GATE’s role is to support experimentation. In this context GATE’s significant features include support for automated measurement (see Chapter 10), providing a ‘level playing field’ where results can easily be repeated across different sites and environments, and reducing research overheads in various ways.

1.3 Overview [#]

1.3.1 Developing and Deploying Language Processing Facilities [#]

GATE as an architecture suggests that the elements of software systems that process natural language can usefully be broken down into various types of component, known as resources8. Components are reusable software chunks with well-defined interfaces, and are a popular architectural form, used in Sun’s Java Beans and Microsoft’s .Net, for example. GATE components are specialised types of Java Bean, and come in three flavours:

These definitions can be blurred in practice as necessary.

Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering. All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data. The JAR and XML files are made available to GATE by putting them on a web server, or simply placing them in the local file space. Section 1.3.2 introduces GATE’s built-in resource set.

When using GATE to develop language processing functionality for an application, the developer uses GATE Developer and GATE Embedded to construct resources of the three types. This may involve programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both. GATE Developer is used for visualisation of the data structures produced and consumed during processing, and for debugging, performance measurement and so on. For example, figure 1.1 is a screenshot of one of the visualisation tools.


PIC


Figure 1.1: One of GATE’s visual resources


GATE Developer is analogous to systems like Mathematica for Mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.

When an appropriate set of resources have been developed, they can then be embedded in the target client application using GATE Embedded. GATE Embedded is supplied as a series of JAR files.9 To embed GATE-based language processing facilities in an application, these JAR files are all that is needed, along with JAR files and XML configuration files for the various resources that make up the new facilities.

1.3.2 Built-In Components [#]

GATE includes resources for common LE data structures and algorithms, including documents, corpora and various annotation types, a set of language analysis components for Information Extraction and a range of data visualisation and editing components.

GATE supports documents in a variety of formats including XML, RTF, email, HTML, SGML and plain text. In all cases the format is analysed and converted into a single unified model of annotation. The annotation format is a modified form the TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format [Bird & Liberman 99], and uses the now standard mechanism of ‘stand-off markup’. GATE documents, corpora and annotations are stored in databases of various sorts, visualised via the development environment, and accessed at code level via the framework. See Chapter 5 for more details of corpora etc.

A family of Processing Resources for language analysis is included in the shape of ANNIE, A Nearly-New Information Extraction system. These components use finite state techniques to implement various tasks from tokenisation to semantic tagging or verb phrase chunking. All ANNIE components communicate exclusively via GATE’s document and annotation resources. See Chapter 6 for more details. Other CREOLE resources are described in Part III.

1.3.3 Additional Facilities [#]

Three other facilities in GATE deserve special mention:

And by version 4 it will make a mean cup of tea.

1.3.4 An Example [#]

This section gives a very brief example of a typical use of GATE to develop and deploy language processing capabilities in an application, and to generate quantitative results for scientific publication.

Let’s imagine that a developer called Fatima is building an email client11 for Cyberdyne Systems’ large corporate Intranet. In this application she would like to have a language processing system that automatically spots the names of people in the corporation and transforms them into mailto hyperlinks.

A little investigation shows that GATE’s existing components can be tailored to this purpose. Fatima starts up GATE Developer, and creates a new document containing some example emails. She then loads some processing resources that will do named-entity recognition (a tokeniser, gazetteer and semantic tagger), and creates an application to run these components on the document in sequence. Having processed the emails, she can see the results in one of several viewers for annotations.

The GATE components are a decent start, but they need to be altered to deal specially with people from Cyberdyne’s personnel database. Therefore Fatima creates new ‘cyber-’ versions of the gazetteer and semantic tagger resources, using the ‘bootstrap’ tool. This tool creates a directory structure on disk that has some Java stub code, a Makefile and an XML configuration file. After several hours struggling with badly written documentation, Fatima manages to compile the stubs and create a JAR file containing the new resources. She tells GATE Developer the URL of these files12, and the system then allows her to load them in the same way that she loaded the built-in resources earlier on.

Fatima then creates a second copy of the email document, and uses the annotation editing facilities to mark up the results that she would like to see her system producing. She saves this and the version that she ran GATE on into her serial datastore. From now on she can follow this routine:

  1. Run her application on the email test corpus.
  2. Check the performance of the system by running the ‘annotation diff’ tool to compare her manual results with the system’s results. This gives her both percentage accuracy figures and a graphical display of the differences between the machine and human outputs.
  3. Make edits to the code, pattern grammars or gazetteer lists in her resources, and recompile where necessary.
  4. Tell GATE Developer to re-initialise the resources.
  5. Go to 1.

To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that it regenerates itself from the local personnel data. She then alters the pattern grammar in the semantic tagger to prioritise recognition of names from that source. This latter job involves learning the JAPE language (see Chapter 8), but as this is based on regular expressions it isn’t too difficult.

Eventually the system is running nicely, and her accuracy is 93% (there are still some problem cases, e.g. when people use nicknames, but the performance is good enough for production use). Now Fatima stops using GATE Developer and works instead on embedding the new components in her email application using GATE Embedded. This application is written in Java, so embedding is very easy13: the GATE JAR files are added to the project CLASSPATH, the new components are placed on a web server, and with a little code to do initialisation, loading of components and so on, the job is finished in half a day – the code to talk to GATE takes up only around 150 lines of the eventual application, most of which is just copied from the example in the sheffield.examples.StandAloneAnnie class.

Because Fatima is worried about Cyberdyne’s unethical policy of developing Skynet to help the large corporates of the West strengthen their strangle-hold over the World, she wants to get a job as an academic instead (so that her conscience will only have to cope with the torture of students, as opposed to humanity). She takes the accuracy measures that she has attained for her system and writes a paper for the Journal of Nasturtium Logarithm Incitement describing the approach used and the results obtained. Because she used GATE for development, she can cite the repeatability of her experiments and offer access to example binary versions of her software by putting them on an external web server.

And everybody lived happily ever after.

1.4 Some Evaluations [#]

This section contains an incomplete list of publications describing systems that used GATE in competitive quantitative evaluation programmes. These programmes have had a significant impact on the language processing field and the widespread presence of GATE is some measure of the maturity of the system and of our understanding of its likely performance on diverse text processing tasks.

[Li et al. 07d]
describes the performance of an SVM-based learning system in the NTCIR-6 Patent Retrieval Task. The system achieved the best result on two of three measures used in the task evaluation, namely the R-Precision and F-measure. The system obtained close to the best result on the remaining measure (A-Precision).
[Saggion 07]
describes a cross-source coreference resolution system based on semantic clustering. It uses GATE for information extraction and the SUMMA system to create summaries and semantic representations of documents. One system configuration ranked 4th in the Web People Search 2007 evaluation.
[Saggion 06]
describes a cross-lingual summarization system which uses SUMMA components and the Arabic plugin available in GATE to produce summaries in English from a mixture of English and Arabic documents.
Open-Domain Question Answering:
The University of Sheffield has a long history of research into open-domain question answering. GATE has formed the basis of much of this research resulting in systems which have ranked highly during independent evaluations since 1999. The first successful question answering system developed at the University of Sheffield was evaluated as part of TREC 8 and used the LaSIE information extraction system (the forerunner of ANNIE) which was distributed with GATE [Humphreys et al. 99]. Further research was reported in [Scott & Gaizauskas. 00], [Greenwood et al. 02], [Gaizauskas et al. 03], [Gaizauskas et al. 04] and [Gaizauskas et al. 05]. In 2004 the system was ranked 9th out of 28 participating groups.
[Saggion 04]
describes techniques for answering definition questions. The system uses definition patterns manually implemented in GATE as well as learned JAPE patterns induced from a corpus. In 2004, the system was ranked 4th in the TREC/QA evaluations.
[Saggion & Gaizauskas 04b]
describes a multidocument summarization system implemented using summarization components compatible with GATE (the SUMMA system). The system was ranked 2nd in the Document Understanding Evaluation programmes.
[Maynard et al. 03e] and [Maynard et al. 03d]
describe participation in the TIDES surprise language program. ANNIE was adapted to Cebuano with four person days of effort, and achieved an F-measure of 77.5%. Unfortunately, ours was the only system participating!
[Maynard et al. 02b] and [Maynard et al. 03b]
describe results obtained on systems designed for the ACE task (Automatic Content Extraction). Although a comparison to other participating systems cannot be revealed due to the stipulations of ACE, results show 82%-86% precision and recall.
[Humphreys et al. 98]
describes the LaSIE-II system used in MUC-7.
[Gaizauskas et al. 95]
describes the LaSIE-II system used in MUC-6.

1.5 Changes in this Version [#]

This section logs changes in the latest version of GATE. Appendix A provides a complete change log.

1.5.1 Version 5.1 (December 2009) [#]

Version 5.1 is a major increment with lots of new features and integration of a number of important systems from 3rd parties (e.g. LingPipe, OpenNLP, OpenCalais, a revised UIMA connector). We’ve stuck with the 5 series (instead of jumping to 6.0) because the core remains stable and backwards compatible.

Other highlights include:

New Features

LingPipe Support

LingPipe is a suite of Java libraries for the linguistic analysis of human language. We have provided a plugin called ‘LingPipe’ with wrappers for some of the resources available in the LingPipe library. For more details, see the section 19.15.

OpenNLP Support

OpenNLP provides tools for sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference. The tools use Maximum Entropy modelling. We have provided a plugin called ‘OpenNLP’ with wrappers for some of the resources available in the OpenNLP Tools library. For more details, see section 19.16.

OpenCalais Support

We added a new PR called ‘OpenCalais PR’. This will process a document through the OpenCalais service, and add OpenCalais entity annotations to the document. For more details, see Section 19.14.

Ontology API

The ontology API (package gate.creole.ontology has been changed, the existing ontology implementation based on Sesame1 and OWLIM2 (package gate.creole.ontology.owlim) has been moved into the plugin Ontology_OWLIM2. An upgraded implementation based on Sesame2 and OWLIM3 that also provides a number of new features has been added as plugin Ontology. See Section 14.12 for a detailed description of all changes.

Benchmarking Improvements

A number of improvements to the benchmarking support in GATE. JAPE transducers now log the time spent in individual phases of a multi-phase grammar and by individual rules within each phase. Other PRs that use JAPE grammars internally (the pronominal coreferencer, English tokeniser) log the time taken by their internal transducers. A reporting tool, called ‘Profiling Reports’ under the ‘Tools’ menu makes summary information easily available. For more details, see chapter 11.

GUI improvements

To deal with quality assurance of annotations, one component has been updated and two new components have been added. The annotation diff tool has a new mode to copy annotations to a consensus set, see section 10.2.1. An annotation stack view has been added in the document editor and it allows to copy annotations to a consensus set, see section 3.4.3. A corpus view has been added for all corpus to get statistics like precision, recall and F-measure, see section 10.3.

An annotation stack view has been added in the document editor to make easier to see overlapping annotations, see section 3.4.3.

ABNER Support

ABNER is A Biomedical Named Entity Recogniser, for finding entities such as genes in text. We have provided a plugin called ‘AbnerTagger’ with a wrapper for ABNER. For more details, see section 17.6.

Generic Tagger Support

A new plugin has been added to provide an easy route to integrate taggers with GATE. The Tagger_Framework plugin provides examples of incorporating a number of external taggers which should serve as a starting point for using other taggers. See Section 17.4 for more details.

Section-by-Section Processing

We have added a new PR called ‘Segment Processing PR’. As the name suggests this PR allows processing individual segments of a document independently of one other. For more details, please look at the section 16.2.8.

Application Composition

The gate.Controller implementations provided with the main GATE distribution now also implement the gate.ProcessingResource interface. This means that an application can now contain another application as one of its components.

Groovy Support

Groovy is a dynamic programming language based on Java. You can now use it as a scripting language for GATE, via the Groovy Console. For more details, see Section 7.15.

JAPE improvements

GATE now produces a warning when any Java right-hand-sides in JAPE rules make use of the deprecated annotations parameter. All bundled JAPE grammars have been updated to use the replacement inputAS and outputAS parameters as appropriate.

The new Imports: statement at the beginning of a JAPE grammar file can now be used to make additional Java import statements available to the Java RHS code, see 8.6.5.

The JAPE debugger has been removed. Debugging of JAPE has been made easier as stack traces now refer to the JAPE source file and line numbers instead of the generated Java source code.

The Montreal Transducer has been made obsolete.

Other improvements and bug fixes

Plugin names have been rationalised. Mappings exist so that existing applications will continue to work, but the new names should be used in the future. Plugin name mappings are given in Appendix B. Also, the Segmenter_Chinese plugin (used to be known as chineseSegmenter plugin) is now part of the Lang_Chinese plugin.

The User Guide has been amalgamated with the Programmer’s Guide; all material can now be found in the User Guide. The ‘How-To’ chapter has been converted into separate chapters for installation, GATE Developer and GATE Embedded. Other material has been relocated to the appropriate specialist chapter.

Made Mac OS launcher 64-bit compatible. See section 2.2.1 for details.

The UIMA integration layer (Chapter 18) has been upgraded to work with Apache UIMA 2.2.2.

Oracle and PostGreSQL are no longer supported.

The MIAKT Natural Language Generation plugin has been removed.

The Minorthird plugin has been removed. Minorthird has changed significantly since this plugin was written. We will consider writing an up-to-date Minorthird plugin in the future.

A new gazetteer, Large KB Gazetteer (in the plugin ‘Gazetteer_LKB’) has been added, see Section 13.9 for details.

gate.creole.tokeniser.chinesetokeniser.ChineseTokeniser and related resources under the plugins/ANNIE/tokeniser/chinesetokeniser folder have been removed. Please refer to the Lang_Chinese plugin for resources related to the Chinese language in GATE.

Added an isInitialised() method to gate.Gate().

Added a parameter to the chemistry tagger PR (section 17.5) to allow it to operate on annotation sets other than the default one.

Plus many more smaller bugfixes...

1.6 Further Reading [#]

Lots of documentation lives on the GATE web server, including:

For more details about Sheffield University’s work in human language processing see the NLP group pages or A Definition and Short History of Language Engineering ([Cunningham 99a]). For more details about Information Extraction see IE, a User Guide or the GATE IE pages.

A list of publications on GATE and projects that use it (some of which are available on-line):

2009

[Bontcheva et al. 09]
is the ‘Human Language Technologies’ chapter of ‘Semantic Knowledge Management’ (John Davies, Marko Grobelnik and Dunja Mladeni eds.)
[Damljanovic et al. 09]
- to appear.
[Laclavik & Maynard 09]
reviews the current state of the art in email processing and communication research, focusing on the roles played by email in information management, and commercial and research efforts to integrate a semantic-based approach to email.
[Li et al. 09]
investigates two techniques for making SVMs more suitable for language learning tasks. Firstly, an SVM with uneven margins (SVMUM) is proposed to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks.

2008

[Agatonovic et al. 08]
presents our approach to automatic patent enrichment, tested in large-scale, parallel experiments on USPTO and EPO documents.
[Damljanovic et al. 08]
presents Question-based Interface to Ontologies (QuestIO) - a tool for querying ontologies using unconstrained language-based queries.
[Damljanovic & Bontcheva 08]
presents a semantic-based prototype that is made for an open-source software engineering project with the goal of exploring methods for assisting open-source developers and software users to learn and maintain the system without major effort.
[Della Valle et al. 08]
presents ServiceFinder.
[Li & Cunningham 08]
describes our SVM-based system and several techniques we developed successfully to adapt SVM for the specific features of the F-term patent classification task.
[Li & Bontcheva 08]
reviews the recent developments in applying geometric and quantum mechanics methods for information retrieval and natural language processing.
[Maynard 08]
investigates the state of the art in automatic textual annotation tools, and examines the extent to which they are ready for use in the real world.
[Maynard et al. 08a]
discusses methods of measuring the performance of ontology-based information extraction systems, focusing particularly on the Balanced Distance Metric (BDM), a new metric we have proposed which aims to take into account the more flexible nature of ontologically-based applications.
[Maynard et al. 08b]
investigates NLP techniques for ontology population, using a combination of rule-based approaches and machine learning.
[Tablan et al. 08]
presents the QuestIO system a natural language interface for accessing structured information, that is domain independent and easy to use without training.

2007

[Funk et al. 07a]
describes an ontologically based approach to multi-source, multilingual information extraction.
[Funk et al. 07b]
presents a controlled language for ontology editing and a software implementation, based partly on standard NLP tools, for processing that language and manipulating an ontology.
[Maynard et al. 07a]
proposes a methodology to capture (1) the evolution of metadata induced by changes to the ontologies, and (2) the evolution of the ontology induced by changes to the underlying metadata.
[Maynard et al. 07b]
describes the development of a system for content mining using domain ontologies, which enables the extraction of relevant information to be fed into models for analysis of financial and operational risk and other business intelligence applications such as company intelligence, by means of the XBRL standard.
[Saggion 07]
describes experiments for the cross-document coreference task in SemEval 2007. Our cross-document coreference system uses an in-house agglomerative clustering implementation to group documents referring to the same entity.
[Saggion et al. 07]
describes the application of ontology-based extraction and merging in the context of a practical e-business application for the EU MUSING Project where the goal is to gather international company intelligence and country/region information.
[Li et al. 07a]
introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process, by taking into account the relations between concepts.
[Li et al. 07b]
proposes some new evaluation measures based on relations among classification labels, which can be seen as the label relation sensitive version of important measures such as averaged precision and F-measure, and presents the results of applying the new evaluation measures to all submitted runs for the NTCIR-6 F-term patent classification task.
[Li et al. 07c]
describes the algorithms and linguistic features used in our participating system for the opinion analysis pilot task at NTCIR-6.
[Li et al. 07d]
describes our SVM-based system and the techniques we used to adapt the approach for the specifics of the F-term patent classification subtask at NTCIR-6 Patent Retrieval Task.
[Li & Shawe-Taylor 07]
studies Japanese-English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method of correlating linear relationships between two variables in kernel defined feature spaces.

2006

[Aswani et al. 06]
(Proceedings of the 5th International Semantic Web Conference (ISWC2006)) In this paper the problem of disambiguating author instances in ontology is addressed. We describe a web-based approach that uses various features such as publication titles, abstract, initials and co-authorship information.
[Bontcheva et al. 06a]
‘Semantic Annotation and Human Language Technology’, contribution to ‘Semantic Web Technology: Trends and Research’ (Davies, Studer and Warren, eds.)
[Bontcheva et al. 06b]
‘Semantic Information Access’, contribution to ‘Semantic Web Technology: Trends and Research’ (Davies, Studer and Warren, eds.)
[Bontcheva & Sabou 06]
presents an ontology learning approach that 1) exploits a range of information sources associated with software projects and 2) relies on techniques that are portable across application domains.
[Davis et al. 06]
describes work in progress concerning the application of Controlled Language Information Extraction - CLIE to a Personal Semantic Wiki - Semper- Wiki, the goal being to permit users who have no specialist knowledge in ontology tools or languages to semi-automatically annotate their respective personal Wiki pages.
[Li & Shawe-Taylor 06]
studies a machine learning algorithm based on KCCA for cross-language information retrieval. The algorithm is applied to Japanese-English cross-language information retrieval.
[Maynard et al. 06]
discusses existing evaluation metrics, and proposes a new method for evaluating the ontology population task, which is general enough to be used in a variety of situation, yet more precise than many current metrics.
[Tablan et al. 06a]
describes an approach that allows users to create and edit ontologies simply by using a restricted version of the English language. The controlled language described is based on an open vocabulary and a restricted set of grammatical constructs.
[Tablan et al. 06b]
describes the creation of linguistic analysis and corpus search tools for Sumerian, as part of the development of the ETCSL.
[Wang et al. 06]
proposes an SVM based approach to hierarchical relation extraction, using features derived automatically from a number of GATE-based open-source language processing tools.

2005

[Aswani et al. 05]
(Proceedings of Fifth International Conference on Recent Advances in Natural Language Processing (RANLP2005)) It is a full-featured annotation indexing and search engine, developed as a part of the GATE. It is powered with Apache Lucene technology and indexes a variety of documents supported by the GATE.
[Bontcheva 05]
presents the ONTOSUM system which uses Natural Language Generation (NLG) techniques to produce textual summaries from Semantic Web ontologies.
[Cunningham 05]
is an overview of the field of Information Extraction for the 2nd Edition of the Encyclopaedia of Language and Linguistics.
[Cunningham & Bontcheva 05]
is an overview of the field of Software Architecture for Language Engineering for the 2nd Edition of the Encyclopaedia of Language and Linguistics.
[Dowman et al. 05a]
(Euro Interactive Television Conference Paper) A system which can use material from the Internet to augment television news broadcasts.
[Dowman et al. 05b]
(World Wide Web Conference Paper) The Web is used to assist the annotation and indexing of broadcast news.
[Dowman et al. 05c]
(Second European Semantic Web Conference Paper) A system that semantically annotates television news broadcasts using news websites as a resource to aid in the annotation process.
[Li et al. 05a]
(Proceedings of Sheffield Machine Learning Workshop) describe an SVM based IE system which uses the SVM with uneven margins as learning component and the GATE as NLP processing module.
[Li et al. 05b]
(Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)) uses the uneven margins versions of two popular learning algorithms SVM and Perceptron for IE to deal with the imbalanced classification problems derived from IE.
[Li et al. 05c]
(Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05)) a system for Chinese word segmentation based on Perceptron learning, a simple, fast and effective learning algorithm.
[Polajnar et al. 05]
(University of Sheffield-Research Memorandum CS-05-10) User-Friendly Ontology Authoring Using a Controlled Language.
[Saggion & Gaizauskas 05]
describes experiments on content selection for producing biographical summaries from multiple documents.
[Ursu et al. 05]
(Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005))Digital Media Preservation and Access through Semantically Enhanced Web-Annotation.
[Wang et al. 05]
(Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005)) Extracting a Domain Ontology from Linguistic Resource Based on Relatedness Measurements.

2004

[Bontcheva 04]
(LREC 2004) describes lexical and ontological resources in GATE used for Natural Language Generation.
[Bontcheva et al. 04]
(JNLE) discusses developments in GATE in the early naughties.
[Cunningham & Scott 04a]
(JNLE) is the introduction to the above collection.
[Cunningham & Scott 04b]
(JNLE) is a collection of papers covering many important areas of Software Architecture for Language Engineering.
[Dimitrov et al. 04]
(Anaphora Processing) gives a lightweight method for named entity coreference resolution.
[Li et al. 04]
(Machine Learning Workshop 2004) describes an SVM based learning algorithm for IE using GATE.
[Maynard et al. 04a]
(LREC 2004) presents algorithms for the automatic induction of gazetteer lists from multi-language data.
[Maynard et al. 04b]
(ESWS 2004) discusses ontology-based IE in the hTechSight project.
[Maynard et al. 04c]
(AIMSA 2004) presents automatic creation and monitoring of semantic metadata in a dynamic knowledge portal.
[Saggion & Gaizauskas 04a]
describes an approach to mining definitions.
[Saggion & Gaizauskas 04b]
describes a sentence extraction system that produces two sorts of multi-document summaries; a general-purpose summary of a cluster of related documents and an entity-based summary of documents related to a particular person.
[Wood et al. 04]
(NLDB 2004) looks at ontology-based IE from parallel texts.

2003

[Bontcheva et al. 03]
(NLPXML-2003) looks at GATE for the semantic web.
[Cunningham et al. 03]
(Corpus Linguistics 2003) describes GATE as a tool for collaborative corpus annotation.
[Kiryakov 03]
(Technical Report) discusses semantic web technology in the context of multimedia indexing and search.
[Manov et al. 03]
(HLT-NAACL 2003) describes experiments with geographic knowledge for IE.
[Maynard et al. 03a]
(EACL 2003) looks at the distinction between information and content extraction.
[Maynard et al. 03c]
(Recent Advances in Natural Language Processing 2003) looks at semantics and named-entity extraction.
[Maynard et al. 03e]
(ACL Workshop 2003) describes NE extraction without training data on a language you don’t speak (!).
[Saggion et al. 03a]
(EACL 2003) discusses robust, generic and query-based summarisation.
[Saggion et al. 03b]
(Data and Knowledge Engineering) discusses multimedia indexing and search from multisource multilingual data.
[Saggion et al. 03c]
(EACL 2003) discusses event co-reference in the MUMIS project.
[Tablan et al. 03]
(HLT-NAACL 2003) presents the OLLIE on-line learning for IE system.
[Wood et al. 03]
(Recent Advances in Natural Language Processing 2003) discusses using parallel texts to improve IE recall.

2002

[Baker et al. 02]
(LREC 2002) report results from the EMILLE Indic languages corpus collection and processing project.
[Bontcheva et al. 02a]
(ACl 2002 Workshop) describes how GATE can be used as an environment for teaching NLP, with examples of and ideas for future student projects developed within GATE.
[Bontcheva et al. 02b]
(NLIS 2002) discusses how GATE can be used to create HLT modules for use in information systems.
[Bontcheva et al. 02c], [Dimitrov 02a] and [Dimitrov 02b]
(TALN 2002, DAARC 2002, MSc thesis) describe the shallow named entity coreference modules in GATE: the orthomatcher which resolves pronominal coreference, and the pronoun resolution module.
[Cunningham 02]
(Computers and the Humanities) describes the philosophy and motivation behind the system, describes GATE version 1 and how well it lived up to its design brief.
[Cunningham et al. 02]
(ACL 2002) describes the GATE framework and graphical development environment as a tool for robust NLP applications.
[Dimitrov 02a, Dimitrov et al. 02]
(DAARC 2002, MSc thesis) discuss lightweight coreference methods.
[Lal 02]
(Master Thesis) looks at text summarisation using GATE.
[Lal & Ruger 02]
(ACL 2002) looks at text summarisation using GATE.
[Maynard et al. 02a]
(ACL 2002 Summarisation Workshop) describes using GATE to build a portable IE-based summarisation system in the domain of health and safety.
[Maynard et al. 02c]
(AIMSA 2002) describes the adaptation of the core ANNIE modules within GATE to the ACE (Automatic Content Extraction) tasks.
[Maynard et al. 02d]
(Nordic Language Technology) describes various Named Entity recognition projects developed at Sheffield using GATE.
[Maynard et al. 02e]
(JNLE) describes robustness and predictability in LE systems, and presents GATE as an example of a system which contributes to robustness and to low overhead systems development.
[Pastra et al. 02]
(LREC 2002) discusses the feasibility of grammar reuse in applications using ANNIE modules.
[Saggion et al. 02b] and [Saggion et al. 02a]
(LREC 2002, SPLPT 2002) describes how ANNIE modules have been adapted to extract information for indexing multimedia material.
[Tablan et al. 02]
(LREC 2002) describes GATE’s enhanced Unicode support.

Older than 2002

[Maynard et al. 01]
(RANLP 2001) discusses a project using ANNIE for named-entity recognition across wide varieties of text type and genre.
[Bontcheva et al. 00] and [Brugman et al. 99]
(COLING 2000, technical report) describe a prototype of GATE version 2 that integrated with the EUDICO multimedia markup tool from the Max Planck Institute.
[Cunningham 00]
(PhD thesis) defines the field of Software Architecture for Language Engineering, reviews previous work in the area, presents a requirements analysis for such systems (which was used as the basis for designing GATE versions 2 and 3), and evaluates the strengths and weaknesses of GATE version 1.
[Cunningham et al. 00a], [Cunningham et al. 98a] and [Peters et al. 98]
(OntoLex 2000, LREC 1998) presents GATE’s model of Language Resources, their access and distribution.
[Cunningham et al. 00b]
(LREC 2000) taxonomises Language Engineering components and discusses the requirements analysis for GATE version 2.
[Cunningham et al. 00c] and [Cunningham et al. 99]
(COLING 2000, AISB 1999) summarise experiences with GATE version 1.
[Cunningham et al. 00d] and [Cunningham 99b]
(technical reports) document early versions of JAPE (superseded by the present document).
[Gambäck & Olsson 00]
(LREC 2000) discusses experiences in the Svensk project, which used GATE version 1 to develop a reusable toolbox of Swedish language processing components.
[Maynard et al. 00]
(technical report) surveys users of GATE up to mid-2000.
[McEnery et al. 00]
(Vivek) presents the EMILLE project in the context of which GATE’s Unicode support for Indic languages has been developed.
[Cunningham 99a]
(JNLE) reviewed and synthesised definitions of Language Engineering.
[Stevenson et al. 98] and [Cunningham et al. 98b]
(ECAI 1998, NeMLaP 1998) report work on implementing a word sense tagger in GATE version 1.
[Cunningham et al. 97b]
(ANLP 1997) presents motivation for GATE and GATE-like infrastructural systems for Language Engineering.
[Cunningham et al. 96a]
(manual) was the guide to developing CREOLE components for GATE version 1.
[Cunningham et al. 96b]
(TIPSTER) discusses a selection of projects in Sheffield using GATE version 1 and the TIPSTER architecture it implemented.
[Cunningham et al. 96c, Cunningham et al. 96d, Cunningham et al. 95]
(COLING 1996, AISB Workshop 1996, technical report) report early work on GATE version 1.
[Gaizauskas et al. 96a]
(manual) was the user guide for GATE version 1.
[Gaizauskas et al. 96b, Cunningham et al. 97a, Cunningham et al. 96e]
(ICTAI 1996, TIPSTER 1997, NeMLaP 1996) report work on GATE version 1.
[Humphreys et al. 96]
(manual) describes the language processing components distributed with GATE version 1.
[Cunningham 94, Cunningham et al. 94]
(NeMLaP 1994, technical report) argue that software engineering issues such as reuse, and framework construction, are important for language processing R&D.

Chapter 2
Installing and Running GATE [#]

2.1 Downloading GATE [#]

To download GATE point your web browser at http://gate.ac.uk/download/.

2.2 Installing and Running GATE [#]

GATE will run anywhere that supports Java 5 or later, including Solaris, Linux, Mac OS X and Windows platforms. We don’t run tests on other platforms, but have had reports of successful installs elsewhere.

2.2.1 The Easy Way [#]

The easy way to install is to use one of the platform-specific installers (created using the excellent IzPack). Download a ‘platform-specific installer’ and follow the instructions it gives you. Once the installation is complete, you can start GATE Developer using gate.exe (Windows) or GATE.app (Mac) in the top-level installation directory, or gate.sh in the bin directory (other platforms).

Note for Mac users: on 64-bit-capable systems, GATE.app will run as a 64-bit application. It will use the first listed 64-bit JVM in your Java Preferences, even if your highest priority JVM is a 32-bit one. Thus if you want to run using Java 5 rather than 6 you must ensure that “J2SE 5.0 64-bit” is listed ahead of “Java SE 6 64-bit”.

2.2.2 The Hard Way (1) [#]

Download the Java-only release package or the binary build snapshot, and follow the instructions below.

Prerequisites:

Using the binary distribution:

The Ant scripts that start GATE Developer (ant.bat or ant) require you to set the JAVA_HOME environment variable to point to the top level directory of your JAVA installation. The value of GATE_CONFIG is passed to the system by the scripts using either a -i command-line option, or the Java property gate.config.

2.2.3 The Hard Way (2): Subversion [#]

The GATE code is maintained in a Subversion repository. You can use a Subversion client to check out the source code – the most up-to-date version of GATE is the trunk:
svn checkout https://gate.svn.sourceforge.net/svnroot/gate/gate/trunk gate

Once you have checked out the code you can build GATE using Ant (see Section 2.5)

You can browse the complete Subversion repository online at http://gate.svn.sourceforge.net/gate.

2.3 Using System Properties with GATE [#]

During initialisation, GATE reads several Java system properties in order to decide where to find its configuration files.

Here is a list of the properties used, their default values and their meanings:

gate.home
sets the location of the GATE install directory. This should point to the top level directory of your GATE installation. This is the only property that is required. If this is not set, the system will display an error message and them it will attempt to guess the correct value.
gate.plugins.home
points to the location of the directory containing installed plugins (a.k.a. CREOLE directories). If this is not set then the default value of {gate.home}/plugins is used.
gate.site.config
points to the location of the configuration file containing the site-wide options. If not set this will default to {gate.home}/gate.xml. The site configuration file must exist!
gate.user.config
points to the file containing the user’s options. If not specified, or if the specified file does not exist at startup time, the default value of gate.xml (.gate.xml on Unix platforms) in the user’s home directory is used.
gate.user.session
points to the file containing the user’s saved session. If not specified, the default value of gate.session (.gate.session on Unix) in the user’s home directory is used. When starting up GATE Developer, the session is reloaded from this file if it exists, and when exiting GATE Developer the session is saved to this file (unless the user has disabled ‘save session on exit’ in the configuration dialog). The session is not used when using GATE Embedded.
load.plugin.path
is a path-like structure, i.e. a list of URLs separated by ‘;’. All directories listed here will be loaded as CREOLE plugins during initialisation. This has similar functionality with the the -d command line option.
gate.builtin.creole.dir
is a URL pointing to the location of GATE’s built-in CREOLE directory. This is the location of the creole.xml file that defines the fundamental GATE resource types, such as documents, document format handlers, controllers and the basic visual resources that make up GATE. The default points to a location inside gate.jar and should not generally need to be overridden.

When using GATE Embedded, you can set the values for these properties before you call Gate.init(). Alternatively, you can set the values programmatically using the static methods setGateHome(), setPluginsHome(), setSiteConfigFile(), etc. before calling Gate.init(). See the Javadoc documentation for details. If you want to set these values from the command line you can use the following syntax for setting gate.home for example:

java -Dgate.home=/my/new/gate/home/directory -cp... gate.Main

When running GATE Developer, you can set the properties by creating a file build.properties in the top level GATE directory. In this file, any system properties which are prefixed with ‘run.’ will be passed to GATE. For example, to set an alternative user config file, put the following line in build.properties1:

run.gate.user.config=${user.home}/alternative-gate.xml

This facility is not limited to the GATE-specific properties listed above, for example the following line changes the default temporary directory for GATE (note the use of forward slashes, even on Windows platforms):

run.java.io.tmpdir=d:/bigtmp

2.4 Configuring GATE [#]

When GATE Developer is started, or when Gate.init() is called from GATE Embedded, GATE loads various sorts of configuration data stored as XML in files generally called something like gate.xml or .gate.xml. This data holds information such as:

This type of data is stored at two levels (in order from general to specific):

Where configuration data appears on several different levels, the more specific ones overwrite the more general. This means that you can set defaults for all GATE users on your system, for example, and allow individual users to override those defaults without interfering with others.

Configuration data can be set from the GATE Developer GUI via the ‘Options’ menu then ‘Configuration’. The user can change the appearance of the GUI in the ‘Appearance’ tab, which includes the options of font and the ‘look and feel’. The ‘Advanced’ tab enables the user to include annotation features when saving the document and preserving its format, to save the selected Options automatically on exit, and to save the session automatically on exit. The ‘Input Methods’ submenu from the ‘Options’ menu enables the user to change the default language for input. These options are all stored in the user’s .gate.xml file.

When using GATE Embedded, you can also set the site config location using Gate.setSiteConfigFile(File) prior to calling Gate.init().

2.5 Building GATE [#]

Note that you don’t need to build GATE unless you’re doing development on the system itself.

Prerequisites:

GATE now includes a copy of the ANT build tool which can be accessed through the scripts included in the bin directory (use ant.bat for Windows 98 or ME, ant.cmd for Windows NT, 2000 or XP, and ant.sh for Unix platforms).

To build gate, cd to gate and:

  1. Type:
    bin/ant
  2. [optional] To test the system:
    bin/ant test
  3. [optional] To make the Javadoc documentation:
    bin/ant doc
  4. You can also run GATE Developer using Ant, by typing:
    bin/ant run
  5. To see a full list of options type: bin/ant help

(The details of the build process are all specified by the build.xml file in the gate directory.)

You can also use a development environment like Borland JBuilder (click on the gate.jpx file), but note that it’s still advisable to use ant to generate documentation, the jar file and so on. Also note that the run configurations have the location of a gate.xml site configuration file hard-coded into them, so you may need to change these for your site.

2.5.1 Using GATE with Maven or JPF [#]

This section is based on contributions by Georg Öttl and William Oberman.

To use GATE with Maven you need a definition of the dependencies in POM format. There’s an example POM here.

To use GATE with JPF (a Java plugin framework) you need a plugin definition like this one.

2.6 Uninstalling GATE [#]

If you have used the installer, run:

java -jar uninstaller.jar

or just delete the whole of the installation directory (the one containing bin, lib, Uninstaller, etc.). The installer doesn’t install anything outside this directory, but for completeness you might also want to delete the settings files GATE creates in your home directory (.gate.xml and .gate.session).

2.7 Troubleshooting [#]

2.7.1 I don’t see the Java console messages under Windows [#]

Note that the gate.bat script uses javaw.exe to run GATE which means that you will see no console for the java process. If you have problems starting GATE and you would like to be able to see the console to check for messages then you should edit the gate.bat script and replace javaw.exe with java.exe in the definition of the JAVA environment variable.

2.7.2 When I execute GATE, nothing happens [#]

You might get some clues if you start GATE from the command line, using:

bin/ant -Druntime.spawn=false run

which will allow you to see all error messages GATE generates.

2.7.3 On Ubuntu, GATE is very slow or doesn’t start [#]

GATE and many other Java applications are known not to work with GCJ, the open-source Java SDK or others non SUN Java SDK.

Make sure you have the official version of Java installed. Provided by Sun, the package is named ‘sun-java6-jdk’ in Synaptic. GATE also works with Java version 5 so ‘sun-java5-jdk’.

To install it, run in a terminal:

sudo apt-get install sun-java6-jdk

Make sure that your default Java version is the one from SUN. You can do this by running:

sudo update-java-alternatives -l

This will list the installed Java VMs. You should see ‘java-6-sun’ as one of the options.

Then you should run :

sudo update-java-alternatives -s java-6-sun

to set the ‘java-6-sun’ as your default.

Finally, try GATE again.

2.7.4 How to use GATE on a 64 bit system? [#]

32-bit vs. 64-bit is a matter of the JVM rather than the build of GATE -

For example, on Mac OS X, either use Applications/Utilities/Java Preferences and put one of the 64-bit options at the top of the list, or run GATE from the terminal using Java 1.6.0 (which is 64-bit only on Mac OS):

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home  
 
bin/ant run

2.7.5 I got the error: Could not reserve enough space for object heap [#]

GATE doesn’t use the JAVA_OPTS variable. The default memory allocations are defined in the gate/build.xml file but you can override them by creating a file called build.properties in the same directory containing

runtime.start.memory=256m  
runtime.max.memory=1048m

2.7.6 I got the error: log4j:WARN No appenders could be found for logger... [#]

You need to copy the ‘gate/bin/log4j.properties’ file to the directory from which you execute your project.

2.7.7 Text is incorrectly refreshed after scrolling and become unreadable [#]

Change the look and feel used in GATE with menu ‘Options’ then ‘Configuration’. Restart GATE and try again. We use mainly ‘Metal’ and ‘Nimbus’ without problem.

Change the video driver you use.

Update Java.

Chapter 3
Using GATE Developer [#]

‘The law of evolution is that the strongest survives!’

‘Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. …There is no strength to be gained from hurting one another. Only weakness.’

The Dispossessed [p.183], Ursula K. le Guin, 1974.

This chapter introduces GATE Developer, which is the GATE graphical user interface. It is analogous to systems like Mathematica for mathematicians, or Eclipse for Java programmers, providing a convenient graphical environment for research and development of language processing software. As well as being a powerful research tool in its own right, it is also very useful in conjunction with GATE Embedded (the GATE API by which GATE functionality can be included in your own applications); for example, GATE Developer can be used to create applications that can then be embedded via the API. This chapter describes how to complete common tasks using GATE Developer. It is intended to provide a good entry point to GATE functionality, and so explanations are given assuming only basic knowledge of GATE. However, probably the best way to learn how to use GATE Developer is to use this chapter in conjunction with the demonstrations and tutorials movies. There are specific links to them throughout the chapter.

The basic business of GATE is annotating documents, and all the functionality we will introduce relates to that. Core concepts are;

What is considered to be the end result of the process varies depending on the task, but for the purposes of this chapter, output takes the form of the annotated document/corpus. Researchers might be more interested in figures demonstrating how successfully their application compares to a ‘gold standard’ annotation set; Chapter 10 in Part II will cover ways of comparing annotation sets to each other and obtaining measures such as F1. Implementers might be more interested in using the annotations programmatically; Chapter 7, also in Part II, talks about working with annotations from GATE Embedded. For the purposes of this chapter, however, we will focus only on creating the annotated documents themselves, and creating GATE applications for future use.

GATE includes a complete information extraction system that you are free to use, called ANNIE (a Nearly-New Information Extraction System). Many users find this is a good starting point for their own application, and so we will cover it in this chapter. Chapter 6 talks in a lot more detail about the inner workings of ANNIE, but we aim to get you started using ANNIE from inside of GATE Developer in this chapter.

We start the chapter with an exploration of the GATE Developer GUI, in Section 3.1. We describe how to create documents (Section 3.2) and corpora (Section 3.3). We talk about viewing and manually creating annotations (Section 3.4).

We then talk about loading the plugins that contain the processing resources you will use to construct your application, in Section 3.5. We then talk about instantiating processing resources (Section 3.6). Section 3.7 covers applications, including using ANNIE (Section 3.7.2). Saving applications and language resources (documents and corpora) is covered in Section 3.8. We conclude with a few assorted topics that might be useful to the GATE Developer user, in Section 3.10.

3.1 The GATE Developer Main Window [#]


PIC


Figure 3.1: Main Window of GATE Developer


Figure 3.1 shows the main window of GATE Developer, as you will see it when you first run it. There are five main areas:

  1. at the top, the menus bar and tools bar with menus ‘File’, ‘Options’, ‘Tools’, ‘Help’ and icons for the most frequently used actions;
  2. on the left side, a tree starting from ‘GATE’ and containing ‘Applications’, ‘Language Resources’ etc. – this is the resources tree;
  3. in the bottom left corner, a rectangle, which is the small resource viewer;
  4. in the center, containing tabs with ‘Messages’ or the name of a resource from the resources tree, the main resource viewer;
  5. at the bottom, the messages bar.

The menu and the messages bar do the usual things. Longer messages are displayed in the messages tab in the main resource viewer area.

The resource tree and resource viewer areas work together to allow the system to display diverse resources in various ways. The many resources integrated with GATE can have either a small view, a large view, or both.

At any time, the main viewer can also be used to display other information, such as messages, by clicking on the appropriate tab at the top of the main window. If an error occurs in processing, the messages tab will flash red, and an additional popup error message may also occur.

In the options dialogue from the Options menu you can choose if you want to link the selection in the resources tree and the selected main view.

3.2 Loading and Viewing Documents [#]


PIC


Figure 3.2: Making a New Document


If you right-click on ‘Language Resources’ in the resources pane, select “New’ then ‘GATE Document’, the window ‘Parameters for the new GATE Document’ will appear as shown in figure 3.2. Here, you can specify the GATE document to be created. Required parameters are indicated with a tick. The name of the document will be created for you if you do not specify it. Enter the URL of your document or use the file browser to indicate the file you wish to use for your document source. For example, you might use ‘http://www.gate.ac.uk’, or browse to a text or XML file you have on disk. Click on ‘OK’ and a GATE document will be created from the source you specified.

See also the movie for creating documents.


PIC


Figure 3.3: The Document Editor


The document editor is contained in the central tabbed pane in GATE Developer. Double-click on your document in the resources pane to view the document editor. The document editor consists of a top panel with buttons and icons that control the display of different views and the search box. Initially, you will see just the text of your document, as shown in figure 3.3. Click on ‘Annotation Sets’ and ‘Annotations List’ to view the annotation sets to the right and the annotations list at the bottom. You will see a view similar to figure 3.4. In place of the annotations list, you can also choose to see the annotations stack. In place of the annotation sets, you can also choose to view the co-reference editor. More information about this functionality is given in Section 3.4.


PIC


Figure 3.4: The Document Editor with Annotation Sets and Annotations List


Text in a loaded document can be edited in the document viewer. The usual platform specific cut, copy and paste keyboard shortcuts should also work, depending on your operating system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a magnifying glass, at the top of the document editor is for searching in the document. To prevent the new annotation windows popping up when a piece of text is selected, hold down the CTRL key. Alternatively, you can hide the annotation sets view by clicking on its button at the top of the document view; this will also cause the highlighted portions of the text to become un-highlighted.

You can set the document editor to be read-only in the options dialogue from the ‘Options’ menu. If enabled, you won’t be able to edit the text but you will still be able to edit annotations.

Another options is to choose if the insertion when editing text should be before or after the caret.

See also Section 16.2.2 for the compound document editor.

3.3 Creating and Viewing Corpora [#]

You can create a new corpus in a similar manner to creating a new document; simply right-click on ‘Language Resources’ in the resources pane, select ‘New’ then ‘GATE corpus’. A brief dialogue box will appear in which you can optionally give a name for your corpus (if you leave this blank, a corpus name will be created for you) and optionally add documents to the corpus from those already loaded into GATE.

There are three ways of adding documents to a corpus:

  1. When creating the corpus, clicking on the icon next to the “documentsList” input field brings up a popup window with a list of the documents already loaded into GATE Developer. This enables the user to add any documents to the corpus.
  2. Alternatively, the corpus can be loaded first, and documents added later by double clicking on the corpus and using the + and - icons to add or remove documents to the corpus. Note that the documents must have been loaded into GATE Developer before they can be added to the corpus.
  3. Once loaded, the corpus can be populated by right clicking on the corpus and selecting ‘Populate’. With this method, documents do not have to have been previously loaded into GATE Developer, as they will be loaded during the population process. If you right-click on your corpus in the resources pane, you will see that you have the option to ‘Populate’ the corpus. If you select this option, you will see a dialogue box in which you can specify a directory in which GATE will search for documents. You can specify the extensions allowable; for example, XML or TXT. This will restrict the corpus population to only those documents with the extensions you wish to load. You can choose whether to recurse through the directories contained within the target directory or restrict the population to those documents contained in the top level directory. Click on ‘OK’ to populate your corpus. This option provides a quick way to create a GATE Corpus from a directory of documents.

Additionally, right-clicking on a loaded document in the tree and selecting the ‘New corpus with this document’ option creates a new transient corpus named Corpus for document name containing just this document.

See also the movie for creating and populating corpora.


PIC


Figure 3.5: Corpus Editor


Double click on your corpus in the resources pane to see the corpus editor, shown in figure 3.5. You will see a list of the documents contained within the corpus.

In the top left of the corpus editor, plus and minus buttons allow you to add documents to the corpus from those already loaded into GATE and remove documents from the corpus (note that removing a document from a corpus does not remove it from GATE).

Up and down arrows at the top of the view allow you to reorder the documents in the corpus. The rightmost button in the view opens the currently selected document in a document editor.

At the bottom, you will see that tabs entitled ‘Initialisation Parameters’ and ‘Corpus Quality Assurance’ are also available in addition to the corpus editor tab you are currently looking at. Clicking on the ‘Initialisation Parameters’ tab allows you to view the initialisation parameters for the corpus. The ‘Corpus Quality Assurance’ tab allows you to calculate agreement measures between the annotations in your corpus. Agreement measures are discussed in depth in Chapter 10. The use of corpus quality assurance is discussed in Section 10.3.

3.4 Working with Annotations [#]

In this section, we will talk in more detail about viewing annotations, as well as creating and editing them manually. As discussed in at the start of the chapter, the main purpose of GATE is annotating documents. Whilst applications can be used to annotate the documents entirely automatically, annotation can also be done manually, e.g. by the user, or semi-automatically, by running an application over the corpus and then correcting/adding new annotations manually. Section 3.4.5 focuses on manual annotation. In Section 3.6 we talk about running processing resources on our documents. We begin by outlining the functionality around viewing annotations, organised by the GUI area to which the functionality pertains.

3.4.1 The Annotation Sets View [#]

To view the annotation sets, click on the ‘Annotation Sets’ button at the top of the document editor, or use the F3 key (see Section 3.9 for more keyboard shortcuts). This will bring up the annotation sets viewer, which displays the annotation sets available and their corresponding annotation types.

The annotation sets view is displayed on the left part of the document editor. It’s a tree-like view with a root for each annotation set. The first annotation set in the list is always a nameless set. This is the default annotation set. You can see in figure 3.4 that there is a drop-down arrow with no name beside it. Other annotation sets on the document shown in figure 3.4 are ‘Key’ and ‘Original markups’. Because the document is an XML document, the original XML markup is retained in the form of an annotation set. This annotation set is expanded, and you can see that there are annotations for ‘TEXT’, ‘body’, ‘font’, ‘html’, ‘p’, ‘table’, ‘td’ and ‘tr’.

To display all the annotations of one type, tick its checkbox or use the space key. The text segments corresponding to these annotations will be highlighted in the main text window. To delete an annotation type, use the delete key. To change the color, use the enter key. There is a context menu for all these actions that you can display by right-clicking on one annotation type, a selection or an annotation set.

If you keep shift key pressed when you open the annotation sets view, GATE Developer will try to select any annotations that were selected in the previous document viewed (if any); otherwise no annotation will be selected.

Having selected an annotation type in the annotation sets view, hovering over an annotation in the main resource viewer or right-clicking on it will bring up a popup box containing a list of the annotations associated with it, from which one can select an annotation to view in the annotation editor, or if there is only one, the annotation editor for that annotation. Figure 3.6 shows the annotation editor.


PIC


Figure 3.6: The Annotation Editor


3.4.2 The Annotations List View [#]

To view the annotations and their features, click on the ‘Annotations list’ button at the top or bottom of the main window or use F4 key. The annotation list viewer will appear above or below the main text, respectively. It will only contain the annotations selected from the annotation sets. These lists can be sorted in ascending and descending order by any column, by clicking on the corresponding column heading. Moreover you can hide a column by using the context menu with right-click. Clicking on an entry in the table will also highlight the respective matching text portion. Right-click on a row in this view to delete or edit an annotation.

3.4.3 The Annotations Stack View [#]


PIC


Figure 3.7: Annotations stack view centred on the document caret.


This view is similar to the annotations list view, but instead of displaying all the annotations of the document, it displays only annotations at the document caret position with some context before and after. The annotations are stacked from top to bottom, which gives a clear view when they are overlapping.

As the view is centred on the document caret, you can use the conventional keypresses to move it and update the view: notably the keys left and right to skip one letter; control + left/right to skip one word; up and down to go one line up or down; and use the document scrollbar then click in the document to move further. There are also two buttons at the top of the view that centre the view on the closest previous/next annotation boundary among all displayed. This is useful when you want to skip a region without annotation or when you want to reach the beginning or end of a very long annotation.

The annotation types displayed correspond to those selected in the annotation sets view. You can display feature values for an annotation rectangle by hovering the mouse on it or select only one feature to display by double-clicking on the annotation type in the first column.

Right-clicking on an annotation in the annotations stack view gives the option to edit that annotation.

3.4.4 The Co-reference Editor [#]


PIC


Figure 3.8: Co-reference editor inside a document editor. The popup window in the document under the word ‘EPSRC’ is used to add highlighted annotations to a co-reference chain. Here the annotation type ‘Organization’ of the annotation set ‘Default’ is highlighted and also the co-references ‘EC’ and ‘GATE’.


The co-reference editor allows co-reference chains (see Section 6.9) to be displayed and edited in GATE Developer. To display the co-reference editor, first open a document in GATE Developer, and then click on the Co-reference Editor button in the document viewer.

The combo box at the top of the co-reference editor allows you to choose which annotation set to display co-references for. If an annotation set contains no co-reference data, then the tree below the combo box will just show ‘Coreference Data’ and the name of the annotation set. However, when co-reference data does exist, a list of all the co-reference chains that are based on annotations in the currently selected set is displayed. The name of each co-reference chain in this list is the same as the text of whichever element in the chain is the longest. It is possible to highlight all the member annotations of any chain by selecting it in the list.

When a co-reference chain is selected, if the mouse is placed over one of its member annotations, then a pop-up box appears, giving the user the option of deleting the item from the chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it will be removed from the list of chains. If the name of the chain was derived from the item that was deleted, then the chain will be given a new name based on the next longest item in the chain.

A combo box near the top of the co-reference editor allows the user to select an annotation type from the current set. When the Show button is selected all the annotations of the selected type will be highlighted. Now when the mouse pointer is placed over one of those annotations, a pop-up box will appear giving the user the option of adding the annotation to a co-reference chain. The annotation can be added to an existing chain by typing the name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if the user presses the down cursor key, a list of all the existing annotations appears, together with the option [New Chain]. Selecting the [New Chain] option will cause a new chain to be created containing the selected annotation as its only element.

Each annotation can only be added to a single chain, but annotations of different types can be added to the same chain, and the same text can appear in more than one chain if it is referenced by two or more annotations.

The movie for inspecting results is also useful for learning about viewing annotations.

3.4.5 Creating and Editing Annotations [#]

To create annotations manually, select the text you want to annotate and hover the mouse on the selection. A popup will appear, allowing you to create an annotation, as shown in figure 3.9


PIC


Figure 3.9: Creating a New Annotation


The type of the annotation, by default, will be the same as the last annotation you created, unless there is none, in which case it will be ‘_New_’. You can enter any annotation type name you wish in the text box, unless you are using schema-driven annotation (see Section 3.4.6). You can add or change features and their values in the table below.

To delete an annotation, click on the red X icon at the top of the popup window. To grow/shrink the span of the annotation at its start use the two arrow icons on the left or right and left keys. Use the two arrow icons next on the right to change the annotation end or alt+right and alt+left keys. Add shift and control+shift keys to make the span increment bigger. The red X icon is for removing the annotation.

The pin icon is to pin the window so that it remains where it is. If you drag and drop the window, this automatically pins it too. Pinning it means that even if you select another annotation (by hovering over it in the main resource viewer) it will still stay in the same position.

The popup menu only contains annotation types present in the Annotation Schema and those already listed in the relevant Annotation Set. To create a new Annotation Schema, see Section 3.4.6. The popup menu can be edited to add a new annotation type, however.

The new annotation created will automatically be placed in the annotation set that has been selected (highlighted) by the user. To create a new annotation set, type the name of the new set to be created in the box below the list of annotation sets, and click on ‘New’.

Figure 3.10 demonstrates adding a ‘Organization’ annotation for the string ‘EPSRC’ (highlighted in green) to the default annotation set (blank name in the annotation set view on the right) and a feature name ‘type’ with a value about to be added.


PIC


Figure 3.10: Adding an Organization annotation to the Default Annotation Set


To add a second annotation to a selected piece of text, or to add an overlapping annotation to an existing one, press the CTRL key to avoid the existing annotation popup appearing, and then select the text and create the new annotation. Again by default the last annotation type to have been used will be displayed; change this to the new annotation type. When a piece of text has more than one annotation associated with it, on mouseover all the annotations will be displayed. Selecting one of them will bring up the relevant annotation popup.


PIC


Figure 3.11: Search and Annotate Function of the Annotation Editor.


To search and annotate the document automatically, use the search and annotate function as shown in figure 3.11:

Note that after using the [First] button you can move the caret in the document and use the [Next] button to avoid continuing the search from the beginning of the document. The [?] button at the end of the search text field will help you to build powerful regular expressions to search.

3.4.6 Schema-Driven Editing [#]

Annotation schemas allow annotation types and features to be pre-specified, so that during manual annotation, the relevant options appear on the drop-down lists in the annotation editor. You can see some example annotation schemas in Section 5.4.1. Annotation schemas provide a means to define types of annotations in GATE Developer. Basically this means that GATE Developer ‘knows about’ annotations defined in a schema.

Annotation schemas are supported by the ‘Annotation schema’ language resource in ANNIE, so to use them you must first ensure that the ‘ANNIE’ plugin is loaded (see Section 3.5). This will load a set of default schemas, as well as allowing you to load schemas of your own.

The default annotation schemas contain common named entities such as Person, Organisation, Location, etc. You can modify the existing schema or create a new one, in order to tell GATE Developer about other kinds of annotations you frequently use. You can still create annotations in GATE Developer without having specified them in an annotation schema, but you may then need to tell GATE Developer about the properties of that annotation type each time you create an annotation for it.

To load a schema of your own, right-click on ‘Language Resources’ in the resources pane. Select ‘New’ then ‘Annotation schema’. A popup box will appear in which you can browse to your annotation schema XML file.

An alternative annotation editor component is available which constrains the available annotation types and features much more tightly, based on the annotation schemas that are currently loaded. This is particularly useful when annotating large quantities of data or for use by less skilled users.

To use this, you must load the Schema_Annotation_Editor plugin. With this plugin loaded, the annotation editor will only offer the annotation types permitted by the currently loaded set of schemas, and when you select an annotation type only the features permitted by the schema are available to edit1. Where a feature is declared as having an enumerated type the available enumeration values are presented as an array of buttons, making it easy to select the required value quickly.

3.4.7 Printing Text with Annotations [#]

We suggest you to use your browser to print a document as GATE don’t propose a printing facility for the moment.

First save your document by right clicking on the document in the left resources tree then choose ‘Save Preserving Format’. You will get an XML file with all the annotations highlighted as XML tags plus the ‘Original markups’ annotations set.

Then add a stylesheet processing instruction at the beginning of the XML file:

<?xml version="1.0" encoding="UTF-8" ?>  
<?xml-stylesheet type="text/css" href="gate.css"?>

And create a file ‘gate.css’ in the same directory:

BODY, body { margin: 2em } /* or any other first level tag */  
P, p { display: block } /* or any other paragraph tag */  
/* ANNIE tags but you can use whatever tags you want */  
/* be careful that XML tags are case sensitive */  
Date         { background-color: rgb(230, 150, 150) }  
FirstPerson  { background-color: rgb(150, 230, 150) }  
Identifier   { background-color: rgb(150, 150, 230) }  
JobTitle     { background-color: rgb(150, 230, 230) }  
Location     { background-color: rgb(230, 150, 230) }  
Money        { background-color: rgb(230, 230, 150) }  
Organization { background-color: rgb(230, 200, 200) }  
Percent      { background-color: rgb(200, 230, 200) }  
Person       { background-color: rgb(200, 200, 230) }  
Title        { background-color: rgb(200, 230, 230) }  
Unknown      { background-color: rgb(230, 200, 230) }  
Etc          { background-color: rgb(230, 230, 200) }

Finally open the XML file in your browser and print it.

Note that overlapping annotations, cannot be expressed correctly with inline XML tags and thus won’t be displayed correctly.

3.5 Using CREOLE Plugins [#]

In GATE, processing resources are used to automatically create and manipulate annotations on documents. We will talk about processing resources in the next section. However, we must first introduce CREOLE plugins. In most cases, in order to use a particular processing resource (and certain language resources) you must first load the CREOLE plugin that contains it. This section talks about using CREOLE plugins. Then, in Section 3.6, we will talk about creating and using processing resources.

The definitions of CREOLE resources (e.g. processing resources such as taggers and parsers, see Chapter 4) are stored in CREOLE directories (directories containing an XML file describing the resources, the Java archive with the compiled executable code and whatever libraries are required by the resources).

Starting with version 3, CREOLE directories are called ‘CREOLE plugins’ or simply ‘plugins’. In previous versions, the CREOLE resources distributed with GATE used to be included in the monolithic gate.jar archive. Version 3 includes them as separate directories under the plugins directory of the distribution. This allows easy access to the linguistic resources used without the requirement to unpack the gate.jar file.

Plugins can have one or more of the following states in relation with GATE:

known
plugins are those plugins that the system knows about. These include all the plugins in the plugins directory of the GATE installation (the so–called installed plugins) as well all the plugins that were manually loaded from the user interface.
loaded
plugins are the plugins currently loaded in the system. All CREOLE resource types from the loaded plugins are available for use. All known plugins can easily be loaded and unloaded using the user interface.
auto-loadable
plugins are the list of plugins that the system loads automatically during initialisation.

The default location for installed plugins can be modified using the gate.plugins.home system property while the list of auto-loadable plugins can be set using the load.plugin.path property, see Section 2.3 above.

The CREOLE plugins can be managed through the graphical user interface which can be activated by selecting ‘Manage CREOLE Plugins’ from the ‘File’ menu. This will bring up a window listing all the known plugins. For each plugin there are two check-boxes – one labelled ‘Load now’, which will load the plugin, and the other labelled ‘Load always’ which will add the plugin to the list of auto-loadable plugins. A ‘Delete’ button is also provided – which will remove the plugin from the list of known plugins. This operation does not delete the actual plugin directory. Installed plugins are found automatically when GATE is started; if an installed plugin is deleted from the list, it will re-appear next time GATE is launched.


PIC


Figure 3.12: Plugin Management Console


If you select a plugin, you will see in the pane on the right the list of resources that plugin contains. For example, in figure 3.12, the ‘Alignment’ plugin is selected, and you can see that it contains ten processing resources; ‘Compound Document’, ‘Compound Document From Xml’, ‘Compound Document Editor’, ‘GATE Composite document’ etc. If you wish to use a particular resource you will have to ascertain which plugin contains it. This list can be useful for that. Alternatively, the GATE website provides a directory of plugins and their processing resources.

Having loaded the plugins you need, the resources they define will be available for use. Typically, to the GATE Developer user, this means that they will appear on the ‘New’ menu when you right-click on ‘Processing Resources’ in the resources pane, although some special plugins have different effects; for example, the Schema_Annotation_Editor (see Section 3.4.6).

3.6 Loading and Using Processing Resources [#]

This section describes how to load and run CREOLE resources not present in ANNIE. To load ANNIE, see Section 3.7.2. For technical descriptions of these resources, see the appropriate chapter in Part III (e.g. Chapter 19). First ensure that the necessary plugins have been loaded (see Section 3.5). If the resource you require does not appear in the list of Processing Resources, then you probably do not have the necessary plugin loaded. Processing resources are loaded by selecting them from the set of Processing Resources: right click on Processing Resources or select ‘New Processing Resource’ from the File menu.

For example, use the Plugin Console Manager to load the ‘Tools’ plugin. When you right click on ‘Processing Resources’ in the resources pane and select ‘New’ you have the option to create any of the processing resources that plugin provides. You may choose to create a ‘GATE Morphological Analyser’, with the default parameters. Having done this, an instance of the GATE Morphological Analyser appears under ‘Processing Resources’. This processing resource, or PR, is now available to use. Double-clicking on it in the resources pane reveals its initialisation parameters, see figure 3.13.


PIC


Figure 3.13: GATE Morphological Analyser Initialisation Parameters


This processing resource is now available to be added to applications. It must be added to an application before it can be applied to documents. You may create as many of a particular processing resource as you wish, for example with different initialisation parameters. Section 3.7 talks about creating and running applications.

See also the movie for loading processing resources.

3.7 Creating and Running an Application [#]

Once all the resources you need have been loaded, an application can be created from them, and run on your corpus. Right click on ‘Applications’ and select ‘New’ and then either ‘Corpus Pipeline’ or ‘Pipeline’. A pipeline application can only be run over a single document, while a corpus pipeline can be run over a whole corpus.

To build the pipeline, double click on it, and select the resources needed to run the application (you may not necessarily wish to use all those which have been loaded). Transfer the necessary components from the set of ‘loaded components’ displayed on the left hand side of the main window to the set of ‘selected components’ on the right, by selecting each component and clicking on the left and right arrows, or by double-clicking on each component. Ensure that the components selected are listed in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the left side of the pane. Ensure that any parameters necessary are set for each processing resource (by clicking on the resource from the list of selected resources and checking the relevant parameters from the pane below). For example, if you wish to use annotation sets other than the Default one, these must be defined for each processing resource. Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-down menu beside the ‘corpus’ box. If a pipeline is used, the document must be selected for each processing resource used. Finally, right-click on ‘Run’ to run the application on the document or corpus.

See also the movie for loading and running processing resources.

For how to use the conditional versions of the pipelines see Section 3.7.1 and for saving/restoring the configuration of an application see Section 3.8.3.

3.7.1 Running PRs Conditionally on Document Features [#]

The ‘Conditional Pipeline’ and ‘Conditional Corpus Pipeline’ application types are conditional versions of the pipelines mentioned in Section 3.7 and allow processing resources to be run or not according to the value of a feature on the document. In terms of graphical interface, the only addition brought by the conditional versions of the applications is a box situated underneath the lists of available and selected resources which allows the user to choose whether the currently selected processing resource will run always, never or only on the documents that have a particular value for a named feature.

If the Yes option is selected then the corresponding resource will be run on all the documents processed by the application as in the case of non- conditional applications. If the No option is selected then the corresponding resource will never be run; the application will simply ignore its presence. This option can be used to temporarily and quickly disable an application component, for debugging purposes for example.

The If value of feature option permits running specific application components conditionally on document features. When selected, this option enables two text input fields that are used to enter the name of a feature and the value of that feature for which the corresponding processing resource will be run. When a conditional application is run over a document, for each component that has an associated condition, the value of the named feature is checked on the document and the component will only be used if the value entered by the user matches the one contained in the document features.

3.7.2 Doing Information Extraction with ANNIE [#]

This section describes how to load and run ANNIE (see Chapter 6) from GATE Developer. ANNIE is a good place to start because it provides a complete information extraction application, that you can run on any corpus. You can then view the effects.

From the File menu, select ‘Load ANNIE System’. To run it in its default state, choose ‘with Defaults’. This will automatically load all the ANNIE resources, and create a corpus pipeline called ANNIE with the correct resources selected in the right order, and the default input and output annotation sets.

If ‘without Defaults’ is selected, the same processing resources will be loaded, but a popup window will appear for each resource, which enables the user to specify a name and location for the resource. This is exactly the same procedure as for loading a processing resource individually, the difference being that the system automatically selects those resources contained within ANNIE. When the resources have been loaded, a corpus pipeline called ANNIE will be created as before.

The next step is to add a corpus (see Section 3.3), and select this corpus from the drop-down corpus menu in the Serial Application editor. Finally click on ‘Run’ from the Serial Application editor, or by right clicking on the application name in the resources pane and selecting ‘Run’. (Many people prefer to switch to the messages tab, then run their application by right-clicking on it in the resources pane, because then it is possible to monitor any messages that appear whilst the application is running.)

To view the results, double click on the filename in the left hand pane. No annotation sets nor annotations will be shown until annotations are selected in the annotation sets; the ‘Default’ set is indicated only with an unlabelled right-arrowhead which must be selected in order to make visible the available annotations. Open the default annotation set and select some of the annotations to see what the ANNIE application has done.

See also the movie for loading and running ANNIE.

3.7.3 Modifying ANNIE [#]

You will find the ANNIE resources in gate/plugins/ANNIE/resources. Simply locate the existing resources you want to modify, make a copy with a new name, edit them, and load the new resources into GATE as new Processing Resources (see Section 3.6).

3.8 Saving Applications and Language Resources [#]

In this section, we will describe how applications and language resources can be saved for use outside of GATE and for use with GATE at a later time. Section 3.8.1 talks about saving documents to file. Section 3.8.2 outlines how to use datastores. Section 3.8.3 talks about saving resource parameter states, and Section 3.8.4 talks about exporting applications.

3.8.1 Saving Documents to File [#]

There are three main ways to save annotated documents:

  1. preserving the original markup, with optional added annotations;
  2. in GATE’s own XML serialisation format (including all the annotations on the document);
  3. by writing your own dump algorithm as a processing resource.

This section describes how to use the first two options.

Both types of data export are available in the popup menu triggered by right-clicking on a document in the resources tree (see Section 3.1): type 1 is called ‘Save Preserving Format’ and type 2 is called ‘Save as XML’.

Selecting the save as XML option leads to a file open dialogue; give the name of the file you want to create, and the whole document and all its data will be exported to that file. If you later create a document from that file, the state will be restored. (Note: because GATE’s annotation model is richer than that of XML, and because our XML dump implementation sometimes cuts corners2, the state may not be identical after restoration. If your intention is to store the state for later use, use a DataStore instead.)

The ‘Save Preserving Format’ option also leads to a file dialogue; give a name and the data you require will be dumped into the file. The action can be used for documents that were created from files using the XML or HTML format. It will save all the original tags as well as the document annotations that are currently displayed in the ‘Annotations List’ view. This option is useful for selectively saving only some annotation types.

The annotations are saved as normal document tags, using the annotation type as the tag name. If the advanced option ‘Include annotation features for “Save Preserving Format”’ (see Section 2.4) is set to true, then the annotation features will also be saved as tag attributes.

Using this operation for GATE documents that were not created from an HTML or XML file results in a plain text file, with in-line tags for the saved annotations.

Note that GATE’s model of annotation allows graph structures, which are difficult to represent in XML (XML is a tree-structured representation format). During the dump process, annotations that cross each other in ways that cannot be represented in legal XML will be discarded, and a warning message printed.

3.8.2 Saving and Restoring LRs in Datastores [#]

Where corpora are large, the memory available may not be sufficient to have all documents open simultaneously. The datastore functionality provides the option to save documents to disk and open them only one at a time for processing. This means that much larger corpora can be used. A datastore can also be useful for saving documents in an efficient and lossless way.

To save a text in a datastore, a new datastore must first be created if one does not already exist. Create a datastore by right clicking on Datastore in the left hand pane, and select the option ‘Create Datastore’. Select the data store type you wish to use. Create a directory to be used as the datastore (note that the datastore is a directory and not a file).

You can either save a whole corpus to the datastore (in which case the structure of the corpus will be preserved) or you can save individual documents. The recommended method is to save the whole corpus. To save a corpus, right click on the corpus name and select the ‘Save to...’ option (giving the name of the datastore created earlier). To save individual documents to the datastore, right clicking on each document name and follow the same procedure.

To load a document from a datastore, do not try to load it as a language resource. Instead, open the datastore by right clicking on Datastore in the left hand pane, select ‘Open Datastore’ and choose the datastore to open. The datastore tree will appear in the main window. Double click on a corpus or document in this tree to open it. To save a corpus and document back to the same datastore, simply select the ‘Save’ option.

See also the movie for creating a datastore and the movie for loading corpus and documents from a datastore.

3.8.3 Saving Resource Parameter States to File [#]

Resources, and applications that are made up of them, are created based on the settings of their parameters (see Section 3.6). It is possible to save the data used to create an application to a file and re-load it later. To save the application to a file, right click on it in the resources tree and select ‘Save application state’, which will give you a file creation dialogue.

To restore the application later, select ‘Restore application from file’ from the ‘File’ menu.

Note that the data that is saved represents how to recreate an application – not the resources that make up the application itself. So, for example, if your application has a resource that initialises itself from some file (e.g. a grammar, a document) then that file must still exist when you restore the application.

In case you don’t want to save the corpus configuration associated with the application then you must select ‘<none>’ in the corpus list of the application before saving the application.

The file resulting from saving the application state contains the values of the initialisation parameters for all the processing resources contained by the stored application. For the parameters of type URL (which are typically used to select external resources such as grammars or rules files) a transformation is applied so that all the paths are relative to the location of the file used to store the state. This means that the resource files used by an application do not need to be in the same location as when the application was initially created but rather in the same location relative to the location of the application file. This allows the creation and deployment of portable applications by keeping the application file and the resource files used by the application together.

If you want to save your application along with all the resources it requires you can use the ‘Export for Teamware’ option (see Section 3.8.4).

See also the movie for saving and restoring applications.

3.8.4 Saving an Application with its Resources (e.g. GATE Teamware) [#]

When you save an application using the ‘Save application state’ option (see Section 3.8.3), the saved file contains references to the plugins that were loaded when the application was saved, and to any resource files required by the application. To be able to reload the file, these plugins and other dependencies must exist at the same locations (relative to the saved state file). While this is fine for saving and loading applications on a single machine it means that if you want to package your application to run it elsewhere (e.g. deploy it to a GATE Teamware installation) then you need to be careful to include all the resource files and plugins at the right locations in your package. The ‘Export for Teamware’ option on the right-click menu for an application helps to automate this process.

When you export an application in this way, GATE Developer produces a ZIP file containing the saved application state (in the same format as ‘Save application state’). Any plugins and resource files that the application refers to are also included in the zip file, and the relative paths in the saved state are rewritten to point to the correct locations within the package. The resulting package is therefore self-contained and can be copied to another machine and unpacked there, or passed to your Teamware Administrator for deployment.

As well as selecting the location where you want to save the package, the ‘Export for Teamware’ option will also prompt you to select the annotation sets that your application uses for input and output. For example, if your application makes use of the unpacked XML markup in source documents and creates annotations in the default set then you would select ‘Original markups’ as an input set and the ‘<Default annotation set>’ as an output set. GATE Developer will try to make an educated guess at the correct sets but you should check and amend the lists as necessary.

There are a few important points to note about the export process:

If you require more flexibility than this option provides you should read Section E.2, which describes the underlying Ant task that the exporter uses.

3.9 Keyboard Shortcuts [#]

You can use various keyboard shortcuts for common tasks in GATE Developer. These are listed in this section.

General (Section 3.1):

Resources tree (Section 3.1):

Document editor (Section 3.2):

Annotation editor (Section 3.4):

Annic/Lucene datastore (Chapter 9):

Annic/Lucene query text field (Chapter 9):

3.10 Miscellaneous [#]

3.10.1 Stopping GATE from Restoring Developer Sessions/Options [#]

GATE can remember Developer options and the state of the resource tree when it exits. The options are saved by default; the session state is not saved by default. This default behaviour can be changed from the ‘Advanced’ tab of the ‘Configuration’ choice on the ‘Options’ menu.

If a problem occurs and the saved data prevents GATE Developer from starting, you can fix this by deleting the configuration and session data files. These are stored in your home directory, and are called gate.xml and gate.sesssion or .gate.xml and .gate.sesssion depending on platform. On Windows your home is:

95, 98, NT:
Windows Directory/profiles/username
2000, XP:
Windows Drive/Documents and Settings/username

3.10.2 Working with Unicode [#]

GATE provides various facilities for working with Unicode beyond those that come as default with Java3:

  1. a Unicode editor with input methods for many languages;
  2. use of the input methods in all places where text is edited in the GUI;
  3. a development kit for implementing input methods;
  4. ability to read diverse character encodings.

1 using the editor:
In GATE Developer, select ‘Unicode editor’ from the ‘Tools’ menu. This will display an editor window, and, when a language with a custom input method is selected for input (see next section), a virtual keyboard window with the characters of the language assigned to the keys on the keyboard. You can enter data either by typing as normal, or with mouse clicks on the virtual keyboard.

2 configuring input methods:
In the editor and in GATE Developer’s main window, the ‘Options’ menu has an ‘Input methods’ choice. All supported input languages (a superset of the JDK languages) are available here. Note that you need to use a font capable of displaying the language you select. By default GATE Developer will choose a Unicode font if it can find one on the platform you’re running on. Otherwise, select a font manually from the ‘Options’ menu ‘Configuration’ choice.

3 using the development kit:
GUK, the GATE Unicode Kit, is documented at http://gate.ac.uk/gate/doc/javadoc/guk/package-summary.html.

4 reading different character encodings:
When you create a document from a URL pointing to textual data in GATE, you have to tell the system what character encoding the text is stored in. By default, GATE will set this parameter to be the empty string. This tells Java to use the default encoding for whatever platform it is running on at the time – e.g. on Western versions of Windows this will be ISO-8859-1, and Eastern ones ISO-8859-9. A popular way to store Unicode documents is in UTF-8, which is a superset of ASCII (but can still store all Unicode data); if you get an error message about document I/O during reading, try setting the encoding to UTF-8, or some other locally popular encoding. (To see a list of available encodings, try opening a document in GATE’s unicode editor – you will be prompted to select an encoding.)

Chapter 4
CREOLE: the GATE Component Model [#]

…Noam Chomsky’s answer in Secrets, Lies and Democracy (David Barsamian 1994; Odonian) to ‘What do you think about the Internet?’

‘I think that there are good things about it, but there are also aspects of it that concern and worry me. This is an intuitive response – I can’t prove it – but my feeling is that, since people aren’t Martians or robots, direct face-to-face contact is an extremely important part of human life. It helps develop self-understanding and the growth of a healthy personality.

‘You just have a different relationship to somebody when you’re looking at them than you do when you’re punching away at a keyboard and some symbols come back. I suspect that extending that form of abstract and remote relationship, instead of direct, personal contact, is going to have unpleasant effects on what people are like. It will diminish their humanity, I think.’

Chomsky, quoted at http://photo.net/wtr/dead-trees/53015.htm.

The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun, as described in section 4.2.

GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.

GATE components are one of three types:

The distinction between language resources and processing resources is explored more fully in section C.1.1. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.

In the rest of this chapter:

4.1 The Web and CREOLE [#]

GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations and XML for configuration of resources (and GATE itself).

Resource implementations are grouped together as ‘plugins’, stored at a URL (when the resources are in the local file system this can be a file:/ URL). When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL and uses the contents of this file to determine what resources this plugin declares and where to find the classes that implement the resource types (typically these classes are stored in a JAR file in the plugin directory). Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.

Language resource data can be stored in binary serialised form in the local file system.

4.2 The GATE Framework [#]

We can think of the GATE framework as a backplane into which users can plug CREOLE components. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system.

The backplane performs these functions:

A set of components plus the framework is a deployment unit which can be embedded in another application.

At their most basic, all GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions:

GATE uses Java Beans conventions to construct and configure resources at runtime, and defines interfaces that different component types must implement.

4.3 The Lifecycle of a CREOLE Resource [#]

CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using GATE Developer, resources can be loaded and viewed via the resources tree (left pane) and the ‘create resource’ mechanism. When programming with GATE Embedded, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see Chapter 6) does; in this case you will use GATE Developer to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.

The various phases may be summarised as:

Creating a new resource from scratch (bootstrapping).
To create the binary image of a resource (a Java class in a JAR file), and the XML file that describes the resource to GATE, you need to create the appropriate .java file(s), compile them and package them as a .jar. GATE provides a bootstrap tool to start this process – see Section 7.10. Alternatively you can simply copy code from an existing resource.
Instantiating a resource in GATE Embedded.
To create a resource in your own Java code, use GATE’s Factory class (this takes care of parameterising the resource, restoring it from a database where appropriate, etc. etc.). Section 7.2 describes how to do this.
Loading a resource into GATE Developer.
To load a resource into GATE Developer, use the various ‘New ... resource’ options from the File menu and elsewhere. See Section 3.1.
Resource configuration and implementation.
GATE’s bootstrap tool will create an empty resource that does nothing. In order to achieve the behaviour you require, you’ll need to change the configuration of the resource (by editing the creole.xml file) and/or change the Java code that implements the resource. See section 4.7.

4.4 Processing Resources and Applications [#]

PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In GATE, applications are called ‘controllers’ accordingly.

Currently only sequential, or pipeline, execution is supported. There are two main types of pipeline:

Simple pipelines
simply group a set of PRs together in order and execute them in turn. The implementing class is called SerialController.
Corpus pipelines
are specific for LanguageAnalysers – PRs that are applied to documents and corpora. A corpus pipeline opens each document in the corpus in turn, sets that document as a runtime parameter on each PR, runs all the PRs on the corpus, then closes the document. The implementing class is called SerialAnalyserController.

Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.7.1 for how to use these.

Controllers are themselves PRs – in particular a simple pipeline is a standard PR and a corpus pipeline is a LanguageAnalyser – so one pipeline can be nested in another. This is particularly useful with conditional controllers to group together a set of PRs that can all be turned on or off as a group.

There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process, are simply ignored and the execution moves to the next document after the timeout interval has lapsed.

All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application’s execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.

4.5 Language Resources and Datastores [#]

Language Resources can be stored in Datastores. Datastores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:

Serial Datastores
are based on Java’s serialisation system, and store data directly into files and directories.
Lucene Datastores
is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Datastores. See Section 9 for more details.

4.6 Built-in CREOLE Resources [#]

GATE comes with various built-in components:

4.7 CREOLE Resource Configuration [#]

This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory containing an XML configuration file called creole.xml. Configuration data for the plugin’s resources can be given in the creole.xml file or directly in the Java source file using Java 5 annotations.

A creole.xml file has a root element <CREOLE-DIRECTORY>, but the further contents of this element depend on the configuration style. The following three sections discuss the different styles – all-XML, all-annotations and a mixture of the two.

4.7.1 Configuration with XML [#]

To configure your resources in the creole.xml file, the <CREOLE-DIRECTORY> element should contain one <RESOURCE> element for each resource type in the plugin. The <RESOURCE> elements may optionally be contained within a <CREOLE> element (to allow a single creole.xml file to be built up by concatenating multiple separate files). For example:

<CREOLE-DIRECTORY>  
 
<CREOLE>  
  <RESOURCE>  
    <NAME>Minipar Wrapper</NAME>  
    <JAR>MiniparWrapper.jar</JAR>  
    <CLASS>minipar.Minipar</CLASS>  
    <COMMENT>MiniPar is a shallow parser. It determines the  
    dependency relationships between the words of a sentence.</COMMENT>  
    <HELPURL>http://gate.ac.uk/cgi-bin/userguide/sec:parsers:minipar</HELPURL>  
    <PARAMETER NAME="document"  
  RUNTIME="true"  
  COMMENT="document to process">gate.Document</PARAMETER>  
    <PARAMETER NAME="miniparDataDir"  
        RUNTIME="true"  
        COMMENT="location of the Minipar data directory">  
        java.net.URL  
    </PARAMETER>  
    <PARAMETER NAME="miniparBinary"  
        RUNTIME="true"  
        COMMENT="Name of the Minipar command file">  
        java.net.URL  
    </PARAMETER>  
    <PARAMETER NAME="annotationInputSetName"  
        RUNTIME="true"  
        OPTIONAL="true"  
        COMMENT="Name of the input Source">  
        java.lang.String  
    </PARAMETER>  
    <PARAMETER NAME="annotationOutputSetName"  
        RUNTIME="true"  
        OPTIONAL="true"  
        COMMENT="Name of the output AnnotationSetName">  
        java.lang.String  
    </PARAMETER>  
    <PARAMETER NAME="annotationTypeName"  
        RUNTIME="false"  
        DEFAULT="DepTreeNode"  
        COMMENT="Annotations to store with this type">  
        java.lang.String  
    </PARAMETER>  
  </RESOURCE>  
</CREOLE>  
</CREOLE-DIRECTORY>

Basic Resource-Level Data

Each resource must give a name, a Java class and the JAR file that it can be loaded from. The above example is taken from the Parser_Minipar plugin, and defines a single resource with a number of parameters.

The full list of valid elements under <RESOURCE> is as follows:

NAME
the name of the resource, as it will appear in the ‘New’ menu in GATE Developer. If omitted, defaults to the bare name of the resource class (without a package name).
CLASS
the fully qualified name of the Java class that implements this resource.
JAR
names JAR files required by this resource (paths are relative to the location of creole.xml). Typically this will be the JAR file containing the class named by the <CLASS> element, but additional <JAR> elements can be used to name third-party JAR files that the resource depends on.
COMMENT
a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in GATE Developer. If omitted, no comment is used.
HELPURL
a URL to a help document on the web for this resource. It is used in the help browser inside GATE Developer.
INTERFACE
the interface type implemented by this resource, for example new types of document would specify <INTERFACE>gate.Document</INTERFACE>.
ICON
the icon used to represent this resource in GATE Developer. This is a path inside the plugin’s JAR file, for example <ICON>/some/package/icon.png</ICON>. If the path specified does not start with a forward slash, it is assumed to name an icon from the GATE default set, which is located in gate.jar at gate/resources/img. If no icon is specified, a generic language resource or processing resource icon (as appropriate) is used.
PRIVATE
if present, this resource type is hidden in the GATE Developer GUI, i.e. it is not shown in the ‘New’ menus. This is useful for resource types that are intended to be created internally by other resources, or for resources that have parameters of a type that cannot be set in the GUI. <PRIVATE/> resources can still be created in Java code using the Factory.
AUTOINSTANCE (and HIDDEN-AUTOINSTANCE)
tells GATE to automatically create instances of this resource when the plugin is loaded. Any number of auto instances may be defined, GATE will create them all. Each <AUTOINSTANCE> element may optionally contain <PARAM NAME="..." VALUE="..." /> elements giving parameter values to use when creating the instance. Any parameters not specified explicitly will take their default values. Use <HIDDEN-AUTOINSTANCE> if you want the auto instances not to show up in GATE Developer – this is useful for things like document formats where there should only ever be a single instance in GATE and that instance should not be deleted.

For visual resources, a <GUI> element should also be provided. This takes a TYPE attribute, which can have the value LARGE or SMALL. LARGE means that the visual resource is a large viewer and should appear in the main part of the GATE Developer window on the right hand side, SMALL means the VR is a small viewer which appears in the space below the resources tree in the bottom left. The <GUI> element supports the following sub-elements:

RESOURCE_DISPLAYED
the type of GATE resource this VR can display. Any resource whose type is assignable to this type will be displayed with this viewer, so for example a VR that can display all types of document would specify gate.Document, whereas a VR that can only display the default GATE document implementation would specify gate.corpora.DocumentImpl.
MAIN_VIEWER
if present, GATE will consider this VR to be the ‘most important’ viewer for the given resource type, and will ensure that if several different viewers are all applicable to this resource, this viewer will be the one that is initially visible.

For annotation viewers, you should specify an <ANNOTATION_TYPE_DISPLAYED> element giving the annotation type that the viewer can display (e.g. Sentence).

Resource Parameters

Resources may also have parameters of various types. These resources, from the GATE distribution, illustrate the various types of parameters:

<RESOURCE>  
  <NAME>GATE document</NAME>  
  <CLASS>gate.corpora.DocumentImpl</CLASS>  
  <INTERFACE>gate.Document</INTERFACE>  
  <COMMENT>GATE transient document</COMMENT>  
  <OR>  
    <PARAMETER NAME="sourceUrl"  
      SUFFIXES="txt;text;xml;xhtm;xhtml;html;htm;sgml;sgm;mail;email;eml;rtf"  
      COMMENT="Source URL">java.net.URL</PARAMETER>  
    <PARAMETER NAME="stringContent"  
      COMMENT="The content of the document">java.lang.String</PARAMETER>  
  </OR>  
  <PARAMETER  
    COMMENT="Should the document read the original markup"  
    NAME="markupAware" DEFAULT="true">java.lang.Boolean</PARAMETER>  
  <PARAMETER NAME="encoding" OPTIONAL="true"  
    COMMENT="Encoding" DEFAULT="">java.lang.String</PARAMETER>  
  <PARAMETER NAME="sourceUrlStartOffset"  
    COMMENT="Start offset for documents based on ranges"  
    OPTIONAL="true">java.lang.Long</PARAMETER>  
  <PARAMETER NAME="sourceUrlEndOffset"  
    COMMENT="End offset for documents based on ranges"  
    OPTIONAL="true">java.lang.Long</PARAMETER>  
  <PARAMETER NAME="preserveOriginalContent"  
    COMMENT="Should the document preserve the original content"  
    DEFAULT="false">java.lang.Boolean</PARAMETER>  
  <PARAMETER NAME="collectRepositioningInfo"  
    COMMENT="Should the document collect repositioning information"  
    DEFAULT="false">java.lang.Boolean</PARAMETER>  
  <ICON>lr.gif</ICON>  
</RESOURCE>

<RESOURCE>  
  <NAME>Document Reset PR</NAME>  
  <CLASS>gate.creole.annotdelete.AnnotationDeletePR</CLASS>  
  <COMMENT>Document cleaner</COMMENT>  
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>  
  <PARAMETER NAME="annotationTypes" RUNTIME="true"  
    OPTIONAL="true">java.util.ArrayList</PARAMETER>  
</RESOURCE>

Parameters may be optional, and may have default values (and may have comments to describe their purpose, which is displayed by GATE Developer during interactive parameter setting).

Some PR parameters are execution time (RUNTIME), some are initialisation time. E.g. at execution time a doc is supplied to a language analyser; at initialisation time a grammar may be supplied to a language analyser.

The <PARAMETER> tag takes the following attributes:

NAME:
name of the JavaBean property that the parameter refers to, i.e. for a parameter named ‘someParam’ the class must have setSomeParam and getSomeParam methods.1
DEFAULT:
default value (see below).
RUNTIME:
doesn’t need setting at initialisation time, but must be set before calling execute(). Only meaningful for PRs
OPTIONAL:
not required
COMMENT:
for display purposes
ITEM_CLASS_NAME:
(only applies to parameters whose type is java.util.Collection or a type that implements or extends this) this specifies the type of elements the collection contains, so GATE can use the right type when parameters are set. If omitted, GATE will pass in the elements as Strings.
SUFFIXES:
(only applies to parameters of type java.net.URL) a semicolon-separated list of file suffixes that this parameter typically accepts, used as a filter in the file chooser provided by GATE Developer to select a local file as the parameter value.

It is possible for two or more parameters to be mutually exclusive (i.e. a user must specify one or the other but not both). In this case the <PARAMETER> elements should be grouped together under an <OR> element.

The type of the parameter is specified as the text of the <PARAMETER> element, and the type supplied must match the return type of the parameter’s get method. Any reference type (class, interface or enum) may be used as the parameter type, including other resource types – in this case GATE Developer will offer a list of the loaded instances of that resource as options for the parameter value. Primitive types (char, boolean, …) are not supported, instead you should use the corresponding wrapper type (java.lang.Character, java.lang.Boolean, …). If the getter returns a parameterized type (e.g. List<Integer>) you should just specify the raw type (java.util.List) here2.

The DEFAULT string is converted to the appropriate type for the parameter - java.lang.String parameters use the value directly, primitive wrapper types e.g. java.lang.Integer use their respective valueOf methods, and other built-in Java types can have defaults specified provided they have a constructor taking a String.

The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g. http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file. Thus a DEFAULT of ‘resources/main.jape’ in the file file:/opt/MyPlugin/creole.xml is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.

For Collection-valued parameters multiple values may be specified, separated by semicolons, e.g. ‘foo;bar;baz’; if the parameter’s type is an interface – Collection or one of its sub-interfaces (e.g. List) – a suitable concrete class (e.g. ArrayList, HashSet) will be chosen automatically for the default value.

For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g. ‘kind=word;orth=upperInitial’. For enum-valued parameters the default string is taken as the name of the enum constant to use. Finally, if no DEFAULT attribute is specified, the default value is null.

4.7.2 Configuring Resources using Annotations [#]

As an alternative to the XML configuration style, GATE provides Java 5 annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and mapped into the equivalent entries in creole.xml before parsing. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation.

For more detailed information, see the JavaDoc documentation for gate.creole.metadata.

To use annotation-driven configuration a creole.xml file is still required but it need only contain the following:

<CREOLE-DIRECTORY>  
  <JAR SCAN="true">myPlugin.jar</JAR>  
  <JAR>lib/thirdPartyLib.jar</JAR>  
</CREOLE-DIRECTORY>

This tells GATE to load myPlugin.jar and scan its contents looking for resource classes annotated with @CreoleResource. Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".

Basic Resource-Level Data

To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:

 
1import gate.creole.AbstractLanguageAnalyser; 
2import gate.creole.metadata.*; 
3 
4@CreoleResource(name = "GATE Tokeniser", 
5                comment = "Splits text into tokens and spaces") 
6public class Tokeniser extends AbstractLanguageAnalyser { 
7  ...

The @CreoleResource annotation provides slots for all the values that can be specified under <RESOURCE> in creole.xml, except <CLASS> (inferred from the name of the annotated class) and <JAR> (taken to be the JAR containing the class):

name
(String) the name of the resource, as it will appear in the ‘New’ menu in GATE Developer. If omitted, defaults to the bare name of the resource class (without a package name). (XML equivalent <NAME>)
comment
(String) a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in GATE Developer. If omitted, no comment is used. (XML equivalent <COMMENT>)
helpURL
(String) a URL to a help document on the web for this resource. It is used in the help browser inside GATE Developer. (XML equivalent <HELPURL>)
isPrivate
(boolean) should this resource type be hidden from the GATE Developer GUI, so it does not appear in the ‘New’ menus? If omitted, defaults to false (i.e. not hidden). (XML equivalent <PRIVATE/>)
icon
(String) the icon to use to represent the resource in GATE Developer. If omitted, a generic language resource or processing resource icon is used. (XML equivalent <ICON>, see the description above for details)
interfaceName
(String) the interface type implemented by this resource, for example a new type of document would specify "gate.Document" here. (XML equivalent <INTERFACE>)
autoInstances
(array of @AutoInstance annotations) definitions for any instances of this resource that should be created automatically when the plugin is loaded. If omitted, no auto-instances are created by default. (XML equivalent, one or more <AUTOINSTANCE> and/or <HIDDEN-AUTOINSTANCE> elements, see the description above for details)

For visual resources only, the following elements are also available:

guiType
(GuiType enum) the type of GUI this resource defines. (XML equivalent <GUI TYPE="LARGE|SMALL">)
resourceDisplayed
(String) the class name of the resource type that this VR displays, e.g. "gate.Corpus". (XML equivalent <RESOURCE_DISPLAYED>)
mainViewer
(boolean) is this VR the ‘most important’ viewer for its displayed resource type? (XML equivalent <MAIN_VIEWER/>, see above for details)

For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).

Resource Parameters

Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:

  @CreoleParameter(comment = "The location of the list of abbreviations")  
  public void setAbbrListUrl(URL listUrl) {  
    ...

GATE will infer the parameter’s name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter’s type is inferred from the type of the method parameter (java.net.URL in this case).

The annotation elements of @CreoleParameter correspond to the attributes of the <PARAMETER> tag in the XML configuration style:

comment
(String) an optional descriptive comment about the parameter. (XML equivalent COMMENT)
defaultValue
(String) the optional default value for this parameter. The value is specified as a string but is converted to the relevant type by GATE according to the conversions described in the previous section. Note that relative path default values for URL-valued parameters are still relative to the location of the creole.xml file, not the annotated class. (XML equivalent DEFAULT)
suffixes
(String) for URL-valued parameters, a semicolon-separated list of default file suffixes that this parameter accepts. (XML equivalent SUFFIXES)
collectionElementType
(Class) for Collection-valued parameters, the type of the elements in the collection. This can usually be inferred from the generic type information, for example public void setIndices(List<Integer> indices), but must be specified if the set method’s parameter has a raw (non-parameterized) type. (XML equivalent ITEM_CLASS_NAME)

Mutually-exclusive parameters (such as would be grouped in an <OR> in creole.xml) are handled by adding a disjunction="label" to the @CreoleParameter annotation – all parameters that share the same label are grouped in the same disjunction.

Optional and runtime parameters are marked using extra annotations, for example:

 
1  @Optional 
2  @RunTime 
3  @CreoleParameter 
4  public void setAnnotationSetName(String asName) { 
5    ...

Inheritance

Unlike with pure XML configuration, when using annotations a resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class.

The resource name and the isPrivate and mainViewer flags are not inherited.

Parameter definitions are inherited in a similar way. This is one of the big advantages of annotation configuration over pure XML – if one resource class extends another then with pure XML configuration all the parent class’s parameter definitions must be duplicated in the subclass’s creole.xml definition. With annotations, parameters are inherited from the parent class (and its parent, etc.) as well as from any interfaces implemented. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.

Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:

 
1  @HiddenCreoleParameter 
2  public void setSomeParam(String someParam) { 
3    super.setSomeParam(someParam); 
4  }

The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.

Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:

 
1  // superclass 
2  @CreoleParameter(comment = "Location of the grammar file", 
3                   suffixes = "jape") 
4  public void setGrammarUrl(URL grammarLocation) { 
5    ... 
6  } 
7 
8  @Optional 
9  @RunTime 
10  @CreoleParameter(comment = "Feature to set on success") 
11  public void setSuccessFeature(String name) { 
12    ... 
13  }
 
1  //----------------------------------- 
2  // subclass 
3 
4  // override the default value, inherit everything else 
5  @CreoleParameter(defaultValue = "resources/defaultGrammar.jape") 
6  public void setGrammarUrl(URL url) { 
7    super.setGrammarUrl(url); 
8  } 
9 
10  // we want the parameter to be required in the subclass 
11  @Optional(false) 
12  @CreoleParameter 
13  public void setSuccessFeature(String name) { 
14    super.setSuccessFeature(name); 
15  }

Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource. If the subclass is not annotated then GATE assumes that all its configuration is contained in creole.xml in the usual way.

4.7.3 Mixing the Configuration Styles [#]

It is possible and often useful to mix and match the XML and annotation-driven configuration styles. The rule is always that anything specified in the XML takes priority over the annotations. The following examples show what this allows.

Overriding Configuration for a Third-Party Resource

Suppose you have a plugin from some third party that uses annotation-driven configuration. You don’t have the source code but you would like to override the default value for one of the parameters of one of the plugin’s resources. You can do this in the creole.xml:

<CREOLE-DIRECTORY>  
  <JAR SCAN="true">acmePlugin-1.0.jar</JAR>  
 
  <!-- Add the following to override the annotations -->  
  <RESOURCE>  
    <CLASS>com.acme.plugin.UsefulPR</CLASS>  
    <PARAMETER NAME="listUrl"  
      DEFAULT="resources/myList.txt">java.net.URL</PARAMETER>  
  </RESOURCE>  
</CREOLE-DIRECTORY>

The default value for the listUrl parameter in the annotated class will be replaced by your value.

External AUTOINSTANCEs

For resources like document formats, where there should always and only be one instance in GATE at any time, it makes sense to put the auto-instance definitions in the @CreoleResource annotation. But if the automatically created instances are a convenience rather than a necessity it may be better to define them in XML so other users can disable them without re-compiling the class:

<CREOLE-DIRECTORY>  
  <JAR SCAN="true">myPlugin.jar</JAR>  
 
  <RESOURCE>  
    <CLASS>com.acme.AutoPR</CLASS>  
    <AUTOINSTANCE>  
      <PARAM NAME="type" VALUE="Sentence" />  
    </AUTOINSTANCE>  
    <AUTOINSTANCE>  
      <PARAM NAME="type" VALUE="Paragraph" />  
    </AUTOINSTANCE>  
  </RESOURCE>  
</CREOLE-DIRECTORY>

Inheriting Parameters

If you would prefer to use XML configuration for your own resources, but would like to benefit from the parameter inheritance features of the annotation-driven approach, you can write a normal creole.xml file with all your configuration and just add a blank @CreoleResource annotation to your class. For example:

 
1package com.acme; 
2import gate.*; 
3import gate.creole.metadata.CreoleResource; 
4 
5@CreoleResource 
6public class MyPR implements LanguageAnalyser { 
7  ... 
8}
<!-- creole.xml -->  
<CREOLE-DIRECTORY>  
  <CREOLE>  
    <RESOURCE>  
      <NAME>My Processing Resource</NAME>  
      <CLASS>com.acme.MyPR</CLASS>  
      <COMMENT>...</COMMENT>  
      <PARAMETER NAME="annotationSetName"  
        RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>  
      <!--  
      don’t need to declare document and corpus parameters, they  
      are inherited from LanguageAnalyser  
      -->  
    </RESOURCE>  
  </CREOLE>  
</CREOLE-DIRECTORY>

N.B. Without the @CreoleResource the parameters would not be inherited.

Chapter 5
Language Resources: Corpora, Documents and Annotations [#]

Sometimes in life you’ve got to dance like nobody’s watching.

I think they should introduce ‘sleeping’ to the Olympics. It would be an excellent field event, in which the ‘athletes’ (for want of a better word) all lay down in beds, just beyond where the javelins land, and the first one to fall asleep and not wake up for three hours would win gold. I, for one, would be interested in seeing what kind of personality would be suited to sleeping in a competitive environment.

Life is a mystery to be lived, not a problem to be solved.

Round Ireland with a Fridge, Tony Hawks, 1998 (pp. 119, 147, 179).

This chapter documents GATE’s model of corpora, documents and annotations on documents. Section 5.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 5.2, Section 5.3 and Section 5.4 describe corpora, documents and annotations on documents respectively. Section 5.5 describes GATE’s support for diverse document formats, and Section 5.5.2 describes facilities for XML input/output.

5.1 Features: Simple Attribute/Value Data [#]

GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java’s Map interface (part of the Collections API).

5.2 Corpora: Sets of Documents plus Features [#]

A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stored attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).

Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.

5.3 Documents: Content plus Annotations plus Features [#]

Documents are modelled as content plus annotations (see Section 5.4) plus features (see Section 5.1). The content of a document can be any subclass of DocumentContent.

5.4 Annotations: Directed Acyclic Graphs [#]

Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the sources document, e.g. character offsets.

5.4.1 Annotation Schemas [#]

Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using GATE Developer to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)

Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.4.6 describes how to create new schemas.

Date Schema
<?xml version="1.0"?>  
<schema  
xmlns="http://www.w3.org/2000/10/XMLSchema">  
 <!-- XSchema deffinition for Date-->  
  <element name="Date">  
    <complexType>  
      <attribute name="kind"  use="optional">  
        <simpleType>  
          <restriction base="string">  
            <enumeration value="date"/>  
            <enumeration value="time"/>  
            <enumeration value="dateTime"/>  
          </restriction>  
        </simpleType>  
    </attribute>  
  </complexType>  
 </element>  
</schema>

Person Schema
<?xml version="1.0"?>  
<schema  
xmlns="http://www.w3.org/2000/10/XMLSchema">  
    <!-- XSchema definition for Person-->  
    <element name="Person" />  
</schema>

Address Schema
<?xml version="1.0"?> <schema  
xmlns="http://www.w3.org/2000/10/XMLSchema">  
    <!-- XSchema deffinition for Address-->  
    <element name="Address">  
      <complexType>  
        <attribute name="kind"  use="optional">  
          <simpleType>  
            <restriction base="string">  
              <enumeration value="email"/>  
              <enumeration value="url"/>  
              <enumeration value="phone"/>  
              <enumeration value="ip"/>  
              <enumeration value="street"/>  
              <enumeration value="postcode"/>  
              <enumeration value="country"/>  
              <enumeration value="complete"/>  
            </restriction>  
        </simpleType>  
    </attribute>  
  </complexType>  
</element>  
</schema>

5.4.2 Examples of Annotated Documents [#]

This section shows some simple examples of annotated documents.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML community.

Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character (see TIPSTER Architecture Design Document).

Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.

The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.







Text





Cyndi savored the soup.





^0...^5...^10..^15..^20





Annotations





IdType SpanStartSpan EndFeatures





1 token 0 5 pos=NP





2 token 6 13 pos=VBD





3 token 14 17 pos=DT





4 token 18 22 pos=NN





5 token 22 23





6 name 0 5 name_type=person





7 sentence0 23






Table 5.1: Result of annotation on a single sentence

Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.







Text





Cyndi savored the soup.





^0...^5...^10..^15..^20





Annotations





IdType SpanStartSpan EndFeatures





1 token 0 5 pos=NP





2 token 6 13 pos=VBD





3 token 14 17 pos=DT





4 token 18 22 pos=NN





5 token 22 23





6 name 0 5 name_type=person





7 sentence0 23 constituents=[1],[2],[3].[4],[5]






Table 5.2: Result of annotations including parse information

In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[ Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of an feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents. At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 5.3).







Text





To: All Barnyard Animals





^0...^5...^10..^15..^20.





From: Chicken Little





^25..^30..^35..^40..





Date: November 10,1194





...^50..^55..^60..^65.





Subject: Descending Firmament





.^70..^75..^80..^85..^90..^95





Priority: Urgent





.^100.^105.^110.





The sky is falling. The sky is falling.





....^120.^125.^130.^135.^140.^145.^150.





Annotations





IdType SpanStartSpan EndFeatures





1 Addressee4 24





2 Source 31 45





3 Date 53 69 ddmmyy=101194





4 Subject 78 98





5 Priority 109 115





6 Body 116 155





7 Sentence 116 135





8 Sentence 136 155






Table 5.3: Annotation showing overall document structure

If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields. Our final example (Table 5.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:







Text





Topster tackles 2 terrorbytes.





^0...^5...^10..^15..^20..^25..





Annotations





IdTypeSpanStartSpan EndFeatures





1 token0 7 pos=NP correction=TIPSTER





2 token8 15 pos=VBZ





3 token16 17 pos=CD





4 token18 29 pos=NNS correction=terabytes





5 token29 30






Table 5.4: Annotation modifying the document

5.4.3 Creating, Viewing and Editing Diverse Annotation Types [#]

Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved as inline (‘Save Preserving Format’ in the context menu).

To view and edit annotation types, see Section 3.4. To add annotations of a new type, see Section 3.4.5. To add a new annotation schema, see Section 3.4.6.

5.5 Document Formats [#]

The following document formats are supported by GATE:

By default GATE will try and identify the type of the document, then strip and convert any markup into GATE’s annotation format. To disable this process, set the markupAware parameter on the document to false.

When reading a document of one of these types, GATE extracts the text between tags (where such exist) and create a GATE annotation filled as follows:

The name of the tag will constitute the annotation’s type, all the tags attributes will materialize in the annotation’s features and the annotation will span over the text covered by the tag. A few exceptions of this rule apply for the RTF, Email and Plain Text formats, which will be described later in the input section of these formats.

The text between tags is extracted and appended to the GATE document’s content and all annotations created from tags will be placed into a GATE annotation set named ‘Original markups’.

Example:

If the markup is like this:

<aTagName attrib1="value1" attrib2="value2" attrib3="value3"> A  
piece of text</aTagName>

then the annotation created by GATE will look like:

annotation.type = "aTagName";  
annotation.fm = {attrib1=value1;atrtrib2=value2;attrib3=value3};  
annotation.start = startNode;  
annotation.end = endNode;

The startNode and endNode are created from offsets referring the beginning and the end of ‘A piece of text’ in the document’s content.

The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is the ‘UTF-8’ encoding which is also the most storage efficient one for UNICODE. If, when loading a document in GATE the encoding parameter is set to ‘’(the empty string), then the default encoding of the platform will be used.

5.5.1 Detecting the Right Reader [#]

In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:

The first represents the extension of a file like (xml,htm,html,txt,sgm,rtf, etc), the second represents the HTTP information sent by a web server regarding the content type of the document being send by it (text/html; text/xml, etc), and the third one represents certain sequences of chars which are ultimately number sequences. GATE is capable of supporting multimedia documents, if the right reader is added to the framework. Sometimes, multimedia documents are identified by a signature consisting in a sequence of numbers. Inside GATE they are called magic numbers. For textual documents, certain char sequences form such magic numbers. Examples of magic numbers sequences will be provided in the Input section of each format supported by GATE.

All those tests are applied to each document read, and after that, a voting mechanism decides what is the best reader to associate with the document. There is a degree of priority for all those tests. The document’s extension test has the highest priority. If the system is in doubt which reader to choose, then the one associated with document’s extension will be selected. The next higher priority is given to the web server’s content type and the third one is given to the magic numbers detection. However, any two tests that identify the same mime type, will have the highest priority in deciding the reader that will be used. The web server test is not always successful as there might be documents that are loaded from a local file system, and the magic number detection test is not always applicable. In the next paragraphs we will se how those tests are performed and what is the general mechanism behind reader detection.

The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps filled by the init() method of each reader. This method comes with three signatures:

 
1static public DocumentFormat getDocumentFormat( gate.Document 
2aGateDocument, URL url) 
3 
4static public DocumentFormat getDocumentFormat(gate.Document 
5aGateDocument, String fileSuffix) 
6 
7static public DocumentFormat getDocumentFormat(gate.Document 
8aGateDocument, MimeType mimeType)

The first two methods try to detect the right MimeType for the GATE document, and after that, they call the third one to return the reader associate with a MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from ‘http://jigsaw.w3.org’ for mime types.

The magic numbers test is performed using the information form
magic2mimeTypeMap map. Each key from this map, is searched in the first bufferSize (the default value is 2048) chars of text. The method that does this is called
runMagicNumbers(InputStreamReader aReader) and it belongs to DocumentFormat class. More details about it can be found in the GATE API documentation.

In order to activate a reader to perform the unpacking, the creole definition of a GATE document defines a parameter called ‘markupAware’ initialized with a default value of true. This parameter, forces GATE to detect a proper reader for the document being read. If no reader is found, the document’s content is load and presented to the user, just like any other text editor (this for textual documents).

The next subsections investigates particularities for each format and will describe the file extensions registered with each document format.

5.5.2 XML [#]

Input [#]

GATE permits the processing of any XML document and offers support for XML namespaces. It benefits the power of Apache’s Xerces parser and also makes use of Sun’s JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property (‘javax.xml.parsers.SAXParserFactory’).

GATE will accept any well formed XML document as input. Although it has the possibility to validate XML documents against DTDs it does not do so because the validating procedure is time consuming and in many cases it issues messages that are annoying for the user.

There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we previously said, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases such as ‘end.</P><P>Start’ it might happen that the ending word of the previous annotation is concatenated with the beginning phrase of the annotation currently being created, resulting in a garbage input for GATE processing resources that operate at the text surface.

Let’s take another example in order to better understand the problem:

<title>This is a title</title><p>This is a paragraph</p><a  
href="#link">Here is an useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document’s content will be as follows:

This is a titleThis is a paragraphHere is an useful link

The annotations created will refer the right parts of the texts but for the GATE’s processing resources like (tokenizer, gazetter, etc) which work on this text, this will be a major disaster. Therefore, in order to prevent this problem from happening, GATE checks if it’s likely to join words and if this happens then it inserts a space between those words. So, the text will look like this after loaded in GATE Developer:

This is a title This is a paragraph Here is an useful link

There are cases when these words are meant to be joined, but they are rare. This is why it’s an open problem.

The extensions associate with the XML reader are:

The web server content type associate with xml documents is: text/xml.

The magic numbers test searches inside the document for the XML(<?xml version="1.0") signature. It is also able to detect if the XML document uses the semantics described in the GATE document format DTD (see 5.5.2 below) or uses other semantics.

Output [#]

GATE is capable of ensuring persistence for its resources. The types of persistent storage used for Language Resources are:

We describe the latter case here.

XML persistence doesn’t necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be of all kinds of objects, with various layers of nesting. For example, lists containing lists containing maps, etc. Serializing these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost(the types will be converted into Strings). GATE provides a full serialization of certain types of features such as collections, strings and numbers. It is possible to serialize only those collections containing strings or numbers. The rest of other features are serialized using their string representation and when read back, they will be all strings instead of being the original objects. Consequences of this might be observed when performing evaluations (see Chapter 10).

When GATE outputs an XML document it may do so in one of two ways:

In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-specific one which can be read back by the system to recreate all the information that GATE held internally for the document.

In order to understand why there are two types of XML serialization, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, sometimes it is impossible to save a document as XML using tags that surround the text referred to by the annotation, because tags crossover situations could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.

The problem of crossover tags appears with GATE’s second option (the preserve format one), which is implemented at the cost of losing certain annotations. The way it is applied in GATE is that it tries to restore the original markup and where it is possible, to add in the same manner annotations produced by GATE.

How to Access and Use the Two Forms of XML Serialization

Save as XML Option This option is available in GATE Developer in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling ‘Save as XML’ on each document of the corpus. This option saves all the annotations of a document together their features(applying the restrictions previously discussed), using the GateDocument.dtd :

 <!ELEMENT GateDocument (GateDocumentFeatures,  
           TextWithNodes, (AnnotationSet+))>  
 <!ELEMENT GateDocumentFeatures (Feature+)>  
 <!ELEMENT Feature (Name, Value)>  
 <!ELEMENT Name (\#PCDATA)>  
 <!ELEMENT Value (\#PCDATA)>  
 <!ELEMENT TextWithNodes (\#PCDATA | Node)*>  
 <!ELEMENT AnnotationSet (Annotation*)>  
 <!ATTLIST AnnotationSet  Name CDATA \#IMPLIED>  
 <!ELEMENT Annotation (Feature*)>  
 <!ATTLIST Annotation  Type      CDATA \#REQUIRED  
                       StartNode CDATA \#REQUIRED  
                       EndNode   CDATA \#REQUIRED>  
 <!ELEMENT Node EMPTY>  
 <!ATTLIST Node id CDATA \#REQUIRED>

The document is saved under a name chosen by the user and it may have any extension. However, the recommended extension would be ‘xml’.

Using GATE Embedded, this option is available by calling gate.Document’s toXml() method. This method returns a string which is the XML representation of the document on which the method was called.

Note: It is recommended that the string representation to be saved on the file system using the UTF-8 encoding, as the first line of the string is : <?xml version="1.0" encoding="UTF-8"?>

Example of such a GATE format document:

<?xml version="1.0" encoding="UTF-8" ?>  
<GateDocument>  
 
<!-- The document’s features-->  
 
<GateDocumentFeatures>  
<Feature>  
  <Name className="java.lang.String">MimeType</Name>  
  <Value className="java.lang.String">text/plain</Value>  
</Feature>  
<Feature>  
  <Name className="java.lang.String">gate.SourceURL</Name>  
  <Value className="java.lang.String">file:/G:/tmp/example.txt</Value>  
</Feature>  
</GateDocumentFeatures>  
 
<!-- The document content area with serialized nodes -->  
 
<TextWithNodes>  
<Node id="0"/>A TEENAGER <Node  
id="11"/>yesterday<Node id="20"/> accused his parents of cruelty  
by feeding him a daily diet of chips which sent his weight  
ballooning to 22st at the age of l2<Node id="146"/>.<Node  
id="147"/>  
</TextWithNodes>  
 
<!-- The default annotation set -->  
 
<AnnotationSet>  
<Annotation Type="Date" StartNode="11"  
EndNode="20">  
<Feature>  
  <Name className="java.lang.String">rule2</Name>  
  <Value className="java.lang.String">DateOnlyFinal</Value>  
</Feature> <Feature>  
  <Name className="java.lang.String">rule1</Name>  
  <Value className="java.lang.String">GazDateWords</Value>  
</Feature> <Feature>  
  <Name className="java.lang.String">kind</Name>  
  <Value className="java.lang.String">date</Value>  
</Feature> </Annotation> <Annotation Type="Sentence" StartNode="0"  
EndNode="147"> </Annotation> <Annotation Type="Split"  
StartNode="146" EndNode="147"> <Feature>  
  <Name className="java.lang.String">kind</Name>  
  <Value className="java.lang.String">internal</Value>  
</Feature> </Annotation> <Annotation Type="Lookup" StartNode="11"  
EndNode="20"> <Feature>  
  <Name className="java.lang.String">majorType</Name>  
  <Value className="java.lang.String">date_key</Value>  
</Feature> </Annotation>  
</AnnotationSet>  
 
<!-- Named annotation set -->  
 
<AnnotationSet Name="Original markups" >  
 <Annotation  
Type="paragraph" StartNode="0" EndNode="147"> </Annotation>  
</AnnotationSet>  
</GateDocument>

Note: One must know that all features that are not collections containing numbers or strings or that are not numbers or strings are discarded. With this option, GATE does not preserve those features it cannot restore back.

The Preserve Format Option This option is available in GATE Developer from the popup menu of the annotations table. If no annotation in this table is selected, then the option will restore the document’s original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossed over condition, that annotation is discarded and a message is issued.

This option makes it possible to generate an XML document with tags surrounding the annotation’s referenced text and features saved as attributes. All features which are collections, strings or numbers are saved, and the others are discarded. However, when read back, only the attributes under the GATE namespace (see below) are reconstructed back differently to the others. That is because GATE does not store in the XML document the information about the features class and for collections the class of the items. So, when read back, all features will become strings, except those under the GATE namespace.

One will notice that all generated tags have an attribute called ‘gateId’ under the namespace ‘http://www.gate.ac.uk’. The attribute is used when the document is read back in GATE, in order to restore the annotation’s old ID. This feature is needed because it works in close cooperation with another attribute under the same namespace, called ‘matches’. This attribute indicates annotations/tags that refer the same entity1. They are under this namespace because GATE is sensitive to them and treats them differently to all other elements with their attributes which fall under the general reading algorithm described at the beginning of this section.

The ‘gateId’ under GATE namespace is used to create an annotation which has as ID the value indicated by this attribute. The ‘matches’ attribute is used to create an ArrayList in which the items will be Integers, representing the ID of annotations that the current one matches.

Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and <Person  
gate:gateId="25" gate:matches="23;25;30">John Major</Person> are  
the same person.

What GATE does when it parses this text is it creates two annotations:

a1.type = "Person"  
a1.ID = Integer(23)  
a1.start = <the start offset of  
John>  
a1.end = <the end offset of John>  
a1.featureMap = {}  
 
a2.type = "Person"  
a2.ID = Integer(25)  
a2.start = <the start offset  
of John Major>  
a2.end = <the end offset of John Major>  
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}  

Under GATE Embedded, this option is available by calling gate.Document’s toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction, and for those found to violate it, a warning will be issued and they will be discarded.

In the next subsections we will show how this option applies to the other formats supported by GATE.

5.5.3 HTML [#]

Input

HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE the same way as the XML documents.

The extensions associate with the HTML reader are:

The web server content type associate with html documents is: text/html.

The magic numbers test searches inside the document for the HTML(<html) signature.There are certain HTML documents that do not contain the HTML tag, so the magical numbers test might not hold.

There is a certain degree of customization for HTML documents in that GATE introduces new lines into the document’s text content in order to obtain a readable form. The annotations will refer the pieces of text as described in the original document but there will be a few extra new line characters inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new line (NL) char into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE will introduce one NL at the beginning of the paragraph and one at the end of the paragraph. All newly added NLs are not considered to be part of the text contained by the tag.

Output

The ‘Save as XML’ option works exactly the same for all GATE’s documents so there is no particular observation to be made for the HTML formats.

When attempting to preserve the original markup formatting, GATE will generate the document in xhtml. The html document will look the same with any browser after processed by GATE but it will be in another syntax.

5.5.4 SGML [#]

Input

The SGML support in GATE is fairly light as there is no freely available Java SGML parser. GATE uses a light converter attempting to transform the input SGML file into a well formed XML. Because it does not make use of a DTD, the conversion might not be always good. It is advisable to perform a SGML2XML conversion outside the system(using some other specialized tools) before using the SGML document inside GATE.

The extensions associate with the SGML reader are:

The web server content type associate with xml documents is : text/sgml.

There is no magic numbers test for SGML.

Output

When attempting to preserve the original markup formatting, GATE will generate the document as XML because the real input of a SGML document inside GATE is an XML one.

5.5.5 Plain text [#]

Input

When reading a plain text document, GATE attempts to detect its paragraphs and add ‘paragraph’ annotations to the document’s ‘Original markups’ annotation set. It does that by detecting two consecutive NLs. The procedure works for both UNIX like or DOS like text files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.  
 
Paragraph 2. This text belongs to the second paragraph

then two ‘paragraph’ type annotation will be created in the ‘Original markups’ annotation set (referring the first and second paragraphs ) with an empty feature map.

The extensions associate with the plain text reader are:

The web server content type associate with plain text documents is: text/plain.

There is no magic numbers test for plain text.

Output

When attempting to preserve the original markup formatting, GATE will dump XML markup that surrounds the text refereed.

The procedure described above applies both for plain text and RTF documents.

5.5.6 RTF [#]

Input

Accessing RTF documents is performed by using the Java’s RTF editor kit. It only extracts the document’s text content from the RTF document.

The extension associate with the RTF reader is ‘rtf’.

The web server content type associate with xml documents is : text/rtf.

The magic numbers test searches for {\\rtf1.

Output

Same as the plain tex output.

5.5.7 Email [#]

Input

GATE is able to read email messages packed in one document (UNIX mailbox format). It detects multiple messages inside such documents and for each message it creates annotations for all the fields composing an e-mail, like date, from, to, subject, etc. The message’s body is analyzed and a paragraph detection is performed (just like in the plain text case) . All annotation created have as type the name of the e-mail’s fields and they are placed in the Original markup annotation set.

Example:

From someone@zzz.zzz.zzz Wed Sep  6 10:35:50 2000  
 
Date: Wed, 6 Sep2000 10:35:49 +0100 (BST)  
 
From: forename1 surname2 <someone1@yyy.yyy.xxx>  
 
To: forename2 surname2 <someone2@ddd.dddd.dd.dd>  
 
Subject: A subject  
 
Message-ID: <Pine.SOL.3.91.1000906103251.26010A-100000@servername>  
MIME-Version: 1.0  
Content-Type: TEXT/PLAIN; charset=US-ASCII  
 
This text belongs to the e-mail body....  
 
This is a paragraph in the body of the e-mail  
 
This is another paragraph.

GATE attempts to detect lines such as ‘From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000’ in the e-mail text. Those lines separate e-mail messages contained in one file. After that, for each field in the e-mail message annotations are created as follows:

The annotation type will be the name of the field, the feature map will be empty and the annotation will span from the end of the field until the end of the line containing the e-mail field.

Example:

 
a1.type = "date" a1 spans between the two ^ ^. Date:^ Wed,  
6Sep2000 10:35:49 +0100 (BST)^  
 
a2.type = "from"; a2 spans between the two ^ ^. From:^ forename1  
surname2 <someone1@yyy.yyy.xxx>^

The extensions associated with the email reader are:

The web server content type associate with plain text documents is: text/email.

The magic numbers test searches for keywords like Subject:,etc.

Output

Same as plain text output.

5.6 XML Input/Output [#]

Support for input from and output to XML is described in Section 5.5.2. In short:

When using GATE Embedded, object representations of XML documents such as DOM or jDOM, or query and transformation languages such as X-Path or XSLT, may be used in parallel with GATE’s own Document representation (gate.Document) without conflicts.

Chapter 6
ANNIE: a Nearly-New Information Extraction System [#]

And so the time had passed predictably and soberly enough in work and routine chores, and the events of the previous night from first to last had faded; and only now that both their days’ work was over, the child asleep and no further disturbance anticipated, did the shadowy figures from the masked ball, the melancholy stranger and the dominoes in red, revive; and those trivial encounters became magically and painfully interfused with the treacherous illusion of missed opportunities. Innocent yet ominous questions and vague ambiguous answers passed to and fro between them; and, as neither of them doubted the other’s absolute candour, both felt the need for mild revenge. They exaggerated the extent to which their masked partners had attracted them, made fun of the jealous stirrings the other revealed, and lied dismissively about their own. Yet this light banter about the trivial adventures of the previous night led to more serious discussion of those hidden, scarcely admitted desires which are apt to raise dark and perilous storms even in the purest, most transparent soul; and they talked about those secret regions for which they felt hardly any longing, yet towards which the irrational wings of fate might one day drive them, if only in their dreams. For however much they might belong to one another heart and soul, they knew last night was not the first time they had been stirred by a whiff of freedom, danger and adventure.

Dream Story, Arthur Schnitzler, 1926 (pp. 4-5).

GATE was originally developed in the context of Information Extraction (IE) R&D, and IE systems in many languages and shapes and sizes have been created using GATE with the IE components that have been distributed with it (see [Maynard et al. 00] for descriptions of some of these projects).1

GATE is distributed with an IE system called ANNIE, A Nearly-New IE system (developed by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin Dimitrov and others). ANNIE relies on finite state algorithms and the JAPE language (see Chapter 8).

ANNIE components form a pipeline which appears in figure 6.1.


PIC


Figure 6.1: ANNIE and LaSIE


ANNIE components are included with GATE (though the linguistic resources they rely on are generally more simple than the ones we use in-house). The rest of this chapter describes these components.

6.1 Document Reset [#]

The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the one containing the document format analysis (Original Markups). An optional parameter, keepOriginalMarkupsAS, allows users to decide whether to keep the Original Markups AS or not while reseting the document. This resource is normally added to the beginning of an application, so that a document is reset before an application is rerun on that document.

6.2 Tokeniser [#]

The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable.

6.2.1 Tokeniser Rules

A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the AnnotationSet. The LHS is separated from the RHS by ‘>’. The following operators can be used on the LHS:

| (or)  
* (0 or more occurrences)  
? (0 or 1 occurrences)  
+ (1 or more occurrences)

The RHS uses ‘;’ as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute  
n}={value n}

Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

‘UPPERCASE_LETTER’ ‘LOWERCASE_LETTER’* >  
  Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type ‘Token’. The attribute ‘orth’ (orthography) has the value ‘upperInitial’; the attribute ‘kind’ has the value ‘word’.

6.2.2 Token Types

In the default set of rules, the following kinds of Token and SpaceToken are possible:

Word

A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute ‘orth’, for which four values are defined:

Number

A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.

Symbol

Two types of symbol are defined: currency symbol (e.g. ‘$’, ‘£’) and symbol (e.g. ‘&’, ‘ˆ  ’). These are represented by any number of consecutive currency or other symbols (respectively).

Punctuation

Three types of punctuation are defined: start_punctuation (e.g. ‘(’), end_punctuation (e.g. ‘)’), and other punctuation (e.g. ‘:’). Each punctuation symbol is a separate token.

SpaceToken

White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogeneous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.

6.2.3 English Tokeniser [#]

The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer (see Chapter 8). The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like “ ’30s”, “ ’Cause”, “ ’em”, “ ’N”, “ ’S”, “ ’s”, “ ’T”, “ ’d”, “ ’ll”, “ ’m”, “ ’re”, “ ’til”, “ ve”, etc. Another task of the JAPE transducer is to convert negative constructs like “don’t” from three tokens (“don”, “ ’ “ and “t”) into two tokens (“do” and “n’t”).

The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.

6.3 Gazetteer [#]

The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organisations, days of the week, etc.

Below is a small section of the list for units of currency:

Ecu  
European Currency Units  
FFr  
Fr  
German mark  
German marks  
New Taiwan dollar  
New Taiwan dollars  
NT dollar  
NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type 2. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type. These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.

currency_prefix.lst:currency_unit:pre_amount  
currency_unit.lst:currency_unit:post_amount  
date.lst:date:specific  
day.lst:date:day

So, for example, if a specific day needs to be identified, the minor type ‘day’ should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified,the major type ‘date’ should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.

In addition, the gazetteer allows arbitrary feature values to be associated with particular entries in a single list. ANNIE does not use this capability, but to enable it for your own gazetteers, set the optional gazetteerFeatureSeparator parameter to a single character (or an escape sequence such as \t or \uNNNN) when creating a gazetteer. In this mode, each line in a .lst file can have feature values specified, for example, with the following entry in the index file:

software_company.lst:company:software

the following software_company.lst:

Red Hat&stockSymbol=RHAT  
Apple Computer&abbrev=Apple&stockSymbol=AAPL  
Microsoft&abbrev=MS&stockSymbol=MSFT

and gazetteerFeatureSeparator set to &, the gazetteer will annotate Red Hat as a Lookup with features majorType=company, minorType=software and stockSymbol=RHAT. Note that you do not have to provide the same features for every line in the file, in particular it is possible to provide extra features for some lines in the list but not others.

Here is a full list of the parameters used by the Default Gazetteer:

Init-time parameters

listsURL
A URL pointing to the index file (usually lists.def) that contains the list of pattern lists.
encoding
The character encoding to be used while reading the pattern lists.
gazetteerFeatureSeparator
The character used to add arbitrary features to gazetteer entries. See above for an example.
caseSensitive
Should the gazetteer be case sensitive during matching.

Run-time parameters

document
The document to be processed.
annotationSetName
The name for annotation set where the resulting Lookup annotations will be created.
wholeWordsOnly
Should the gazetteer only match whole words? If set to true, a string segment in the input document will only be matched if it is bordered by characters that are not letters, non spacing marks, or combining spacing marks (as identified by the Unicode standard).
longestMatchOnly
Should the gazetteer only match the longest possible string starting from any position. This parameter is only relevant when the list of lookups contains proper prefixes of other entries (e.g when both ‘Dell’ and ‘Dell Europe’ are in the lists). The default behaviour (when this parameter is set to true) is to only match the longest entry, ‘Dell Europe’ in this example. This is the default GATE gazetteer behaviour since version 2.0. Setting this parameter to false will cause the gazetteer to match all possible prefixes.

6.4 Sentence Splitter [#]

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

Each sentence is annotated with the type Sentence. Each sentence break (such as a full stop) is also given a ‘Split’ annotation. This has several possible types: ‘.’, ‘punctuation’, ‘CR’ (a line break) or ‘multi’ (a series of punctuation marks such as ‘?!?!’.

The sentence splitter is domain and application-independent.

There is an alternative ruleset for the Sentence Splitter which considers newlines and carriage returns differently. In general this version should be used when a new line on the page indicates a new sentence). To use this alternative version, simply load the main-single-nl.jape from the default location instead of main.jape (the default file) when asked to select the location of the grammar file to be used.

6.5 RegEx Sentence Splitter [#]

The RegEx sentence splitter is an alternative to the standard ANNIE Sentence Splitter. Its main aim is to address some performance issues identified in the JAPE-based splitter, mainly do to with improving the execution time and robustness, especially when faced with irregular input.

As its name suggests, the RegEx splitter is based on regular expressions, using the default Java implementation.

The new splitter is configured by three files containing (Java style, see http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html) regular expressions, one regex per line. The three different files encode patterns for:

internal splits
sentence splits that are part of the sentence, such as sentence ending punctuation;
external splits
sentence splits that are NOT part of the sentence, such as 2 consecutive new lines;
non splits
text fragments that might be seen as splits but they should be ignored (such as full stops occurring inside abbreviations).

The new splitter comes with an initial set of patterns that try to emulate the behaviour of the original splitter (apart from the situations where the original one was obviously wrong, like not allowing sentences to start with a number).

Here is a full list of the parameters used by the RegEx Sentence Splitter:

Init-time parameters

encoding
The character encoding to be used while reading the pattern lists.
externalSplitListURL
URL for the file containing the list of external split patterns;
internalSplitListURL
URL for the file containing the list of internal split patterns;
nonSplitListURL
URL for the file containing the list of non split patterns;

Run-time parameters

document
The document to be processed.
outputASName
The name for annotation set where the resulting Split and Sentence annotations will be created.

6.6 Part of Speech Tagger [#]

The tagger [Hepple 00] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in Appendix G. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon_cap), and one for texts in all lowercase (lexicon_lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.

The ANNIE Part-of-Speech tagger requires the following parameters.

If - (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType)

then - New features are added on existing annotations of type ‘baseTokenAnnotationType’.

otherwise - Tagger searches for the annotation of type ‘outputAnnotationType’ under the ‘outputASName’ annotation set that has the same offsets as that of the annotation with type ‘baseTokenAnnotationType’. If it succeeds, it adds new feature on a found annotation, and otherwise, it creates a new annotation of type ‘outputAnnotationType’ under the ‘outputASName’ annotation set.

6.7 Semantic Tagger [#]

ANNIE’s semantic tagger is based on the JAPE language – see Chapter 8. It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.

6.8 Orthographic Coreference (OrthoMatcher) [#]

(Note: this component was previously known as a ‘NameMatcher’.)

The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform coreference. It does not find new named entities as such, but it may assign a type to an unclassified proper name, using the type of a matching name.

The matching rules are only invoked if the names being compared are both of the same type, i.e. both already tagged as (say) organisations, or if one of them is classified as ‘unknown’. This prevents a previously classified name from being recategorised.

6.8.1 GATE Interface

Input – entity annotations, with an id attribute.

Output – matches attributes added to the existing entity annotations.

6.8.2 Resources

A lookup table of aliases is used to record non-matching strings which represent the same entity, e.g. ‘IBM’ and ‘Big Blue’, ‘Coca-Cola’ and ‘Coke’. There is also a table of spurious matches, i.e. matching strings which do not represent the same entity, e.g. ‘BT Wireless’ and ‘BT Cellnet’ (which are two different organizations). The list of tables to be used is a load time parameter of the orthomatcher: a default list is set but can be changed as necessary.

6.8.3 Processing

The wrapper builds an array of the strings, types and IDs of all name annotations, which is then passed to a string comparison function for pairwise comparisons of all entries.

6.9 Pronominal Coreference [#]

The pronominal coreference module performs anaphora resolution using the JAPE grammar formalism. Note that this module is not automatically loaded with the other ANNIE modules, but can be loaded separately as a Processing Resource. The main module consists of three submodules:

The first two modules are helper submodules for the pronominal one, because they do not perform anything related to coreference resolution except the location of quoted fragments and pleonastic it occurrences in text. They generate temporary annotations which are used by the pronominal submodule (such temporary annotations are removed later).

The main coreference module can operate successfully only if all ANNIE modules were already executed. The module depends on the following annotations created from the respective ANNIE modules:

For each pronoun (anaphor) the coreference module generates an annotation of type ‘Coreference’ containing two features:

6.9.1 Quoted Speech Submodule

The quoted speech submodule identifies quoted fragments in the text being analysed. The identified fragments are used by the pronominal coreference submodule for the proper resolution of pronouns such as I, me, my, etc. which appear in quoted speech fragments. The module produces ‘Quoted Text’ annotations.

The submodule itself is a JAPE transducer which loads a JAPE grammar and builds an FSM over it. The FSM is intended to match the quoted fragments and generate appropriate annotations that will be used later by the pronominal module.

The JAPE grammar consists of only four rules, which create temporary annotations for all punctuation marks that may enclose quoted speech, such as ”, ’, ‘, etc. These rules then try to identify fragments enclosed by such punctuation. Finally all temporary annotations generated during the processing, except the ones of type ‘Quoted Text’, are removed (because no other module will need them later).

6.9.2 Pleonastic It Submodule

The pleonastic it submodule matches pleonastic occurrences of ‘it’. Similar to the quoted speech submodule, it is a JAPE transducer operating with a grammar containing patterns that match the most commonly observed pleonastic it constructs.

6.9.3 Pronominal Resolution Submodule

The main functionality of the coreference resolution module is in the pronominal resolution submodule. This uses the result from the execution of the quoted speech and pleonastic it submodules. The module works according to the following algorithm:

6.9.4 Detailed Description of the Algorithm

Full details of the pronominal coreference algorithm are as follows.

Preprocessing

The preprocessing task includes the following subtasks:

Pronoun Resolution

This task includes the following subtasks:

Retrieving all the pronouns in the document. Pronouns are represented as annotations of type ‘Token’ with feature ‘category’ having value ‘PRP$’ or ‘PRP’. The former classifies possessive adjectives such as my, your, etc. and the latter classifies personal, reflexive etc. pronouns. The two types of pronouns are combined in one list and sorted according to their offset in the text.

For each pronoun in the list the following actions are performed:

Coreference Chain Generation

This step is actually performed by the main module. After executing each of the submodules on the current document, the coreference module follows the steps:

The resolution of she, her, her$, he, him, his, herself and himself are similar because an analysis of a corpus showed that these pronouns are related to their antecedents in a similar manner. The characteristics of the resolution process are:

The resolution process performs the following steps:

Resolution of ‘it’, ‘its’, ‘itself’

This set of pronouns also shares many common characteristics. The resolution process contains certain differences with the one for the previous set of pronouns. Successful resolution for it, its, itself is more difficult because of the following factors:

Resolution of ‘I’, ‘me’, ‘my’, ‘myself’

Resolution of these pronouns is dependent on the work of the quoted speech submodule. One important difference from the resolution process of other pronouns is that the context is not measured in sentences but depends solely on the quote span. Another difference is that the context is not contiguous - the quoted fragment itself is excluded from the context, because it is unlikely that an antecedent for I, me, etc. appears there. The context itself consists of:

It is worth noting that contrary to other pronouns, the antecedent for I, me, my and myself is most often cataphoric or if anaphoric it is not in the same sentence with the quoted fragment.

The resolution algorithm consists of the following steps:

6.10 A Walk-Through Example [#]

Let us take an example of a 3-stage procedure using the tokeniser, gazetteer and named-entity grammar. Suppose we wish to recognise the phrase ‘800,000 US dollars’ as an entity of type ‘Number’, with the feature ‘money’.

First of all, we give an example of a grammar rule (and corresponding macros) for money, which would recognise this type of pattern.

 
Macro: MILLION_BILLION  
({Token.string == "m"}|  
{Token.string == "million"}|  
{Token.string == "b"}|  
{Token.string == "billion"}  
)  
 
Macro: AMOUNT_NUMBER  
({Token.kind == number}  
(({Token.string == ","}|  
  {Token.string == "."})  
{Token.kind == number})*  
(({SpaceToken.kind == space})?  
 (MILLION_BILLION)?)  
)  
 
Rule: Money1  
// e.g. 30 pounds  
  (  
      (AMOUNT_NUMBER)  
      (SpaceToken.kind == space)?  
      ({Lookup.majorType == currency_unit})  
  )  
 :money  -->  
  :money.Number = {kind = "money", rule = "Money1"}

6.10.1 Step 1 - Tokenisation

The tokeniser separates this phrase into the following tokens. In general, a word is comprised of any number of letters of either case, including a hyphen, but nothing else; a number is composed of any sequence of digits; punctuation is recognised individually (each character is a separate token), and any number of consecutive spaces and/or control characters are recognised as a single spacetoken.

Token, string = ‘800’, kind = number, length = 3  
Token, string = ‘,’, kind = punctuation, length = 1  
Token, string = ‘000’, kind = number, length = 3  
SpaceToken, string = ‘ ’, kind = space, length = 1  
Token, string = ‘US’, kind = word, length = 2, orth = allCaps  
SpaceToken, string = ‘ ’, kind = space, length = 1  
Token, string = ‘dollars’, kind = word, length = 7, orth = lowercase

6.10.2 Step 2 - List Lookup

The gazetteer lists are then searched to find all occurrences of matching words in the text. It finds the following match for the string ‘US dollars’:

Lookup, minorType = post_amount, majorType = currency_unit

6.10.3 Step 3 - Grammar Rules

The grammar rule for money is then invoked. The macro MILLION_BILLION recognises any of the strings ‘m’, ‘million’, ‘b’, ‘billion’. Since none of these exist in the text, it passes onto the next macro. The AMOUNT_NUMBER macro recognises a number, optionally followed by any number of sequences of the form‘dot or comma plus number’, followed by an optional space and an optional MILLION_BILLION. In this case, ‘800,000’ will be recognised. Finally, the rule Money1 is invoked. This recognises the string identified by the AMOUNT_NUMBER macro, followed by an optional space, followed by a unit of currency (as determined by the gazetteer). In this case, ‘US dollars’ has been identified as a currency unit, so the rule Money1 recognises the entire string ‘800,000 US dollars’. Following the rule, it will be annotated as a Number entity of type Money:

 Number, kind = money, rule = Money1 

Part II
GATE for Advanced Users [#]

Chapter 7
GATE Embedded [#]

7.1 Quick Start with GATE Embedded [#]

Embedding GATE-based language processing in other applications using GATE Embedded (the GATE API) is straightforward:

For example, this code will create the ANNIE extraction system:

 
1  // initialise the GATE library 
2  Gate.init(); 
3 
4  // load ANNIE as an application from a gapp file 
5  SerialAnalyserController controller = (SerialAnalyserController) 
6    PersistenceManager.loadObjectFromFile(new File(new File( 
7      Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR), 
8        ANNIEConstants.DEFAULT_FILE));

If you want to use resources from any plugins, you need to load the plugins before calling createResource:

 
1  Gate.init(); 
2 
3  // need Tools plugin for the Morphological analyser 
4  Gate.getCreoleRegister().registerDirectories( 
5    new File(Gate.getPluginsHome(), "Tools").toURL() 
6  ); 
7 
8  ... 
9 
10  ProcessingResource morpher = (ProcessingResource) 
11    Factory.createResource("gate.creole.morph.Morph");

Instead of creating your processing resources individually using the Factory, you can create your application in GATE Developer, save it using the ‘save application state’ option (see Section 3.8.3), and then load the saved state from your code. This will automatically reload any plugins that were loaded when the state was saved, you do not need to load them manually.

 
1  Gate.init(); 
2 
3  CorpusController controller = (CorpusController) 
4    PersistenceManager.loadObjectFromFile(new File("savedState.xgapp")); 
5 
6  // loadObjectFromUrl is also available

There are many examples of using GATE Embedded available at http://gate.ac.uk/wiki/code-repository/.

7.2 Resource Management in GATE Embedded [#]

As outlined earlier, GATE defines three different types of resources:

Language Resources
: (LRs) entities that hold linguistic data.
Processing Resources
: (PRs) entities that process data.
Visual Resources
: (VRs) components used for building graphical interfaces.

These resources are collectively named CREOLE1 resources.

All CREOLE resources have some associated meta-data in the form of an entry in a special XML file named creole.xml. The most important role of that meta-data is to specify the set of parameters that a resource understands, which of them are required and which not, if they have default values and what those are. The valid parameters for a resource are described in the resource’s section of its creole.xml file or in Java annotations on the resource class – see Section 4.7.

All resource types have creation-time parameters that are used during the initialisation phase. Processing Resources also have run-time parameters that get used during execution (see Section 7.5 for more details).

Controllers are used to define GATE applications and have the role of controlling the execution flow (see Section 7.6 for more details).

This section describes how to create and delete CREOLE resources as objects in a running Java virtual machine. This process involves using GATE’s Factory class2, and, in the case of LRs, may also involve using a DataStore.

CREOLE resources are Java Beans; creation of a resource object involves using a default constructor, then setting parameters on the bean, then calling an init() method. The Factory takes care of all this, makes sure that the GATE Developer GUI is told about what is happening (when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. A programmer using GATE Embedded should never call the constructor of a resource: always use the Factory!

Creating a resource involves providing the following information:

  Parameters and features need to be provided in the form of a GATE Feature Map which is essentially a java Map (java.util.Map) implementation, see Section 7.4.2 for more details on Feature Maps.

Creating a resource via the Factory involves passing values for any create-time parameters that require setting to the Factory’s createResource method. If no parameters are passed, the defaults are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:

 
1Gate.getCreoleRegister().registerDirectories(new File( 
2  Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL()); 
3FeatureMap params = Factory.newFeatureMap(); // empty map: default parameters 
4ProcessingResource tagger = (ProcessingResource) 
5  Factory.createResource("gate.creole.POSTagger", params);

Note that if the resource created here had any parameters that were both mandatory and had no default value, the createResource call would throw an exception. In this case, all the information needed to create a tagger is available in default values given in the tagger’s XML definition (in plugins/ANNIE/creole.xml):

<RESOURCE>  
  <NAME>ANNIE POS Tagger</NAME>  
  <COMMENT>Mark Hepple’s Brill-style POS tagger</COMMENT>  
  <CLASS>gate.creole.POSTagger</CLASS>  
  <PARAMETER NAME="document"  
    COMMENT="The document to be processed"  
    RUNTIME="true">gate.Document</PARAMETER>  
....  
  <PARAMETER NAME="rulesURL" DEFAULT="resources/heptag/ruleset"  
    COMMENT="The URL for the ruleset file"  
    OPTIONAL="true">java.net.URL</PARAMETER>  
</RESOURCE>

Here the two parameters shown are either ‘runtime’ parameters, which are set before a PR is executed, or have a default value (in this case the default rules file is distributed with GATE itself).

When creating a Document, however, the URL of the source for the document must be provided3. For example:

 
1URL u = new URL("http://gate.ac.uk/hamish/"); 
2FeatureMap params = Factory.newFeatureMap(); 
3params.put("sourceUrl", u); 
4Document doc = (Document) 
5  Factory.createResource("gate.corpora.DocumentImpl", params);

Note that the document created here is transient: when you quit the JVM the document will no longer exist. If you want the document to be persistent, you need to store it in a DataStore (see Section 7.4.5).

Apart from createResource() methods with different signatures, Factory also provides some shortcuts for common operations, listed in table 7.1.




Method

Purpose



newFeatureMap()

Creates a new Feature Map (as used in the example above).



newDocument(String content)

Creates a new GATE Document starting from a String value that will be used to generate the document content.



newDocument(URL sourceUrl)

Creates a new GATE Document using the text pointed by an URL to generate the document content.



newDocument(URL sourceUrl, String encoding)

Same as above but allows the specification of an encoding to be used while downloading the document content.



newCorpus(String name)

creates a new GATE Corpus with a specified name.




Table 7.1: Factory Operations

GATE maintains various data structures that allow the retrieval of loaded resources. When a resource is no longer required, it needs to be removed from those structures in order to remove all references to it, thus making it a candidate for garbage collection. This is achieved using the  deleteResource(Resource res) method on Factory.

Simply removing all references to a resource from the user code will NOT be enough to make the resource collect-able. Not calling  Factory.deleteResource() will lead to memory leaks!

7.3 Using CREOLE Plugins [#]

As shown in the examples above, in order to use a CREOLE resource the relevant CREOLE plugin must be loaded. Processing Resources, Visual Resources and Language Resources other than Document, Corpus and DataStore all require that the appropriate plugin is first loaded. When using Document, Corpus or DataStore, you do not need to first load a plugin. The following API calls listed in table 7.2 are relevant to working with CREOLE plugins.




Class gate.Gate




Method

Purpose



public static void addKnownPlugin(URL pluginURL)

adds the plugin to the list of known plugins.



public static void removeKnownPlugin(URL pluginURL)

tells the system to ‘forget’ about one previously known directory. If the specified directory was loaded, it will be unloaded as well - i.e. all the metadata relating to resources defined by this directory will be removed from memory.



public static void addAutoloadPlugin(URL pluginUrl)

adds a new directory to the list of plugins that are loaded automatically at start-up.



public static void removeAutoloadPlugin(URL pluginURL)

tells the system to remove a plugin URL from the list of plugins that are loaded automatically at system start-up. This will be reflected in the user’s configuration data file.



Class gate.CreoleRegister




public void registerDirectories(URL directoryUrl)

loads a new CREOLE directory. The new plugin is added to the list of known plugins if not already there.



public void removeDirectory(URL directory)

unloads a loaded CREOLE plugin.




Table 7.2: Calls Relevant to CREOLE Plugins

7.4 Language Resources [#]

This section describes the implementation of documents and corpora in GATE.

7.4.1 GATE Documents

Documents are modelled as content plus annotations (see Section 7.4.4) plus features (see Section 7.4.2).

The content of a document can be any implementation of the  gate.DocumentContent interface; the features are <attribute, value> pairs stored a Feature Map. Attributes are String values while the values can be any Java object.

The annotations are grouped in sets (see section 7.4.3). A document has a default (anonymous) annotations set and any number of named annotations sets.

Documents are defined by the gate.Document interface and there is also a provided implementation:

gate.corpora.DocumentImpl
: transient document. Can be stored persistently through Java serialisation.

Main Document functions are presented in table 7.3.




Content Manipulation




Method

Purpose



DocumentContent getContent()

Gets the Document content.



void edit(Long start, Long end, DocumentContent replacement)

Modifies the Document content.



void setContent(DocumentContent newContent)

Replaces the entire content.



Annotations Manipulation




Method

Purpose



public AnnotationSet getAnnotations()

Returns the default annotation set.



public AnnotationSet getAnnotations(String name)

Returns a named annotation set.



public Map getNamedAnnotationSets()

Returns all the named annotation sets.



void removeAnnotationSet(String name)

Removes a named annotation set.



Input Output




String toXml()

Serialises the Document in XML format.



String toXml(Set aSourceAnnotationSet, boolean includeFeatures)

Generates XML from a set of annotations only, trying to preserve the original format of the file used to create the document.




Table 7.3: gate.Document methods.

7.4.2 Feature Maps [#]

All CREOLE resources as well as the Controllers and the annotations can have attached meta-data in the form of Feature Maps.

A Feature Map is a Java Map (i.e. it implements the java.util.Map interface) and holds <attribute-name, attribute-value> pairs. The attribute names are Strings while the values can be any Java Objects.

The use of non-Serialisable objects as values is strongly discouraged.

Feature Maps are created using the gate.Factory.newFeatureMap() method.

The actual implementation for FeatureMaps is provided by the  gate.util.SimpleFeatureMapImpl class.

Objects that have features in GATE implement the gate.util.FeatureBearer interface which has only the two accessor methods for the object features: FeatureMap getFeatures() and void setFeatures(FeatureMap features).

G¯ etting a particular feature from an object

 
1Object obj; 
2String featureName = "length"; 
3if(obj instanceof FeatureBearer){ 
4  FeatureMap features = ((FeatureBearer)obj).getFeatures(); 
5  Object value = (features == null) ? null : 
6                                      features.get(featureName); 
7}

7.4.3 Annotation Sets [#]

A GATE document can have one or more annotation layers — an anonymous one, (also called default), and as many named ones as necessary.

An annotation layer is organised as a Directed Acyclic Graph (DAG) on which the nodes are particular locations —anchors— in the document content and the arcs are made out of annotations reaching from the location indicated by the start node to the one pointed by the end node (see Figure 7.1 for an illustration). Because of the graph metaphor, the annotation layers are also called annotation graphs. In terms of Java objects, the annotation layers are represented using the Set paradigm as defined by the collections library and they are hence named annotation sets. The terms of annotation layer, graph and set are interchangeable and refer to the same concept when used in this book.


PIC

Figure 7.1: The Annotation Graph model.


An annotation set holds a number of annotations and maintains a series of indices in order to provide fast access to the contained annotations.

The GATE Annotation Sets are defined by the gate.AnnotationSet interface and there is a default implementation provided:

gate.annotation.AnnotationSetImpl
annotation set implementation used by transient documents.

The annotation sets are created by the document as required. The first time a particular annotation set is requested from a document it will be transparently created if it doesn’t exist.

Tables 7.4 and 7.5 list the most used Annotation Set functions.




Annotations Manipulation




Method

Purpose



Integer add(Long start, Long end, String type, FeatureMap features)

Creates a new annotation between two offsets, adds it to this set and returns its id.



Integer add(Node start, Node end, String type, FeatureMap features)

Creates a new annotation between two nodes, adds it to this set and returns its id.



boolean remove(Object o)

Removes an annotation from this set.



Nodes




Method

Purpose



Node firstNode()

Gets the node with the smallest offset.



Node lastNode()

Gets the node with the largest offset.



Node nextNode(Node node)

Get the first node that is relevant for this annotation set and which has the offset larger than the one of the node provided.



Set implementation




Iterator iterator()



int size()




Table 7.4: gate.AnnotationSet methods (general purpose).




Searching




AnnotationSet get(Long offset)

Select annotations by offset. This returns the set of annotations whose start node is the least such that it is less than or equal to offset. If a positional index doesn’t exist it is created. If there are no nodes at or beyond the offset parameter then it will return null.



AnnotationSet get(Long startOffset, Long endOffset)

Select annotations by offset. This returns the set of annotations that overlap totally or partially with the interval defined by the two provided offsets. The result will include all the annotations that either:

  • start before the start offset and end strictly after it
  • start at a position between the start and the end offsets



AnnotationSet get(String type)

Returns all annotations of the specified type.



AnnotationSet get(Set types)

Returns all annotations of the specified types.



AnnotationSet get(String type, FeatureMap constraints)

Selects annotations by type and features.



Set getAllTypes()

Gets a set of java.lang.String objects representing all the annotation types present in this annotation set.




Table 7.5: gate.AnnotationSet methods (searching).

I
¯ terating from left to right over all annotations of a given type

 

 
1AnnotationSet annSet = ...; 
2String type = "Person"; 
3//Get all person annotations 
4AnnotationSet persSet = annSet.get(type); 
5//Sort the annotations 
6List persList = new ArrayList(persSet); 
7Collections.sort(persList, new gate.util.OffsetComparator()); 
8//Iterate 
9Iterator persIter = persList.iterator(); 
10while(persIter.hasNext()){ 
11... 
12}

7.4.4 Annotations [#]

An annotation, is a form of meta-data attached to a particular section of document content. The connection between the annotation and the content it refers to is made by means of two pointers that represent the start and end locations of the covered content. An annotation must also have a type (or a name) which is used to create classes of similar annotations, usually linked together by their semantics.

An Annotation is defined by:

start node
a location in the document content defined by an offset.
end node
a location in the document content defined by an offset.
type
a String value.
features
(see Section 7.4.2).
ID
an Integer value. All annotations IDs are unique inside an annotation set.

In GATE Embedded, annotations are defined by the gate.Annotation interface and implemented by the gate.annotation.AnnotationImpl class. Annotations exist only as members of annotation sets (see Section 7.4.3) and they should not be directly created by means of a constructor. Their creation should always be delegated to the containing annotation set.

7.4.5 GATE Corpora [#]

A corpus in GATE is a Java List (i.e. an implementation of java.util.List) of documents. GATE corpora are defined by the gate.Corpus interface and the following implementations are available:

gate.corpora.CorpusImpl
used for transient corpora.
gate.corpora.SerialCorpusImpl
used for persistent corpora that are stored in a serial datastore (i.e. as a directory in a file system).

Apart from implementation for the standard List methods, a Corpus also implements the methods in table 7.6.




Method

Purpose



String getDocumentName(int index)

Gets the name of a document in this corpus.



List getDocumentNames()

Gets the names of all the documents in this corpus.



void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories)

Fills this corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the Gate distribution (gate.util.ExtensionFileFilter).




Table 7.6: gate.Corpus methods.

Creating a corpus from all XML files in a directory

 
1Corpus corpus = Factory.newCorpus("My XML Files"); 
2File directory = ...; 
3ExtensionFileFilter filter = new ExtensionFileFilter("XML files", "xml"); 
4URL url = directory.toURL(); 
5corpus.populate(url, filter, null, false);

Using a DataStore

Assuming that you have a DataStore already open called myDataStore, this code will ask the datastore to take over persistence of your document, and to synchronise the memory representation of the document with the disk storage:

Document persistentDoc = myDataStore.adopt(doc, mySecurity);  
myDataStore.sync(persistentDoc);

When you want to restore a document (or other LR) from a datastore, you make the same createResource call to the Factory as for the creation of a transient resource, but this time you tell it the datastore the resource came from, and the ID of the resource in that datastore:

 
1  URL u = ....; // URL of a serial datastore directory 
2  SerialDataStore sds = new SerialDataStore(u.toString()); 
3  sds.open(); 
4 
5  // getLrIds returns a list of LR Ids, so we get the first one 
6  Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0); 
7 
8  // we need to tell the factory about the LRs ID in the data 
9  // store, and about which datastore it is in - we do this 
10  // via a feature map: 
11  FeatureMap features = Factory.newFeatureMap(); 
12  features.put(DataStore.LR_ID_FEATURE_NAME, lrId); 
13  features.put(DataStore.DATASTORE_FEATURE_NAME, sds); 
14 
15  // read the document back 
16  Document doc = (Document) 
17    Factory.createResource("gate.corpora.DocumentImpl", features);

7.5 Processing Resources [#]

Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers.

They are created using the GATE Factory in manner similar the Language Resources. Besides the creation-time parameters they also have a set of run-time parameters that are set by the system just before executing them.

Analysers are a particular type of processing resources in the sense that they always have a document and a corpus among their run-time parameters.

The most used methods for Processing Resources are presented in table 7.7




Method

Purpose



void setParameterValue(String paramaterName, Object parameterValue)

Sets the value for a specified parameter. method inherited from gate.Resource



void setParameterValues(FeatureMap parameters)

Sets the values for more parameters in one step. method inherited from gate.Resource



Object getParameterValue(String paramaterName)

Gets the value of a named parameter of this resource. method inherited from gate.Resource



Resource init()

Initialise this resource, and return it. method inherited from gate.Resource



void reInit()

Reinitialises the processing resource. After calling this method the resource should be in the state it is after calling init. If the resource depends on external resources (such as rules files) then the resource will re-read those resources. If the data used to create the resource has changed since the resource has been created then the resource will change too after calling reInit().



void execute()

Starts the execution of this Processing Resource.



void interrupt()

Notifies this PR that it should stop its execution as soon as possible.



boolean isInterrupted()

Checks whether this PR has been interrupted since the last time its Executable.execute() method was called.




Table 7.7: gate.ProcessingResource methods.

7.6 Controllers [#]

Controllers are used to create GATE applications. A Controller handles a set of Processing Resources and can execute them following a particular strategy. GATE provides a series of serial controllers (i.e. controllers that run their PRs in sequence):

gate.creole.SerialController:
a serial controller that takes any kind of PRs.
gate.creole.SerialAnalyserController:
a serial controller that only accepts Language Analysers as member PRs.
gate.creole.ConditionalSerialController:
a serial controller that accepts all types of PRs and that allows the inclusion or exclusion of member PRs from the execution chain according to certain run-time conditions (currently features on the document being processed are used).
gate.creole.ConditionalSerialAnalyserController:
a serial controller that only accepts Language Analysers and that allows the conditional run of member PRs.

C
¯ reating an ANNIE application and running it over a corpus

 
1// load the ANNIE plugin 
2Gate.getCreoleRegister().registerDirectories(new File( 
3 Gate.getPluginsHome(), "ANNIE").toURI().toURL()); 
4 
5// create a serial analyser controller to run ANNIE with 
6SerialAnalyserController annieController = 
7 (SerialAnalyserController) Factory.createResource( 
8     "gate.creole.SerialAnalyserController", 
9     Factory.newFeatureMap(), 
10     Factory.newFeatureMap(), "ANNIE"); 
11 
12// load each PR as defined in ANNIEConstants 
13for(int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) { 
14  // use default parameters 
15  FeatureMap params = Factory.newFeatureMap(); 
16  ProcessingResource pr = (ProcessingResource) 
17      Factory.createResource(ANNIEConstants.PR_NAMES[i], 
18                             params); 
19  // add the PR to the pipeline controller 
20  annieController.add(pr); 
21} // for each ANNIE PR 
22 
23// Tell ANNIEs controller about the corpus you want to run on 
24Corpus corpus = ...; 
25annieController.setCorpus(corpus); 
26// Run ANNIE 
27annieController.execute();

7.7 Persistent Applications [#]

GATE Embedded allows the persistent storage of applications in a format based on XML serialisation. This is particularly useful for applications management and distribution. A developer can save the state of an application when he/she stops working on its design and continue developing it in a next session. When the application reaches maturity it can be deployed to the client site using the same method.

When an application (i.e. a Controller) is saved, GATE will actually only save the values for the parameters used to create the Processing Resources that are contained in the application. When the application is reloaded, all the PRs will be re-created using the saved parameters.

Many PRs use external resources (files) to define their behaviour and, in most cases, these files are identified using URLs. During the saving process, all the URLs are converted relative URLs based on the location of the application file. This way, if the resources are packaged together with the application file, the entire application can be reliably moved to a different location.

API access to application saving and loading is provided by means of two static methods on the gate.util.persistence.PersistenceManager class, listed in table 7.8.




Method

Purpose



public static void saveObjectToFile(Object obj, File file)

Saves the data needed to re-create the provided GATE object to the specified file. The Object provided can be any type of Language or Processing Resource or a Controller. The procedures may work for other types of objects as well (e.g. it supports most Collection types).



public static Object loadObjectFromFile(File file)

Parses the file specified (which needs to be a file created by the above method) and creates the necessary object(s) as specified by the data in the file. Returns the root of the object tree.




Table 7.8: Application Saving and Loading

S
¯ aving and loading a GATE application

 
1//Where to save the application? 
2File file = ...; 
3//What to save? 
4Controller theApplication = ...; 
5 
6//save 
7gate.util.persistence.PersistenceManager. 
8          saveObjectToFile(theApplication, file); 
9//delete the application 
10Factory.deleteResource(theApplication); 
11theApplication = null; 
12 
13[...] 
14//load the application back 
15theApplication = gate.util.persistence.PersistenceManager. 
16                 loadObjectFromFile(file);

7.8 Ontologies

Starting from GATE version 3.1, support for ontologies has been added. Ontologies are nominally Language Resources but are quite different from documents and corpora and are detailed in chapter 14.

Classes related to ontologies are to be found in the gate.creole.ontology package and its sub-packages. The top level package defines an abstract API for working with ontologies while the sub-packages contain concrete implementations. A client program should only use the classes and methods defined in the API and never any of the classes or methods from the implementation packages.

The entry point to the ontology API is the gate.creole.ontology.Ontology interface which is the base interface for all concrete implementations. It provides methods for accessing the class hierarchy, listing the instances and the properties.

Ontology implementations are available through plugins. Before an ontology language resource can be created using the gate.Factory and before any of the classes and methods in the API can be used, one of the implementing ontology plugins must be loaded. For details see chapter 14.

7.9 Creating a New Annotation Schema [#]

An annotation schema (see Section 3.4.6) can be brought inside GATE through the creole.xml file. By using the AUTOINSTANCE element, one can create instances of resources defined in creole.xml. The gate.creole.AnnotationSchema (which is the Java representation of an annotation schema file) initializes with some predefined annotation definitions (annotation schemas) as specified by the GATE team.

Example from GATE’s internal creole.xml (in src/gate/resources/creole):

<!-- Annotation schema -->  
<RESOURCE>  
  <NAME>Annotation schema</NAME>  
  <CLASS>gate.creole.AnnotationSchema</CLASS>  
  <COMMENT>An annotation type and its features</COMMENT>  
  <PARAMETER NAME="xmlFileUrl" COMMENT="The url to the definition file"  
    SUFFIXES="xml;xsd">java.net.URL</PARAMETER>  
  <AUTOINSTANCE>  
    <PARAM NAME ="xmlFileUrl" VALUE="schema/AddressSchema.xml" />  
  </AUTOINSTANCE>  
  <AUTOINSTANCE>  
    <PARAM NAME ="xmlFileUrl" VALUE="schema/DateSchema.xml" />  
  </AUTOINSTANCE>  
  <AUTOINSTANCE>  
    <PARAM NAME ="xmlFileUrl" VALUE="schema/FacilitySchema.xml" />  
  </AUTOINSTANCE>  
  <!-- etc. -->  
</RESOURCE>

In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one must use the gate.Factory class;

 
1FeatureMap params = new FeatureMap();\\ 
2param.put("xmlFileUrl",annotSchemaFile.toURL());\\ 
3AnnotationSchema annotSchema = \\ 
4Factory.createResurce("gate.creole.AnnotationSchema", params);

Note: All the elements and their values must be written in lower case, as XML is defined as case sensitive and the parser used for XML Schema inside GATE searches is case sensitive.

In order to be able to write XML Schema definitions, the ones defined in GATE (resources/creole/schema) can be used as a model, or the user can have a look at http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements used.

Some examples of annotation schemas are given in Section 5.4.1.

7.10 Creating a New CREOLE Resource [#]

To create a new resource you need to:

GATE Developer helps you with this process by creating a set of directories and files that implement a basic resource, including a Java code file and a Makefile. This process is called ‘bootstrapping’.

For example, let’s create a new component called GoldFish, which will be a Processing Resource that looks for all instances of the word ‘fish’ in a document and adds an annotation of type ‘GoldFish’.

First start GATE Developer (see Section 2.2). From the ‘Tools’


PIC


Figure 7.2: BootStrap Wizard Dialogue


menu select ‘BootStrap Wizard’, which will pop up the dialogue in figure 7.2. The meaning of the data entry fields:

Now we need to compile the class and package it into a JAR file. The bootstrap wizard creates an Ant build file that makes this very easy – so long as you have Ant set up properly, you can simply run

ant jar

This will compile the Java source code and package the resulting classes into GoldFish.jar. If you don’t have your own copy of Ant, you can use the one bundled with GATE - suppose your GATE is installed at /opt/gate-5.0-snapshot, then you can use /opt/gate-5.0-snapshot/bin/ant jar to build.

You can now load this resource into GATE; see Section 3.6. The default Java code that was created for our GoldFish resource looks like this:

 
1/* 
2 *  GoldFish.java 
3 * 
4 *  You should probably put a copyright notice here. Why not use the 
5 *  GNU licence? (See http://www.gnu.org/.) 
6 * 
7 *  hamish, 26/9/2001 
8 * 
9 *  $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $ 
10 */ 
11 
12package sheffield.creole.example; 
13 
14import java.util.*; 
15import gate.*; 
16import gate.creole.*; 
17import gate.util.*; 
18 
19/** 
20 * This class is the implementation of the resource GOLDFISH. 
21 */ 
22@CreoleResource(name = "GoldFish", 
23        comment = "Add a descriptive comment about this resource") 
24public class GoldFish extends AbstractProcessingResource 
25  implements ProcessingResource { 
26 
27 
28} // class GoldFish

The default XML configuration for GoldFish looks like this:

<!-- creole.xml GoldFish -->  
<!--  hamish, 26/9/2001 -->  
<!-- $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $ -->  
 
<CREOLE-DIRECTORY>  
  <JAR SCAN="true">GoldFish.jar</JAR>  
</CREOLE-DIRECTORY>

The directory structure containing these files


PIC


Figure 7.3: BootStrap directory tree


is shown in figure 7.3. GoldFish.java lives in the src/sheffield/creole/example directory. creole.xml and build.xml are in the top GoldFish directory. The lib directory is for libraries; the classes directory is where Java class files are placed; the doc directory is for documentation. These last two, plus GoldFish.jar are created by Ant.

This process has the advantage that it creates a complete source tree and build structure for the component, and the disadvantage that it creates a complete source tree and build structure for the component. If you already have a source tree, you will need to chop out the bits you need from the new tree (in this case GoldFish.java and creole.xml) and copy it into your existing one.

See the example code at http://gate.ac.uk/wiki/code-repository/.

7.11 Adding Support for a New Document Format [#]

In order to add a new document format, one needs to extend the gate.DocumentFormat class and to implement an abstract method called:

 
1public void unpackMarkup(Document doc) throws 
2 DocumentFormatException

This method is supposed to implement the functionality of each format reader and to create annotations on the document. Finally the document’s old content will be replaced with a new one containing only the text between markups.

If one needs to add a new textual reader will extend the gate.corpora.TextualDocumentFormat and override the unpackMarkup(doc) method.

This class needs to be implemented under the Java bean specifications because it will be instantiated by GATE using Factory.createResource() method.

The init() method that one needs to add and implement is very important because in here the reader defines its means to be selected successfully by GATE. What one needs to do is to add some specific information into certain static maps defined in DocumentFormat class, that will be used at reader detection time.

After that, a definition of the reader will be placed into the one’s creole.xml file and the reader will be available to GATE.

We present for the rest of the section a complete three step example of adding such a reader. The reader we describe in here is an XML reader.

Step 1

Create a new class called XmlDocumentFormat that extends gate.corpora.TextualDocumentFormat.

Step 2

Implement the unpackMarkup(Document doc) which performs the required functionality for the reader. Add XML detection means in init() method:

 
1public Resource init() throws ResourceInstantiationException{ 
2  // Register XML mime type 
3  MimeType mime = new MimeType("text","xml"); 
4  // Register the class handler for this mime type 
5  mimeString2ClassHandlerMap.put(mime.getType()+ "/" + mime.getSubtype(), 
6                                                                        this); 
7  // Register the mime type with mine string 
8  mimeString2mimeTypeMap.put(mime.getType() + "/" + mime.getSubtype(), mime); 
9  // Register file suffixes for this mime type 
10  suffixes2mimeTypeMap.put("xml",mime); 
11  suffixes2mimeTypeMap.put("xhtm",mime); 
12  suffixes2mimeTypeMap.put("xhtml",mime); 
13  // Register magic numbers for this mime type 
14  magic2mimeTypeMap.put("<?xml",mime); 
15  // Set the mimeType for this language resource 
16  setMimeType(mime); 
17  return this; 
18}// init()

More details about the information from those maps can be found in Section 5.5.1

Step 3

Add the following creole definition in the creole.xml document.

    <RESOURCE>  
      <NAME>My XML Document Format</NAME>  
      <CLASS>mypackage.XmlDocumentFormat</CLASS>  
      <AUTOINSTANCE/>  
      <PRIVATE/>  
    </RESOURCE>

More information on the operation of GATE’s document format analysers may be found in Section 5.5.

7.12 Using GATE Embedded in a Multithreaded Environment [#]

GATE Embedded can be used in multithreaded applications, so long as you observe a few restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your application, typically in the application startup phase before any concurrent processing threads are started.

Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading plugins) in more than one thread at a time. Again, you would typically load all the plugins your application requires at initialisation time. It is safe to create instances of resources in multiple threads concurrently.

Thirdly, it is important to note that individual GATE processing resources, language resources and controllers are by design not thread safe – it is not possible to use a single instance of a controller/PR/LR in multiple threads at the same time – but for a well written resource it should be possible to use several different instances of the same resource at once, each in a different thread. When writing your own resource classes you should bear the following in mind, to ensure that your resource will be useable in this way.

Of course, if you are writing a PR that is simply a wrapper around an external library that imposes these kinds of limitations there is only so much you can do. If your resource cannot be made safe you should document this fact clearly.

All the standard ANNIE PRs are safe when independent instances are used in different threads concurrently, as are the standard transient document, transient corpus and controller classes. A typical pattern of development for a multithreaded GATE-based application is:

7.13 Using GATE Embedded within a Spring Application [#]

GATE Embedded provides helper classes to allow GATE resources to be created and managed by the Spring framework. For Spring 2.0 or later, GATE Embedded provides a custom namespace handler that makes them extremely easy to use. To use this namespace, put the following declarations in your bean definition file:

<beans xmlns="http://www.springframework.org/schema/beans"  
       xmlns:gate="http://gate.ac.uk/ns/spring"  
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
       xsi:schemaLocation="  
         http://www.springframework.org/schema/beans  
         http://www.springframework.org/schema/beans/spring-beans.xsd  
         http://gate.ac.uk/ns/spring  
         http://gate.ac.uk/ns/spring.xsd">

You can have Spring initialise GATE:

  <gate:init gate-home="WEB-INF" user-config-file="WEB-INF/user.xml">  
    <gate:preload-plugins>  
      <value>WEB-INF/ANNIE</value>  
      <value>http://example.org/gate-plugin</value>  
    </gate:preload-plugins>  
  </gate:init>

To create a GATE resource, use the <gate:resource> element.

  <gate:resource id="sharedOntology" scope="singleton"  
          resource-class="gate.creole.ontology.owlim.OWLIMOntologyLR">  
    <gate:parameters>  
      <entry key="rdfXmlURL">  
        <value type="org.springframework.core.io.Resource"  
          >WEB-INF/ontology.rdf</value>  
      </entry>  
    </gate:parameters>  
  </gate:resource>

If you are familiar with Spring you will see that <gate:parameters> uses the same format as the standard <map> element, but values whose type is a Spring Resource will be converted to URLs before being passed to the GATE resource.

You can load a GATE saved application with

  <gate:saved-application location="WEB-INF/application.gapp" scope="prototype">  
    <gate:customisers>  
      <gate:set-parameter pr-name="custom transducer" name="ontology"  
                          ref="sharedOntology" />  
    </gate:customisers>  
  </gate:saved-application>

‘Customisers’ are used to customise the application after it is loaded. In the example above, we load a singleton copy of an ontology which is then shared between all the separate instances of the (prototype) application. The <gate:set-parameter> customiser accepts all the same ways to provide a value as the standard Spring <property> element (a ”value” or ”ref” attribute, or a sub-element - <value>, <list>, <bean>, <gate:resource> …).

The <gate:add-pr> customiser provides support for the case where most of the application is in a saved state, but we want to create one or two extra PRs with Spring (maybe to inject other Spring beans as init parameters) and add them to the pipeline.

  <gate:saved-application ...>  
    <gate:customisers>  
      <gate:add-pr add-before="OrthoMatcher" ref="myPr" />  
    </gate:customisers>  
  </gate:saved-application>

By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline, but an add-before or add-after attribute can be used to specify the name of a PR before (or after) which this PR should be placed. Alternatively, an index attribute places the PR at a specific (0-based) index into the pipeline. The PR to add can be specified either as a ‘ref’ attribute, or with a nested <bean> or <gate:resource> element.

These custom elements all define various factory beans. For full details, see the JavaDocs for gate.util.spring (the factory beans) and gate.util.spring.xml (the gate: namespace handler).

Note: the former approach using factory methods of the gate.util.spring.SpringFactory class will still work, but should be considered deprecated in favour of the new factory beans.

7.14 Using GATE Embedded within a Tomcat Web Application [#]

Embedding GATE in a Tomcat web application involves several steps.

  1. Put the necessary JAR files (gate.jar and all or most of the jars in gate/lib) in your webapp/WEB-INF/lib.
  2. Put the plugins that your application depends on in a suitable location (e.g. webapp/WEB-INF/plugins).
  3. Create suitable gate.xml configuration files for your environment.
  4. Set the appropriate paths in your application before calling Gate.init().

This process is detailed in the following sections.

7.14.1 Recommended Directory Structure

You will need to create a number of other files in your web application to allow GATE to work:

In this guide, we assume the following layout:

webapp/  
  WEB-INF/  
    gate.xml  
    user-gate.xml  
    plugins/  
      ANNIE/  
      etc.

7.14.2 Configuration Files

Your gate.xml (the ‘site-wide configuration file’) should be as simple as possible:

<?xml version="1.0" encoding="UTF-8" ?>  
<GATE>  
  <GATECONFIG Save_options_on_exit="false"  
              Save_session_on_exit="false" />  
</GATE>

Similarly, keep the user-gate.xml (the ‘user config file’) simple:

<?xml version="1.0" encoding="UTF-8" ?>  
<GATE>  
  <GATECONFIG Known_plugin_path=";"  
              Load_plugin_path=";" />  
</GATE>

This way, you can control exactly which plugins are loaded in your webapp code.

7.14.3 Initialization Code

Given the directory structure shown above, you can initialize GATE in your web application like this:

 
1// imports 
2... 
3public class MyServlet extends HttpServlet { 
4  private static boolean gateInited = false; 
5 
6  public void init() throws ServletException { 
7    if(!gateInited) { 
8      try { 
9        ServletContext ctx = getServletContext(); 
10 
11        // use /path/to/your/webapp/WEB-INF as gate.home 
12        File gateHome = new File(ctx.getRealPath("/WEB-INF")); 
13 
14        Gate.setGateHome(gateHome); 
15        // thus webapp/WEB-INF/plugins is the plugins directory, and 
16        // webapp/WEB-INF/gate.xml is the site config file. 
17 
18        // Use webapp/WEB-INF/user-gate.xml as the user config file, to avoid 
19        // confusion with your own user config. 
20        Gate.setUserConfigFile(new File(gateHome, "user-gate.xml")); 
21 
22        Gate.init(); 
23        // load plugins, for example... 
24        Gate.getCreoleRegister().registerDirectories( 
25          ctx.getResource("/WEB-INF/plugins/ANNIE")); 
26 
27        gateInited = true; 
28      } 
29      catch(Exception ex) { 
30        throw new ServletException("Exception initialising GATE", 
31                                   ex); 
32      } 
33    } 
34  } 
35}

Once initialized, you can create GATE resources using the Factory in the usual way (for example, see Section 7.1 for an example of how to create an ANNIE application). You should also read Section 7.12 for important notes on using GATE Embedded in a multithreaded application.

Instead of an initialization servlet you could also consider doing your initialization in a ServletContextListener, or using Spring (see Section 7.13).

7.15 Groovy Scripting for GATE [#]

Groovy is a dynamic programming language based on Java. You can use it as a scripting language for GATE, via the Groovy Console. Groovy is documented at http://groovy.codehaus.org/.

Groovy support is intended for users with programing skills, and with some knowledge of the GATE API. Groovy support is not enabled in GATE by default. In order to enable it, you must download and install Groovy in GATE, as follows.

  1. Download the Groovy distribution from http://groovy.codehaus.org/Download. GATE has been tested with Groovy 1.6.4
  2. Unzip the Groovy distribution - it doesn’t matter where, you do not need all of it.
  3. From the unzipped distribution, copy groovy-1.x.y/embeddable/groovy-all-1.x.y.jar in to the gate/lib directory in your GATE installation (where 1.x.y is the version number of Groovy).
  4. Groovy support will be enabled next time you start GATE.

Groovy support is currently available via a Groovy Console. To use it, open the console using the ‘Groovy Console’ item in the GATE tools menu. You can use then use the Groovy Console to write, load, and execute the Groovy language, as described in the documentation at http://groovy.codehaus.org/.

To help scripting GATE in Groovy, the following variable bindings are available from the Groovy Console.

Here’s an example script. It finds all documents with a feature ‘annotator’ set to ‘fred’, and puts them in a new corpus called ‘fredsDocs’.

 
1 
2factory.newCorpus("fredsDocs").addAll( 
3  docs.findAll{ 
4    it.getFeatures().get("annotator").equals("fred") 
5  } 
6)

Why won’t the ‘Groovy executing’ dialog go away? Sometimes, when you execute a Groovy script through the console, a dialog will appear, saying ‘Groovy is executing. Please wait’. The dialog fails to go away even when the script has ended, and cannot be closed by clicking the ‘Interrupt’ button. You can, however, continue to use the Groovy Console, and the dialog will usually go away next time you run a script. This is not a GATE problem: it is a Groovy problem.

7.16 Saving Config Data to gate.xml

Arbitrary feature/value data items can be saved to the user’s gate.xml file via the following API calls:

To get the config data: Map configData = Gate.getUserConfig().

To add config data simply put pairs into the map: configData.put("my new config key", "value");.

To write the config data back to the XML file: Gate.writeUserConfig();.

Note that new config data will simply override old values, where the keys are the same. In this way defaults can be set up by putting their values in the main gate.xml file, or the site gate.xml file; they can then be overridden by the user’s gate.xml file.

7.17 Annotation merging through the API [#]

If we have annotations about the same subject on the same document from different annotators, we may need to merge those annotations to form a unified annotation. Two approaches for merging annotations are implemented in the API, via static methods in the class gate.util.AnnotationMerging.

The two methods have very similar input and output parameters. Each of the methods takes an array of annotation sets, which should be the same annotation type on the same document from different annotators, as input. A single feature can also be specified as a parameter (or given asnull if no feature is to be specified).

The output is a map, the key of which is one merged annotation and the value of which represents the annotators (in terms of the indices of the array of annotation sets) who support the annotation. The methods also have a boolean input parameter to indicate whether or not the annotations from different annotators are based on the same set of instances, which can be determined by the static method public boolean isSameInstancesForAnnotators(AnnotationSet[] annsA) in the class gate.util.IaaCalculation. One instance corresponds to all the annotations with the same span. If the annotation sets are based on the same set of instances, the merging methods will ensure that the merged annotations are on the same set of instances.

The two methods corresponding to those described for the Annotation Merging plugin described in Section 19.11. They are:

Chapter 8
JAPE: Regular Expressions over Annotations [#]

If Osama bin Laden did not exist, it would be necessary to invent him. For the past four years, his name has been invoked whenever a US president has sought to increase the defence budget or wriggle out of arms control treaties. He has been used to justify even President Bush’s missile defence programme, though neither he nor his associates are known to possess anything approaching ballistic missile technology. Now he has become the personification of evil required to launch a crusade for good: the face behind the faceless terror.

The closer you look, the weaker the case against Bin Laden becomes. While the terrorists who inflicted Tuesday’s dreadful wound may have been inspired by him, there is, as yet, no evidence that they were instructed by him. Bin Laden’s presumed guilt appears to rest on the supposition that he is the sort of man who would have done it. But his culpability is irrelevant: his usefulness to western governments lies in his power to terrify. When billions of pounds of military spending are at stake, rogue states and terrorist warlords become assets precisely because they are liabilities.

The need for dissent, George Monbiot, The Guardian, Tuesday September 18, 2001.

JAPE is a Java Annotation Patterns Engine. JAPE provides finite state transduction over annotations based on regular expressions. JAPE is a version of CPSL – Common Pattern Specification Language1. This chapter introduces JAPE, and outlines the functionality available. (You can find an excellent tutorial here; thanks to Dhaval Thakker, Taha Osmin and Phil Lakin).

JAPE allows you to recognise regular expressions in annotations on documents. Hang on, there’s something wrong here: a regular language can only describe sets of strings, not graphs, and GATE’s model of annotations is based on graphs. Hmmm. Another way of saying this: typically, regular expressions are applied to character strings, a simple linear sequence of items, but here we are applying them to a much more complex data structure. The result is that in certain cases the matching process is non-deterministic (i.e. the results are dependent on random factors like the addresses at which data is stored in the virtual machine): when there is structure in the graph being matched that requires more than the power of a regular automaton to recognise, JAPE chooses an alternative arbitrarily. However, this is not the bad news that it seems to be, as it turns out that in many useful cases the data stored in annotation graphs in GATE (and other language processing systems) can be regarded as simple sequences, and matched deterministically with regular expressions.

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand-side (LHS) of the rules consist of an annotation pattern description. The right-hand-side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements. Consider the following example:

 
Phase: Jobtitle  
Input: Lookup  
Options: control = appelt debug = true  
 
Rule: Jobtitle1  
(  
 {Lookup.majorType == jobtitle}  
 (  
  {Lookup.majorType == jobtitle}  
 )?  
)  
:jobtitle  
-->  
 :jobtitle.JobTitle = {rule = "JobTitle1"}  
 

The LHS is the part preceding the ‘-->’ and the RHS is the part following it. The LHS specifies a pattern to be matched to the annotated GATE document, whereas the RHS specifies what is to be done to the matched text. In this example, we have a rule entitled ‘Jobtitle1’, which will match text annotated with a ‘Lookup’ annotation with a ‘majorType’ feature of ‘jobtitle’, followed optionally by further text annotated as a ‘Lookup’ with ‘majorType’ of ‘jobtitle’. Once this rule has matched a sequence of text, the entire sequence is allocated a label by the rule, and in this case, the label is ‘jobtitle’. On the RHS, we refer to this span of text using the label given in the LHS; ‘jobtitle’. We say that this text is to be given an annotation of type ‘JobTitle’ and a ‘rule’ feature set to ‘JobTitle1’.

We began the JAPE grammar by giving it a phase name; ‘Phase: Jobtitle’. JAPE grammars can be cascaded, and so each grammar is considered to be a ‘phase’ (see Section 8.5). We also provide a list of the annotation types we will use in the grammar. In this case, we say ‘Input: Lookup’ because the only annotation type we use on the LHS are Lookup annotations. If no annotations are defined, all annotations will be matched.

Then, several options are set:

A wide range of functionality can be used with JAPE, making it a very powerful system. Section 8.1 gives an overview of some common LHS tasks. Section 8.2 talks about the various operators available for use on the LHS. After that, Section 8.3 outlines RHS functionality. Section 8.4 talks about priority and Section 8.5 talks about phases. Section 8.6 talks about using Java code on the RHS, which is the main way of increasing the power of the RHS. We conclude the chapter with some miscellaneous JAPE-related topics of interest.

8.1 The Left-Hand Side [#]

The LHS of a JAPE grammar aims to match the text span to be annotated, whilst avoiding undesirable matches. There are various tools available to enable you to do this. This section outlines how you would approach various common tasks on the LHS of your JAPE grammar.

8.1.1 Matching a Simple Text String

To match a simple text string, you need to refer to a feature on an annotation that contains the string; for example,

{Token.string == "of"}

The following grammar shows a sequence of strings being matched. Bracketing, along with the ‘or’ operator, is used to define how the strings should come together:

Phase: UrlPre  
Input:  Token SpaceToken  
Options: control = appelt  
 
Rule: Urlpre  
 
( (({Token.string == "http"} |  
  {Token.string == "ftp"})  
 {Token.string == ":"}  
 {Token.string == "/"}  
         {Token.string == "/"}  
        ) |  
({Token.string == "www"}  
         {Token.string == "."}  
        )  
):urlpre  
-->  
:urlpre.UrlPre = {rule = "UrlPre"}

Alternatively you could use the ‘string’ metaproperty. See Section 8.1.4 for an example of using metaproperties.

8.1.2 Matching Entire Annotation Types

You can specify the presence of an annotation previously assigned from a gazetteer, tokeniser, or other module. For example, the following will match a Lookup annotation:

{Lookup}

The following will match if there is not a Lookup annotation at this location:

{!Lookup}

The following rule shows several different annotation types being matched. We also see a string being matched, and again, the use of the ‘or’ operator:

Rule: Known  
Priority: 100  
(  
 {Location}|  
 {Person}|  
 {Date}|  
 {Organization}|  
 {Address}|  
 {Money} |  
 {Percent}|  
 {Token.string == "Dear"}|  
 {JobTitle}|  
 {Lookup}  
):known  
-->  
{}

8.1.3 Using Attributes and Values

You can specify the attributes (and values) of an annotation to be matched. Several operators are supported; see Section 8.2 for full details:

In the following rule, the ‘category’ feature of the ‘Token’ annotation is used, along with the ‘equals’ operator:

Rule: Unknown  
Priority: 50  
(  
 {Token.category == NNP}  
)  
:unknown  
-->  
 :unknown.Unknown = {kind = "PN", rule = Unknown}

8.1.4 Using Meta-Properties [#]

In addition to referencing annotation features, JAPE allows access to other ‘meta-properties’ of an annotation. This is done by using an ‘@’ symbol rather than a ‘.’ symbol after the annotation type name. The three meta-properties that are built in are:

{X@length > "5"}:label-->:label.New = {}

At this time, you cannot access the value of a ‘meta-property’ from a non-java RHS of a rule, e.g. you can’t write:

{X@length > "5"}:label-->:label.New = {somefeat = :label.X@length }

We hope to add this at some point.

8.1.5 Multiple Pattern/Action Pairs

It is also possible to have more than one pattern and corresponding action, as shown in the rule below. On the LHS, each pattern is enclosed in a set of round brackets and has a unique label; on the RHS, each label is associated with an action. In this example, the Lookup annotation is labelled ‘jobtitle’ and is given the new annotation JobTitle; the TempPerson annotation is labelled ‘person’ and is given the new annotation ‘Person’.

Rule: PersonJobTitle  
Priority: 20  
 
(  
 {Lookup.majorType == jobtitle}  
):jobtitle  
(  
 {TempPerson}  
):person  
-->  
    :jobtitle.JobTitle = {rule = "PersonJobTitle"},  
    :person.Person = {kind = "personName", rule = "PersonJobTitle"}

Similarly, labelled patterns can be nested, as in the example below, where the whole pattern is annotated as Person, but within the pattern, the jobtitle is annotated as JobTitle.

Rule: PersonJobTitle2  
Priority: 20  
 
(  
(  
 {Lookup.majorType == jobtitle}  
):jobtitle  
 {TempPerson}  
):person  
-->  
    :jobtitle.JobTitle = {rule = "PersonJobTitle"},  
    :person.Person = {kind = "personName", rule = "PersonJobTitle"}

8.1.6 LHS Macros [#]

Macros allow you to create a definition that can then be used multiple times in your JAPE rules. In the following JAPE grammar, we have a cascade of macros used. The macro ‘AMOUNT_NUMBER’ makes use of the macros ‘MILLION_BILLION’ and ‘NUMBER_WORDS’, and the rule ‘MoneyCurrencyUnit’ then makes use of ‘AMOUNT_NUMBER’:

Phase: Number  
Input: Token Lookup  
Options: control = appelt  
 
Macro: MILLION_BILLION  
({Token.string == "m"}|  
{Token.string == "million"}|  
{Token.string == "b"}|  
{Token.string == "billion"}|  
{Token.string == "bn"}|  
{Token.string == "k"}|  
{Token.string == "K"}  
)  
 
Macro: NUMBER_WORDS  
(  
 (({Lookup.majorType == number}  
   ({Token.string == "-"})?  
  )*  
   {Lookup.majorType == number}  
   {Token.string == "and"}  
 )*  
 ({Lookup.majorType == number}  
  ({Token.string == "-"})?  
 )*  
   {Lookup.majorType == number}  
)  
 
Macro: AMOUNT_NUMBER  
(({Token.kind == number}  
  (({Token.string == ","}|  
    {Token.string == "."}  
   )  
   {Token.kind == number}  
  )*  
  |  
  (NUMBER_WORDS)  
 )  
 (MILLION_BILLION)?  
)  
 
Rule: MoneyCurrencyUnit  
  (  
      (AMOUNT_NUMBER)  
      ({Lookup.majorType == currency_unit})  
  )  
:number -->  
  :number.Money = {kind = "number", rule = "MoneyCurrencyUnit"}

8.1.7 Using Context [#]

Context can be dealt with in the grammar rules in the following way. The pattern to be annotated is always enclosed by a set of round brackets. If preceding context is to be included in the rule, this is placed before this set of brackets. This context is described in exactly the same way as the pattern to be matched. If context following the pattern needs to be included, it is placed after the label given to the annotation. Context is used where a pattern should only be recognised if it occurs in a certain situation, but the context itself does not form part of the pattern to be annotated.

For example, the following rule for Time (assuming an appropriate macro for ‘year’) would mean that a year would only be recognised if it occurs preceded by the words ‘in’ or ‘by’:

Rule: YearContext1  
 
({Token.string == "in"}|  
 {Token.string == "by"}  
)  
(YEAR)  
:date -->  
 :date.Timex = {kind = "date", rule = "YearContext1"}

Similarly, the following rule (assuming an appropriate macro for ‘email’) would mean that an email address would only be recognised if it occurred inside angled brackets (which would not themselves form part of the entity):

Rule: Emailaddress1  
({Token.string == ‘<’})  
(  
 (EMAIL)  
)  
:email  
({Token.string == ‘>’})  
-->  
 :email.Address= {kind = "email", rule = "Emailaddress1"}

Also, it is possible to specify the constraint that one annotation must start at the same place as another. For example:

Rule: SurnameStartingWithDe  
(  
  {Token.string == "de",  
   Lookup.majorType == "name",  
   Lookup.minorType == "surname"}  
):de  
-->  
 :de.Surname = {prefix = "de"}

This rule would match anywhere where a Token with string ‘de’ and a Lookup with majorType ‘name’ and minorType ‘surname’ start at the same offset in the text. Both the Lookup and Token annotations would be included in the :de binding, so the Surname annotation generated would span the longer of the two. Constraints on the same annotation type must be satisfied by a single annotation, so in this example there must be a single Lookup matching both the major and minor types – the rule would not match if there were two different lookups at the same location, one of them satisfying each constraint.

It is important to remember that context is consumed by the rule, so it cannot be reused in another rule within the same phase. So, for example, right context cannot be used as left context for another rule.

8.1.8 Multi-Constraint Statements [#]

In the examples we have seen so far, most statements have contained only one constraint. For example, in this statement, the ‘category’ of ‘Token’ must equal ‘NNP’:

Rule: Unknown  
Priority: 50  
(  
 {Token.category == NNP}  
)  
:unknown  
-->  
 :unknown.Unknown = {kind = "PN", rule = Unknown}

However, it is equally acceptable to have multiple constraints in a statement. In this example, the ‘majorType’ of ‘Lookup’ must be ‘name’ and the ‘minorType’ must be ‘surname’:

Rule: Surname  
(  
  {Lookup.majorType == "name",  
   Lookup.minorType == "surname"}  
):surname  
-->  
 :surname.Surname = {}

As we saw in Section 8.1.7, the constraints may refer to different annotations. In this example, in addition to the constraints on the ‘majorType’ and ‘minorType’ of ‘Lookup’, we also have a constraint on the ‘string’ of ‘Token’:

Rule: SurnameStartingWithDe  
(  
  {Token.string == "de",  
   Lookup.majorType == "name",  
   Lookup.minorType == "surname"}  
):de  
-->  
 :de.Surname = {prefix = "de"}

8.1.9 Negation [#]

All the examples in the preceding sections involve constraints that require the presence of certain annotations to match. JAPE also supports ‘negative’ constraints which specify the absence of annotations. A negative constraint is signalled in the grammar by a ‘!’ character.

Negative constraints are generally used in combination with positive ones to constrain the locations at which the positive constraint can match. For example:

Rule: PossibleName  
(  
 {Token.orth == "upperInitial", !Lookup}  
):name  
-->  
 :name.PossibleName = {}

This rule would match any uppercase-initial Token, but only where there is no Lookup annotation starting at the same location. The general rule is that a negative constraint matches at any location where the corresponding positive constraint would not match. Negative constraints do not contribute any annotations to the bindings - in the example above, the :name binding would contain only the Token annotation. The exception to this is when a negative constraint is used alone, without any positive constraints in the combination. In this case it binds all the annotations at the match position that do not match the constraint. Thus, {!Lookup} would bind all the annotations starting at this location except Lookups. In most cases, negative constraints should only be used in combination with positive ones.

Any constraint can be negated, for example:

Rule: SurnameNotStartingWithDe  
(  
 {Surname, !Token.string ==~ "[Dd]e"}  
):name  
-->  
 :name.NotDe = {}

This would match any Surname annotation that does not start at the same place as a Token with the string ‘de’ or ‘De’. Note that this is subtly different from {Surname, Token.string !=~ "[Dd]e"}, as the second form requires a Token annotation to be present, whereas the first form (!Token...) will match if there is no Token annotation at all at this location.2

Although JAPE provides an operator to look for the absence of a single annotation type, there is no support for a general negative operator to prevent a rule from firing if a particular sequence of annotations is found. One solution to this is to create a ‘negative rule’ which has higher priority than the matching ‘positive rule’. The style of matching must be Appelt for this to work. To create a negative rule, simply state on the LHS of the rule the pattern that should NOT be matched, and on the RHS do nothing. In this way, the positive rule cannot be fired if the negative pattern matches, and vice versa, which has the same end result as using a negative operator. A useful variation for developers is to create a dummy annotation on the RHS of the negative rule, rather than to do nothing, and to give the dummy annotation a rule feature. In this way, it is obvious that the negative rule has fired. Alternatively, use Java code on the RHS to print a message when the rule fires. An example of a matching negative and positive rule follows. Here, we want a rule which matches a surname followed by a comma and a set of initials. But we want to specify that the initials shouldn’t have the POS category PRP (personal pronoun). So we specify a negative rule that will fire if the PRP category exists, thereby preventing the positive rule from firing.

Rule: NotPersonReverse  
Priority: 20  
// we don’t want to match ’Jones, I’  
(  
 {Token.category == NNP}  
 {Token.string == ","}  
 {Token.category == PRP}  
)  
:foo  
-->  
{}  
 
Rule:   PersonReverse  
Priority: 5  
// we want to match ‘Jones, F.W.’  
 
(  
 {Token.category == NNP}  
 {Token.string == ","}  
 (INITIALS)?  
)  
:person -->

8.1.10 Escaping Special Characters

To specify a single or double quote as a string, precede it with a backslash, e.g.

{Token.string=="\""}

will match a double quote. For other special characters, such as ‘$’, enclose it in double quotes, e.g.

{Token.category == "PRP\$"}

8.2 LHS Operators in Detail [#]

This section gives more detail on the behaviour of the matching operators used on the left-hand side of JAPE rules.

8.2.1 Compositional Operators [#]

Compositional operators are used to combine matching constructions in the manner intended. Union and Kleene operators are available, as is range notation.

Union and Kleene Operators

The following union and Kleene operators are available:

In the following example, you can see the ‘|’ and ‘?’ operators being used:

Rule: LocOrganization  
Priority: 50  
 
(  
 ({Lookup.majorType == location} |  
  {Lookup.majorType == country_adj})  
{Lookup.majorType == organization}  
({Lookup.majorType == organization})?  
)  
:orgName -->  
  :orgName.TempOrganization = {kind = "orgName", rule=LocOrganization}

Range Notation [#]

A range notation can also be added. e.g.

({Token})[1,3]

matches one to three Tokens in a row.

({Token.kind == number})[3]

matches exactly 3 number Tokens in a row.

8.2.2 Matching Operators [#]

Matching operators are used to specify how matching must take place between a specification and an annotation in the document. Equality (‘==’ and ‘!=’) and comparison (‘<’, ‘<=’, ‘>=’ and ‘>’) operators can be used, as can regular expression matching and contextual operators (‘contains’ and ‘within’).

Equality Operators

The equality operators are ‘==’ and ‘!=’. The basic operator in JAPE is equality. {Lookup.majorType == "person"} matches a Lookup annotation whose majorType feature has the value ‘person’. Similarly {Lookup.majorType != "person"} would match any Lookup whose majorType feature does not have the value ‘person’. If a feature is missing it is treated as if it had an empty string as its value, so this would also match a Lookup annotation that did not have a majorType feature at all.

Certain type coercions are performed:

The != operator matches exactly when == doesn’t.

Comparison Operators

The comparison operators are ‘<’, ‘<=’, ‘>=’ and ‘>’. Comparison operators have their expected meanings, for example {Token.length > 3} matches a Token annotation whose length attribute is an integer greater than 3. The behaviour of the operators depends on the type of the constraint’s attribute:

Regular Expression Operators [#]

The regular expression operators are ‘=~’, ‘==~’, ‘!~’ and ‘!=~’. These operators match regular expressions. {Token.string =~ "[Dd]ogs"} matches a Token annotation whose string feature contains a substring that matches the regular expression [Dd]ogs, using !~ would match if the feature value does not contain a substring that matches the regular expression. The ==~ and !=~ operators are like =~ and !~ respectively, but require that the whole value match (or not match) the regular expression3. As with ==, missing features are treated as if they had the empty string as their value, so the constraint {Identifier.name ==~ "(?i)[aeiou]*"} would match an Identifier annotation which does not have a name feature, as well as any whose name contains only vowels.

The matching uses the standard Java regular expression library, so full details of the pattern syntax can be found in the JavaDoc documentation for java.util.regex.Pattern. There are a few specific points to note:

Contextual Operators [#]

The contextual Operators are ‘contains’ and ‘within’. These operators match annotations within the context of other annotations.

For either operator, the right-hand value (Y in the above examples) can be a full constraint itself. For example {X contains {Y.foo==bar}} is also accepted. The operators can be used in a multi-constraint statement (see Section 8.1.8) just like any of the traditional ones, so {X.f1 != "something", X contains {Y.foo==bar}} is valid.

Custom Operators [#]

It is possible to add additional custom operators without modifying the JAPE language. There are new init-time parameters to Transducer so that additional annotation ‘meta-property’ accessors and custom operators can be referenced at runtime. To add a custom operator, write a class that implements gate.jape.constraint.ConstraintPredicate, and then list that class name for the Transducer’s ‘operators’ property. Similarly, to add a custom ‘meta-property’ accessor, write a class that implements gate.jape.constraint.AnnotationAccessor, and then list that class name in the Transducer’s ‘annotationAccessors’ property.

8.3 The Right-Hand Side [#]

The RHS of the rule contains information about the annotation to be created/manipulated. Information about the text span to be annotated is transferred from the LHS of the rule using the label just described, and annotated with the entity type (which follows it). Finally, attributes and their corresponding values are added to the annotation. Alternatively, the RHS of the rule can contain Java code to create or manipulate annotations, see Section 8.6.

8.3.1 A Simple Example

In the simple example below, the pattern described will be awarded an annotation of type ‘Enamex’ (because it is an entity name). This annotation will have the attribute ‘kind’, with value ‘location’, and the attribute ‘rule’, with value ‘GazLocation’. (The purpose of the ‘rule’ attribute is simply to ease the process of manual rule validation).

Rule: GazLocation  
(  
{Lookup.majorType == location}  
)  
:location -->  
 :location.Enamex = {kind="location", rule=GazLocation}

8.3.2 Copying Feature Values from the LHS to the RHS

JAPE provides limited support for copying annotation feature values from the left to the right hand side of a rule, for example:

Rule: LocationType  
 
(  
 {Lookup.majorType == location}  
):loc  
-->  
    :loc.Location = {rule = "LocationType", type = :loc.Lookup.minorType}

This will set the ‘type’ feature of the generated location to the value of the ‘minorType’ feature from the ‘Lookup’ annotation bound to the loc label. If the Lookup has no minorType, the Location will have no ‘type’ feature. The behaviour of newFeat = :bind.Type.oldFeat is:

Notice that the behaviour is deliberately underspecified if there is more than one Type annotation in bind. If you need more control, or if you want to copy several feature values from the same left hand side annotation, you should consider using Java code on the right hand side of your rule (see Section 8.6).

8.3.3 RHS Macros

Macros, first introduced in the context of the left-hand side (Section 8.1.6) can also be used on the RHS of rules. In this case, the label (which matches the label on the LHS of the rule) should be included in the macro. Below we give an example of using a macro on the RHS:

Macro: UNDERSCORES_OKAY          // separate  
:match                                              // lines  
{  
    gate.AnnotationSet matchedAnns = (gate.AnnotationSet)bindings.get("match");  
 
    int begOffset = matchedAnns.firstNode().getOffset().intValue();  
    int endOffset = matchedAnns.lastNode().getOffset().intValue();  
    String mydocContent = doc.getContent().toString();  
    String matchedString = mydocContent.substring(begOffset, endOffset);  
 
    gate.FeatureMap newFeatures = Factory.newFeatureMap();  
 
    if(matchedString.equals("Spanish"))     {  
     newFeatures.put("myrule",  "Lower");  
    }  
    else    {  
     newFeatures.put("myrule",  "Upper");  
    }  
 
    newFeatures.put("quality",  "1");  
    annotations.add(matchedAnns.firstNode(), matchedAnns.lastNode(),  
                              "Spanish_mark", newFeatures);  
}  
 
Rule: Lower  
(  
    ({Token.string == "Spanish"})  
:match)-->UNDERSCORES_OKAY   // no label here, only macro name  
 
Rule: Upper  
(  
    ({Token.string == "SPANISH"})  
:match)-->UNDERSCORES_OKAY   // no label here, only macro name  
 
 

8.4 Use of Priority [#]

Each grammar has one of 5 possible control styles: ‘brill’, ‘all’, ‘first’, ‘once’ and ‘appelt’. This is specified at the beginning of the grammar.

The Brill style means that when more than one rule matches the same region of the document, they are all fired. The result of this is that a segment of text could be allocated more than one entity type, and that no priority ordering is necessary. Brill will execute all matching rules starting from a given position and will advance and continue matching from the position in the document where the longest match finishes.

The ‘all’ style is similar to Brill, in that it will also execute all matching rules, but the matching will continue from the next offset to the current one.

For example, where [] are annotations of type Ann

[aaa[bbb]] [ccc[ddd]]

then a rule matching {Ann} and creating {Ann-2} for the same spans will generate:

BRILL: [aaabbb] [cccddd]  
ALL: [aaa[bbb]] [ccc[ddd]]

With the ‘first’ style, a rule fires for the first match that’s found. This makes it inappropriate for rules that end in ‘+’ or ‘?’ or ‘*’. Once a match is found the rule is fired; it does not attempt to get a longer match (as the other two styles do).

With the ‘once’ style, once a rule has fired, the whole JAPE phase exits after the first match.

With the appelt style, only one rule can be fired for the same region of text, according to a set of priority rules. Priority operates in the following way.

  1. From all the rules that match a region of the document starting at some point X, the one which matches the longest region is fired.
  2. If more than one rule matches the same region, the one with the highest priority is fired
  3. If there is more than one rule with the same priority, the one defined earlier in the grammar is fired.

An optional priority declaration is associated with each rule, which should be a positive integer. The higher the number, the greater the priority. By default (if the priority declaration is missing) all rules have the priority -1 (i.e. the lowest priority).

For example, the following two rules for location could potentially match the same text.

Rule:   Location1  
Priority: 25  
 
(  
 ({Lookup.majorType == loc_key, Lookup.minorType == pre}  
  {SpaceToken})?  
 {Lookup.majorType == location}  
 ({SpaceToken}  
  {Lookup.majorType == loc_key, Lookup.minorType == post})?  
)  
:locName -->  
  :locName.Location = {kind = "location", rule = "Location1"}  
 
 
Rule: GazLocation  
Priority: 20  
  (  
  ({Lookup.majorType == location}):location  
  )  
  -->   :location.Name = {kind = "location", rule=GazLocation}

Assume we have the text ‘China sea’, that ‘China’ is defined in the gazetteer as ‘location’, and that sea is defined as a ‘loc_key’ of type ‘post’. In this case, rule Location1 would apply, because it matches a longer region of text starting at the same point (‘China sea’, as opposed to just ‘China’). Now assume we just have the text ‘China’. In this case, both rules could be fired, but the priority for Location1 is highest, so it will take precedence. In this case, since both rules produce the same annotation, so it is not so important which rule is fired, but this is not always the case.

One important point of which to be aware is that prioritisation only operates within a single grammar. Although we could make priority global by having all the rules in a single grammar, this is not ideal due to other considerations. Instead, we currently combine all the rules for each entity type in a single grammar. An index file (main.jape) is used to define which grammars should be used, and in which order they should be fired.

Note also that depending on the control style, firing a rule may ‘consume’ that part of the text, making it unavailable to be matched by other rules. This can be a problem for example if one rule uses context to make it more specific, and that context is then missed by later rules, having been consumed due to use of for example the ‘Brill’ control style. ‘All’, on the other hand, would allow it to be matched.

Using priority to resolve ambiguity

If the Appelt style of matching is selected, rule priority operates in the following way.

  1. Length of rule – a rule matching a longer pattern will fire first.
  2. Explicit priority declaration. Use the optional Priority function to assign a ranking. The higher the number, the higher the priority. If no priority is stated, the default is -1.
  3. Order of rules. In the case where the above two factors do not distinguish between two rules, the order in which the rules are stated applies. Rules stated first have higher priority.

Because priority can only operate within a single grammar, this can be a problem for dealing with ambiguity issues. One solution to this is to create a temporary set of annotations in initial grammars, and then manipulate this temporary set in one or more later phases (for example, by converting temporary annotations from different phases into permanent annotations in a single final phase). See the default set of grammars for an example of this.

If two possible ways of matching are found for the same text string, a conflict can arise. Normally this is handled by the priority mechanism (test length, rule priority and finally rule precedence). If all these are equal, Jape will simply choose a match at random and fire it. This leads ot non-deterministic behaviour, which should be avoided.

8.5 Using Phases Sequentially [#]

A JAPE grammar consists of a set of sequential phases. The list of phases is specified (in the order in which they are to be run) in a file, conventionally named main.jape. When loading the grammar into GATE, it is only necessary to load this main file – the phases will then be loaded automatically. It is, however, possible to omit this main file, and just load the phases individually, but this is much more time-consuming. The grammar phases do not need to be located in the same directory as the main file, but if they are not, the relative path should be specified for each phase.

One of the main reasons for using a sequence of phases is that a pattern can only be used once in each phase, but it can be reused in a later phase. Combined with the fact that priority can only operate within a single grammar, this can be exploited to help deal with ambiguity issues.

The solution currently adopted is to write a grammar phase for each annotation type, or for each combination of similar annotation types, and to create temporary annotations. These temporary annotations are accessed by later grammar phases, and can be manipulated as necessary to resolve ambiguity or to merge consecutive annotations. The temporary annotations can either be removed later, or left and simply ignored.

Generally, annotations about which we are more certain are created earlier on. Annotations which are more dubious may be created temporarily, and then manipulated by later phases as more information becomes available.

An annotation generated in one phase can be referred to in a later phase, in exactly the same way as any other kind of annotation (by specifying the name of the annotation within curly braces). The features and values can be referred to or omitted, as with all other annotations. Make sure that if the Input specification is used in the grammar, that the annotation to be referred to is included in the list.

8.6 Using Java Code on the RHS [#]

The RHS of a JAPE rule can consist of any Java code. This is useful for removing temporary annotations and for percolating and manipulating features from previous annotations. In the example below

The first rule below shows a rule which matches a first person name, e.g. ‘Fred’, and adds a gender feature depending on the value of the minorType from the gazetteer list in which the name was found. We first get the bindings associated with the person label (i.e. the Lookup annotation). We then create a new annotation called ‘personAnn’ which contains this annotation, and create a new FeatureMap to enable us to add features. Then we get the minorType features (and its value) from the personAnn annotation (in this case, the feature will be ‘gender’ and the value will be ‘male’), and add this value to a new feature called ‘gender’. We create another feature ‘rule’ with value ‘FirstName’. Finally, we add all the features to a new annotation ‘FirstPerson’ which attaches to the same nodes as the original ‘person’ binding.

Note that inputAS and outputAS represent the input and output annotation set. Normally, these would be the same (by default when using ANNIE, these will be the ‘Default’ annotation set). Since the user is at liberty to change the input and output annotation sets in the parameters of the JAPE transducer at runtime, it cannot be guaranteed that the input and output annotation sets will be the same, and therefore we must specify the annotation set we are referring to.

Rule: FirstName  
 
(  
 {Lookup.majorType == person_first}  
):person  
-->  
{  
gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");  
gate.Annotation personAnn = (gate.Annotation)person.iterator().next();  
gate.FeatureMap features = Factory.newFeatureMap();  
features.put("gender", personAnn.getFeatures().get("minorType"));  
features.put("rule", "FirstName");  
outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson",  
features);  
}

The second rule (contained in a subsequent grammar phase) makes use of annotations produced by the first rule described above. Instead of percolating the minorType from the annotation produced by the gazetteer lookup, this time it percolates the feature from the annotation produced by the previous grammar rule. So here it gets the ‘gender’ feature value from the ‘FirstPerson’ annotation, and adds it to a new feature (again called ‘gender’ for convenience), which is added to the new annotation (in outputAS) ‘TempPerson’. At the end of this rule, the existing input annotations (from inputAS) are removed because they are no longer needed. Note that in the previous rule, the existing annotations were not removed, because it is possible they might be needed later on in another grammar phase.

Rule: GazPersonFirst  
(  
 {FirstPerson}  
)  
:person  
-->  
{  
gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");  
gate.Annotation personAnn = (gate.Annotation)person.iterator().next();  
gate.FeatureMap features = Factory.newFeatureMap();  
 
features.put("gender", personAnn.getFeatures().get("gender"));  
features.put("rule", "GazPersonFirst");  
outputAS.add(person.firstNode(), person.lastNode(), "TempPerson",  
features);  
inputAS.removeAll(person);  
}

8.6.1 A More Complex Example

The example below is more complicated, because both the title and the first name (if present) may have a gender feature. There is a possibility of conflict since some first names are ambiguous, or women are given male names (e.g. Charlie). Some titles are also ambiguous, such as ‘Dr’, in which case they are not marked with a gender feature. We therefore take the gender of the title in preference to the gender of the first name, if it is present. So, on the RHS, we first look for the gender of the title by getting all Title annotations which have a gender feature attached. If a gender feature is present, we add the value of this feature to a new gender feature on the Person annotation we are going to create. If no gender feature is present, we look for the gender of the first name by getting all firstPerson annotations which have a gender feature attached, and adding the value of this feature to a new gender feature on the Person annotation we are going to create. If there is no firstPerson annotation and the title has no gender information, then we simply create the Person annotation with no gender feature.

Rule: PersonTitle  
Priority: 35  
/* allows Mr. Jones, Mr Fred Jones etc. */  
 
(  
 (TITLE)  
(FIRSTNAME | FIRSTNAMEAMBIG | INITIALS2)*  
 (PREFIX)?  
 {Upper}  
 ({Upper})?  
 (PERSONENDING)?  
)  
:person -->  
 {  
gate.FeatureMap features = Factory.newFeatureMap();  
gate.AnnotationSet personSet = (gate.AnnotationSet)bindings.get("person");  
 
// get all Title annotations that have a gender feature  
 HashSet fNames = new HashSet();  
    fNames.add("gender");  
    gate.AnnotationSet personTitle = personSet.get("Title", fNames);  
 
// if the gender feature exists  
 if (personTitle != null && personTitle.size()>0)  
{  
  gate.Annotation personAnn = (gate.Annotation)personTitle.iterator().next();  
  features.put("gender", personAnn.getFeatures().get("gender"));  
}  
else  
{  
  // get all firstPerson annotations that have a gender feature  
    gate.AnnotationSet firstPerson = personSet.get("FirstPerson", fNames);  
 
  if (firstPerson != null && firstPerson.size()>0)  
  // create a new gender feature and add the value from firstPerson  
 {  
  gate.Annotation personAnn = (gate.Annotation)firstPerson.iterator().next();  
  features.put("gender", personAnn.getFeatures().get("gender"));  
 }  
}  
  // create some other features  
  features.put("kind", "personName");  
  features.put("rule", "PersonTitle");  
  // create a Person annotation and add the features we’ve created  
outputAS.add(personSet.firstNode(), personSet.lastNode(), "TempPerson",  
features);  
}

8.6.2 Adding a Feature to the Document [#]

This is useful when using conditional controllers, where we only want to fire a particular resource under certain conditions. We first test the document to see whether it fulfils these conditions or not, and attach a feature to the document accordingly.

In the example below, we test whether the document contains an annotation of type ‘message’. In emails, there is often an annotation of this type (produced by the document format analysis when the document is loaded in GATE). Note that annotations produced by document format analysis are placed automatically in the ‘Original markups’ annotation set, so we must ensure that when running the processing resource containing this grammar that we specify the Original markups set as the input annotation set. It does not matter what we specify as the output annotation set, because the annotation we produce is going to be attached to the document and not to an output annotation set. In the example, if an annotation of type ‘message’ is found, we add the feature ‘genre’ with value ‘email’ to the document.

 Rule: Email  
 Priority: 150  
 
 (  
  {message}  
 )  
 -->  
 {  
 doc.getFeatures().put("genre", "email");  
 }

8.6.3 Finding the Tokens of a Matched Annotation [#]

In this section we will demonstrate how by using Java on the right-hand side one can find all Token annotations that are covered by a matched annotation, e.g., a Person or an Organization. This is useful if one wants to transfer some information from the matched annotations to the tokens. For example, to add to the Tokens a feature indicating whether or not they are covered by a named entity annotation deduced by the rule-based system. This feature can then be given as a feature to a learning PR, e.g. the HMM. Similarly, one can add a feature to all tokens saying which rule in the rule based system did the match, the idea being that some rules might be more reliable than others. Finally, yet another useful feature might be the length of the coreference chain in which the matched entity is involved, if such exists.

The example below is one of the pre-processing JAPE grammars used by the HMM application. To inspect all JAPE grammars, see the muse/applications/hmm directory in the distribution.

Phase:  NEInfo  
 
Input: Token Organization Location Person  
 
Options: control = appelt  
 
Rule:   NEInfo  
 
Priority:100  
 
({Organization} | {Person} | {Location}):entity  
-->  
{  
  //get the annotation set  
  gate.AnnotationSet annSet = ((gate.AnnotationSet)bindings.get("entity"));  
 
  //get the only annotation from the set  
  gate.Annotation entityAnn = (gate.Annotation)annSet.iterator().next();  
 
  gate.AnnotationSet tokenAS = inputAS.get("Token",  
        entityAnn.getStartNode().getOffset(),  
        entityAnn.getEndNode().getOffset());  
  List tokens = new ArrayList(tokenAS);  
  //if no tokens to match, do nothing  
  if (tokens.isEmpty())  
     return;  
  Collections.sort(tokens, new gate.util.OffsetComparator());  
 
  gate.Annotation curToken=null;  
  for (int i=0; i < tokens.size(); i++) {  
    curToken = (gate.Annotation) tokens.get(i);  
    String ruleInfo = (String) entityAnn.getFeatures().get("rule1");  
    String NMRuleInfo = (String) entityAnn.getFeatures().get("NMRule");  
    if ( ruleInfo != null) {  
      curToken.getFeatures().put("rule_NE_kind", entityAnn.getType());  
      curToken.getFeatures().put("NE_rule_id", ruleInfo);  
    }  
    else if (NMRuleInfo != null) {  
      curToken.getFeatures().put("rule_NE_kind", entityAnn.getType());  
      curToken.getFeatures().put("NE_rule_id", "orthomatcher");  
    }  
    else {  
      curToken.getFeatures().put("rule_NE_kind", "None");  
      curToken.getFeatures().put("NE_rule_id", "None");  
    }  
    List matchesList = (List) entityAnn.getFeatures().get("matches");  
    if (matchesList != null) {  
      if (matchesList.size() == 2)  
        curToken.getFeatures().put("coref_chain_length", "2");  
      else if (matchesList.size() > 2 && matchesList.size() < 5)  
        curToken.getFeatures().put("coref_chain_length", "3-4");  
    else  
        curToken.getFeatures().put("coref_chain_length", "5-more");  
    }  
    else  
      curToken.getFeatures().put("coref_chain_length", "0");  
  }//for  
 
}  
 
Rule:   TokenNEInfo  
Priority:10  
({Token}):entity  
-->  
{  
  //get the annotation set  
  gate.AnnotationSet annSet = ((gate.AnnotationSet)bindings.get("entity"));  
 
  //get the only annotation from the set  
  gate.Annotation entityAnn = (gate.Annotation)annSet.iterator().next();  
 
  entityAnn.getFeatures().put("rule_NE_kind", "None");  
  entityAnn.getFeatures().put("NE_rule_id", "None");  
  entityAnn.getFeatures().put("coref_chain_length", "0");  
}  

8.6.4 Using Named Blocks [#]

For the common case where a Java block refers just to the annotations from a single left-hand-side binding, JAPE provides a shorthand notation:

Rule: RemoveDoneFlag  
 
(  
  {Instance.flag == "done"}  
):inst  
-->  
:inst{  
  Annotation theInstance = (Annotation)instAnnots.iterator().next();  
  theInstance.getFeatures().remove("flag");  
}

This rule is equivalent to the following:

Rule: RemoveDoneFlag  
 
(  
  {Instance.flag == "done"}  
):inst  
-->  
{  
  AnnotationSet instAnnots = (AnnotationSet)bindings.get("inst");  
  if(instAnnots != null && instAnnots.size() != 0) {  
    Annotation theInstance = (Annotation)instAnnots.iterator().next();  
    theInstance.getFeatures().remove("flag");  
  }  
}

A label :<label> on a Java block creates a local variable <label>Annots within the Java block which is the AnnotationSet bound to the <label> label. Also, the Java code in the block is only executed if there is at least one annotation bound to the label, so you do not need to check this condition in your own code. Of course, if you need more flexibility, e.g. to perform some action in the case where the label is not bound, you will need to use an unlabelled block and perform the bindings.get() yourself.

8.6.5 Java RHS Overview [#]

When a JAPE grammar is parsed, a Jape parser creates action classes for all Java RHSs in the grammar. (one action class per RHS) RHS Java code will be embedded as a body of the method doIt and will work in context of this method. When a particular rule is fired, the method doIt will be executed.

Method doIt is specified by the interface gate.jape.RhsAction. Each action class implements this interface and is generated with the following template:

 
1import java.io.*; 
2import java.util.*; 
3import gate.*; 
4import gate.jape.*; 
5import gate.creole.ontology.*; 
6import gate.annotation.*; 
7import gate.util.*; 
8class <AutogeneratedActionClassName> 
9         implements java.io.Serializable, gate.jape.RhsAction { 
10    public void doIt(gate.Document doc, 
11                     java.util.Map bindings, 
12                     gate.AnnotationSet annotations, 
13                     gate.AnnotationSet inputAS, 
14                     gate.AnnotationSet outputAS, 
15                     gate.creole.ontology.Ontology ontology) 
16                     throws JapeException { 
17           // your RHS Java code will be embedded here ... 
18    } 
19}

Method doIt has the following parameters that can be used in RHS Java code:

In your Java RHS you can use short names for all Java classes that are imported by the action class (plus Java classes from the packages that are imported by default according to JVM specification: java.lang.*, java.math.*). But you need to use fully qualified Java class names for all other classes. For example:

-->  
{  
  // VALID line examples  
  AnnotationSet as = ...  
  InputStream is = ...  
  java.util.logging.Logger myLogger =  
          java.util.logging.Logger.getLogger("JAPELogger");  
  java.sql.Statement stmt = ...  
 
  // INVALID line examples  
  Logger myLogger = Logger.getLogger("JapePhaseLogger");  
  Statement stmt = ...  
}

In order to add additional Java import statements to all Java RHS’ of the rules in a JAPE grammar file, you can use the following code at the beginning of the JAPE file:

Imports: {  
import java.util.logging.Logger;  
import java.sql.*;  
}

These import statements will be added to the default import statements for each action class generated for a RHS and the corresponding classes can be used in the RHS Java code without the need to use fully qualified names.

8.7 Optimising for Speed [#]

The way in which grammars are designed can have a huge impact on the processing speed. Some simple tricks to keep the processing as fast as possible are:

8.8 Ontology Aware Grammar Transduction [#]

GATE supports two different methods for ontology aware grammar transduction. Firstly it is possible to use the ontology feature both in grammars and annotations, while using the default transducer. Secondly it is possible to use an ontology aware transducer by passing an ontology language resource to one of the subsumes methods in SimpleFeatureMapImpl. This second strategy does not check for ontology features, which will make the writing of grammars easier, as there is no need to specify ontology when writing them. More information about the ontology-aware transducer can be found in Section 14.9.

8.9 Serializing JAPE Transducer [#]

JAPE grammars are written as files with the extension ‘.jape’, which are parsed and compiled at run-time to execute them over the GATE document(s). Serialization of the JAPE Transducer adds the capability to serialize such grammar files and use them later to bootstrap new JAPE transducers, where they do not need the original JAPE grammar file. This allows people to distribute the serialized version of their grammars without disclosing the actual contents of their jape files. This is implemented as part of the JAPE Transducer PR. The following sections describe how to serialize and deserialize them.

8.9.1 How to Serialize?

Once an instance of a JAPE transducer is created, the option to serialize it appears in the context menu of that instance. The context menu can be activated by right clicking on the respective PR. Having done so, it asks for the file name where the serialized version of the respective JAPE grammar is stored.

8.9.2 How to Use the Serialized Grammar File?

The JAPE Transducer now also has an init-time parameter binaryGrammarURL, which appears as an optional parameter to the grammarURL. The User can use this parameter (i.e. binaryGrammarURL) to specify the serialized grammar file.

8.10 The JAPE Debugger [#]

As of Version 5.1 the Jape debugger is not supported.

8.11 Notes for Montreal Transducer Users [#]

In June 2008, the standard JAPE transducer implementation gained a number of features inspired by Luc Plamondon’s ‘Montreal Transducer’, which was available as a GATE plugin for several years, and was made obsolete in Version 5.1. If you have existing Montreal Transducer grammars and want to update them to work with the standard JAPE implementation you should be aware of the following differences in behaviour:

Chapter 9
ANNIC: ANNotations-In-Context [#]

ANNIC (ANNotations-In-Context) is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Data-stores, called Searchable Serial Data-store (SSD).

ANNIC can index documents in any format supported by the GATE system (i.e., XML, HTML, RTF, e-mail, text, etc). Compared with other such query systems, it has additional features addressing issues such as extensive indexing of linguistic information associated with document content, independent of document format. It also allows indexing and extraction of information from overlapping annotations and features. Its advanced graphical user interface provides a graphical view of annotation markups over the text, along with an ability to build new queries interactively. In addition, ANNIC can be used as a first step in rule development for NLP systems as it enables the discovery and testing of patterns in corpora.

ANNIC is built on top of the Apache Lucene1 – a high performance full-featured search engine implemented in Java, which supports indexing and search of large document collections. Our choice of IR engine is due to the customisability of Lucene. For more details on how Lucene was modified to meet the requirements of indexing and querying annotations, please refer to [Aswani et al. 05].

As explained earlier, SSD is an extension of the serial data-store. In addition to the persist location, SSD asks user to provide some more information (explained later) that it uses to index the documents. Once the SSD has been initiated, user can add/remove documents/corpora to the SSD in a similar way it is done with other data-stores. When documents are added to the SSD, it automatically tries to index them. It updates the index whenever there is a change in any of the documents stored in the SSD and removes the document from the index if it is deleted from the SSD. Be warned that only the annotation sets, types and features initially provided during the SSD creation time, will be updated when adding/removing documents to the datastore.

SSD has an advanced graphical interface that allows users to issue queries over the SSD. Below we explain the parameters required by SSD and how to instantiate it, how to use its graphical interface and how to use SSD programmatically.

9.1 Instantiating SSD [#]

Steps:

  1. In GATE Developer, right click on ‘Datastores’ and select ‘Create Datastore’.
  2. From a drop-down list select ‘Lucene Based Searchable DataStore’.
  3. Here, you will see an input window. Please provide these parameters:
    1. DataStore URL: Select an empty folder where the DS is created.
    2. Index Location: Select an empty folder. This is where the index will be created.
    3. Annotation Sets: Here, you can provide one or more annotation sets that you wish to index or exclude from being indexed. In order to be able to index the default annotation set, you must click on the edit list icon and add an empty field to the list. If there are no annotation sets provided, all the annotation sets in all documents are indexed.
    4. Base-Token Type: (e.g. Token or Key.Token) These are the basic tokens of any document. Your documents must have the annotations of Base-Token-Type in order to get indexed. These basic tokens are used for displaying contextual information while searching patterns in the corpus. In case of indexing more than one annotation set, user can specify the annotation set from which the tokens should be taken (e.g. Key.Token- annotations of type Token from the annotation set called Key). In case user does not provide any annotation set name (e.g. Token), the system searches in all the annotation sets to be indexed and the base-tokens from the first annotation set with the base token annotations are taken. Please note that the documents with no base-tokens are not indexed. However, if the ”create tokens automatically” option is selected, the SSD creates base-tokens automatically. Here, each string delimited with white space is considered as a token.
    5. Index Unit Type: (e.g. Sentence, Key.Sentence) This specifies the unit of Index. In other words, annotations lying within the boundaries of these annotations are indexed (e.g. in the case of “Sentences”, no annotations that are spanned across the boundaries of two sentences are considered for indexing). User can specify from which annotation set the index unit annotations should be considered. If user does not provide any annotation set, the SSD searches among all annotation sets for index units. If this field is left empty or SSD fails to locate index units, the entire document is considered as a single unit.
    6. Features: Finally, users can specify the annotation types and features that should be indexed or excluded from being indexed. (e.g. SpaceToken and Split). If user wants to exclude only a specific feature of a specific annotation type, he/she can specify it using a ’.’ separator between the annotation type and its feature (e.g. Person.matches).
  4. Click OK. If all parameters are OK, a new empty DS will be created.
  5. Create an empty corpus and save it to the SSD.
  6. Populate it with some documents. Each document added to the corpus and eventually to the SSD is indexed automatically. If the document does not have the required annotations, that document is skipped and not indexed.

9.2 Search GUI [#]


PIC

Figure 9.1: Searchable Serial Datastore Viewer.


9.2.1 Overview

Figure 9.1 shows the search GUI for a datastore. The top section contains a text area to write a query, lists to select the corpus and annotation set to search in, sliders to set the size of the results and context and icons to execute and clear the query.

The central section shows a graphical visualisation of stacked annotations and feature values for the result row selected in the bottom results table. You can also see the stack view configuration window where you define which annotation type and feature to display in the central section.

The bottom section contains the results table of the query, i.e. the text that matches the query with their left and right contexts. The bottom section contains also a tabbed pane of statistics.

9.2.2 Syntax of Queries [#]

SSD enables you to formulate versatile queries using a subset of JAPE patterns. Below, we give the JAPE pattern clauses which can be used as SSD queries. Queries can also be a combination of one or more of the following pattern clauses.

  1. String
  2. {AnnotationType}
  3. {AnnotationType == String}
  4. {AnnotationType.feature == feature value}
  5. {AnnotationType1, AnnotationType2.feature == featureValue}
  6. {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}

JAPE patterns also support the | (OR) operator. For instance, {A} ({B}|{C}) is a pattern of two annotations where the first is an annotation of type A followed by the annotation of type either B or C.

ANNIC supports two operators, + and *, to specify the number of times a particular annotation or a sub pattern should appear in the main query pattern. Here, ({A})+n means one and up to n occurrences of annotation {A} and ({A})*n means zero or up to n occurrences of annotation {A}.

Below we explain the steps to search in SSD.

  1. Double click on SSD. You will see an extra tab “Lucene DataStore Searcher”. Click on it to activate the searcher GUI.
  2. Here you can specify a query to search in your SSD. The query here is a L.H.S. part of the JAPE grammar. Here are some examples:

    1. {Person} – This will return annotations of type Person from the SSD
    2. {Token.string == “Microsoft”} – This will return all occurrences of “Microsoft” from the SSD.
    3. {Person}({Token})*2{Organization} – Person followed by zero or up to two tokens followed by Organization.
    4. {Token.orth==“upperInitial”, Organization} – Token with feature orth with value set to “upperInitial” and which is also annotated as Organization.


PIC

Figure 9.2: Searchable Serial Datastore Viewer - Auto-completion.


9.2.3 Top Section [#]

A text-area located in the top left part of the GUI is used to input a query. You can copy/cut/paste with Control+C/X/V, undo/redo your changes with Control+Z/Y as usual. To add a new line, use Control+Enter key combination.

Auto-completion shown on the figure 9.2 for annotation type is triggered when typing ’{’ or ’,’ and for feature when typing ’.’ after a valid annotation type. It shows only the annotation types and features related to the selected corpus and annotation set.

If you right-click on an expression it will automatically select the shortest valid enclosing brace and if you click on a selection it will propose you to add quantifiers for allowing the expression to appear zero, one or more times.

To execute the query, click on the magnifying glass icon, use Enter key or Alt+Enter key combination. To clear the query, click on the red X icon or use Alt+Backspace key combination.

It is possible to have more than one corpus, each containing a different set of documents, stored in a single data-store. ANNIC, by providing a drop down box with a list of stored corpora, also allows searching within a specific corpus. Similarly a document can have more than one annotation set indexed and therefore ANNIC also provides a drop down box with a list of indexed annotation sets for the selected corpus.

A large corpus can have many hits for a given query. This may take a long time to refresh the GUI and may create inconvenience while browsing through results. ANNIC therefore allows you to specify a number of results that you wish to retrieve at once and provides a way to iterate through next pages with the Next Page of Results button. Due to technical complexities, it is not possible to visit a previous page. To retrieve all the results at the same time, push the results slider to the right end.

9.2.4 Central Section [#]

Annotation types and features to show can be configured from the stack view configuration window by clicking on the Configure button in the central section. You can also change the feature value to be displayed by double clicking on the annotation type name.

The central section shows coloured rectangles exactly below the spans of text where these annotations occur. If only an annotation type is displayed, the rectangle remains empty. When you hover the mouse over the rectangle, it shows all their features and values in a popup window. If an annotation type and a feature are displayed, the value of that feature is shown in the rectangle.

Shortcuts are expression that stand for an ”AnnotationType.Feature” expression. For example, on the figure 9.1, the shortcut ”POS” stands for the expression ”Token.category”.

When you double click on an annotation rectangle, the respective query expression is placed at the caret position in the query text area. If you have selected anything in the query text area, it gets replaced. You can also double click on a word on the first line to add it to the query.

9.2.5 Bottom Section [#]

The table of results contains the text matched by the query, the contexts, the features from the matched annotations, the effective query, the document and annotation set names. You can sort a table column by clicking on its header.

You can remove a result from the results table or open the document containing it by right-clicking on a result in the results table.

ANNIC provides an Export button to export results into an HTML file. You can also select then copy/paste the table in your word processor or spreadsheet.

A statistics tabbed pane is displayed on the bottom-right. There is always a global statistics pane that list the count of the occurrences of all annotation types for the selected corpus and annotation set.

Statistics can be obtained for matched spans of the query in the results, with or without contexts, just by annotation type, an annotation type + feature or an annotation type + feature + value. A second pane contains the one item statistics that you can add by right-clicking on a non empty annotation rectangle or on the header of a row in the central section. You can sort a table column by clicking on its header.

9.3 Using SSD from GATE Embedded [#]

 
//how to instantiate a searchabledatastore  
===============================  
 
// create an instance of datastore  
LuceneDataStoreImpl ds = (LuceneDataStoreImpl)  
Factory.createDataStore(‘‘gate.persist.LuceneDataStoreImpl’’, dsLocation);  
 
// we need to set Indexer  
Indexer indexer = new LuceneIndexer(new URL(indexLocation));  
 
// set the parameters  
Map parameters = new HashMap();  
 
// specify the index url  
parameters.put(Constants.INDEX_LOCATION_URL, new URL(indexLocation));  
 
// specify the base token type  
// and specify that the tokens should be created automatically  
// if not found in the document  
parameters.put(Constants.BASE_TOKEN_ANNOTATION_TYPE, ‘‘Token’’);  
parameters.put(Constants.CREATE_TOKENS_AUTOMATICALLY, new Boolean(true));  
 
// specify the index unit type  
parameters.put(Constants.INDEX_UNIT_ANNOTATION_TYPE, ‘‘Sentence’’);  
 
// specifying the annotation sets "Key" and "Default Annotation Set"  
// to be indexed  
List<String> setsToInclude = new ArrayList<String>();  
setsToInclude.add("Key");  
setsToInclude.add("<null>");  
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_INCLUDE, setsToInclude);  
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_EXCLUDE, new ArrayList<String>());  
 
// all features should be indexed  
parameters.put(Constants.FEATURES_TO_INCLUDE, new ArrayList<String>());  
parameters.put(Constants.FEATURES_TO_EXCLUDE, new ArrayList<String>());  
 
// set the indexer  
ds.setIndexer(indexer, parameters);  
 
// set the searcher  
ds.setSearcher(new LuceneSearcher());  
 
//how to search in this datastore  
//======================  
// obtain the searcher instance  
Searcher searcher = ds.getSearcher();  
Map parameters  = new HashMap();  
 
// obtain the url of index  
String indexLocation =  
new File(((URL) ds.getIndexer().getParameters().get(Constants.INDEX_LOCATION_URL))  
.getFile()).getAbsolutePath();  
ArrayList indexLocations = new ArrayList();  
indexLocations.add(indexLocation);  
 
 
// corpus2SearchIn = mention corpus name that was indexed here.  
 
// the annotation set to search in  
String annotationSet2SearchIn = "Key";  
 
// set the parameter  
parameters.put(Constants.INDEX_LOCATIONS,indexLocations);  
parameters.put(Constants.CORPUS_ID, corpus2SearchIn);  
parameters.put(Constants.ANNOTATION_SET_ID, annotationSet);  
parameters.put(Constants.CONTEXT_WINDOW, contextWindow);  
parameters.put(Constants.NO_OF_PATTERNS, noOfPatterns);  
 
// search  
String query = "{Person}";  
Hit[] hits = searcher.search(query, parameters);  

Chapter 10
Performance Evaluation of Language Analysers [#]

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science. (Kelvin)

Not everything that counts can be counted, and not everything that can be counted counts. (Einstein)

GATE provides a variety of tools for automatic evaluation. The Annotation Diff tool compares two annotation sets within a document. Corpus QA extends Annotation Diff to an entire corpus. The Corpus Benchmark tool also provides functionality for comparing annotation sets over an entire corpus. Additionally, two plugins cover similar functionality; one implements inter-annotator agreement, and the other, the balanced distance metric.

These tools are particularly useful not just as a final measure of performance, but as a tool to aid system development by tracking progress and evaluating the impact of changes as they are made. Applications include evaluating the success of a machine learning or language engineering application by comparing its results to a gold standard and also comparing annotations prepared by two human annotators to each other to ensure that the annotations are reliable.

This chapter begins by introducing the concepts and metrics relevant, before describing each of the tools in turn.

10.1 Metrics for Evaluation in Information Extraction [#]

When we evaluate the performance of a processing resource such as tokeniser, POS tagger, or a whole application, we usually have a human-authored ‘gold standard’ against which to compare our software. However, it is not always easy or obvious what this gold standard should be, as different people may have different opinions about what is correct. Typically, we solve this problem by using more than one human annotator, and comparing their annotations. We do this by calculating inter-annotator agreement (IAA), also known as inter-rater reliability.

IAA can be used to assess how difficult a task is. This is based on the argument that if two humans cannot come to agreement on some annotation, it is unlikely that a computer could ever do the same annotation ‘correctly’. Thus, IAA can be used to find the ceiling for computer performance.

There are many possible metrics for reporting IAA, such as Cohen’s Kappa, prevalence, and bias [Eugenio & Glass 04]. Kappa is the best metric for IAA when all the annotators have identical exhaustive sets of questions on which they might agree or disagree. This could be a task like ‘read over this text and mark up all telephone numbers’. However, sometimes there is disagreement about the set of questions, e.g. when the annotators themselves determine which text spans they ought to annotate. That could be a task like ‘read over this text and mark up all references to politics’. When annotators determine their own sets of questions, it is appropriate to use precision, recall, and F-measure to report IAA. Precision, recall and F-measure are also appropriate choices when assessing performance of an automated application against a trusted gold standard.

In this section, we will first introduce some relevant terms, before outlining Cohen’s Kappa, in Section 10.1.2. We will then introduce precision, recall and F-measure in Section 10.1.3.

10.1.1 Annotation Relations [#]

Before introducing the metrics we will use in this chapter, we will first outline the ways in which annotations can relate to each other. These ways of comparing annotations to each other are used to determine the counts that then go into calculating the metrics of interest. Consider a document with two annotation sets upon it. These annotation sets might for example be prepared by two human annotators, or alternatively, one set might be produced by an automated system and the other might be a trusted gold standard. We wish to assess the extent to which they agree. We begin by counting incidences of the following relations:

Coextensive
Two annotations are coextensive if they hit the same span of text in a document. Basically, both their start and end offsets are equal.
Overlaps
Two annotations overlap if they share a common span of text.
Compatible
Two annotations are compatible if they are coextensive and if the features of one (usually the ones from the key) are included in the features of the other (usually the response).
Partially Compatible
Two annotations are partially compatible if they overlap and if the features of one (usually the ones from the key) are included in the features of the other (response).
Missing
This applies only to the key annotations. A key annotation is missing if either it is not coextensive or overlapping, orif one or more features are not included in the response annotation.
Spurious
This applies only to the response annotations. A response annotation is spurious if either it is not coextensive or overlapping, or if one or more features from the key are not included in the response annotation.

10.1.2 Cohen’s Kappa [#]

The three commonly used IAA measures are observed agreement, specific agreement, and Kappa (κ) [Hripcsak & Heitjan 02]. Those measures can be calculated from a contingency table, which lists the numbers of instances of agreement and disagreement between two annotators on each category. To explain the IAA measures, a general contingency table for two categories cat1 and cat2 is shown in Table 10.1.


Table 10.1: Contingency table for two-category problem




Annotator-2




Annotator-1 cat1 cat2marginal sum




cat1 a b a+b
cat2 c d c+d




marginal sum a+c b+d a+b+c+d





Observed agreement is the portion of the instances on which the annotators agree. For the two annotators and two categories as shown in Table 10.1, it is defined as

         a + d
Ao =  -------------
      a + b + c + d  (10.1)

The extension of the above formula to more than two categories is straightforward. The extension to more than two annotators is usually taken as the mean of the pair-wise agreements [Fleiss 75], which is the average agreement across all possible pairs of annotators. An alternative compares each annotator with the majority opinion of the others [Fleiss 75].

However, the observed agreement has two shortcomings. One is that a certain amount of agreement is expected by chance. The Kappa measure is a chance-corrected agreement. Another is that it sums up the agreement on all the categories, but the agreements on each category may differ. Hence the category specific agreement is needed.

Specific agreement quantifies the degree of agreement for each of the categories separately. For example, the specific agreement for the two categories list in Table 10.1 is the following, respectively,

        ----2a----           ----2d----
Acat1 = 2a + b + c;  Acat2 = b + c + 2d  (10.2)

Kappa is defined as the observed agreements Ao minus the agreement expected by chance Ae and is normalized as a number between -1 and 1.

κ = Ao----Ae
     1 - Ae  (10.3)

κ = 1 means perfect agreements, κ = 0 means the agreement is equal to chance, κ = -1 means ‘perfect’ disagreement.

There are two different ways of computing the chance agreement Ae (for a detailed explanations about it see [Eugenio & Glass 04]). The Cohen’s Kappa is based on the individual distribution of each annotator, while the Siegel & Castellan’s Kappa is based on the assumption that all the annotators have the same distribution. The former is more informative than the latter and has been used widely.

The Kappa suffers from the prevalence problem which arises because imbalanced distribution of categories in the data increases Ae. The prevalence problem can be alleviated by reporting the positive and negative specified agreement on each category besides the Kappa [Hripcsak & Heitjan 02Eugenio & Glass 04]. In addition, the so-called bias problem affects the Cohen’s Kappa, but not S&C’s. The bias problem arises as one annotator prefers one particular category more than another annotator. [Eugenio & Glass 04] advised to compute the S&C’s Kappa and the specific agreements along with the Cohen’s Kappa in order to handle these problems.

Despite the problem mentioned above, the Cohen’s Kappa remains a popular IAA measure. Kappa can be used for more than two annotators based on pair-wise figures, e.g. the mean of all the pair-wise Kappa as an overall Kappa measure. The Cohen’s Kappa can also be extended to the case of more than two annotators by using the following single formula [Davies & Fleiss 82]

                          2  ∑  ∑    2
        ---------------IJ------i--c-Yic---------------
κ = 1 - I(J (J - 1)∑c (pc(1 - pc)) + ∑c ∑j (pcj - pc)2)  (10.4)

Where I and J are the number of instances and annotators, respectively; Y ic is the number of annotators who assigns the category c to the instance I; pcj is the probability of the annotator j assigning category c; pc is the probability of assigning category by all annotators (i.e. averaging pcj over all annotators).

S&C’s Kappa is applicable for any number of annotators. S&C’s Kappa for two annotators is also known as Scott’s Pi (see [Lombard et al. 02]). The Krippendorff’s alpha, another variant of Kappa, differs only slightly from the S&C’s Kappa on nominal category problem (see [Carletta 96Eugenio & Glass 04]).

However, note that the Kappa (and the observed agreement) is not applicable to some tasks. Named entity annotation is one such task [Hripcsak & Rothschild 05]. In the named entity annotation task, annotators are given some text and are asked to annotate some named entities (and possibly their categories) in the text. Different annotators may annotate different instances of the named entity. So, if one annotator annotates one named entity in the text but another annotator does not annotate it, then that named entity is a non-entity for the latter. However, generally the non-entity in the text is not a well-defined term, e.g. we don’t know how many words should be contained in the non-entity. On the other hand, if we want to compute Kappa for named entity annotation, we need the non-entities. This is why people don’t compute Kappa for the named entity task.

10.1.3 Precision, Recall, F-Measure [#]

Much of the research in IE in the last decade has been connected with the MUC competitions, and so it is unsurprising that the MUC evaluation metrics of precision, recall and F-measure [Chinchor 92] also tend to be used, along with slight variations. These metrics have a very long-standing tradition in the field of IR [van Rijsbergen 79] (see also [Manning & Schütze 99Frakes & Baeza-Yates 92]).

Precision measures the number of correctly identified items as a percentage of the number of items identified. In other words, it measures how many of the items that the system identified were actually correct, regardless of whether it also failed to retrieve correct items. The higher the precision, the better the system is at ensuring that what is identified is correct.

Error rate is the inverse of precision, and measures the number of incorrectly identified items as a percentage of the items identified. It is sometimes used as an alternative to precision.

Recall measures the number of correctly identified items as a percentage of the total number of correct items. In other words, it measures how many of the items that should have been identified actually were identified, regardless of how many spurious identifications were made. The higher the recall rate, the better the system is at not missing correct items.

Clearly, there must be a tradeoff between precision and recall, for a system can easily be made to achieve 100% precision by identifying nothing (and so making no mistakes in what it identifies), or 100% recall by identifying everything (and so not missing anything). The F-measure [van Rijsbergen 79] is often used in conjunction with Precision and Recall, as a weighted average of the two. False positives are a useful metric when dealing with a wide variety of text types, because it is not dependent on relative document richness in the same way that precision is. By this we mean the relative number of entities of each type to be found in a set of documents.

When comparing different systems on the same document set, relative document richness is unimportant, because it is equal for all systems. When comparing a single system’s performance on different documents, however, it is much more crucial, because if a particular document type has a significantly different number of any type of entity, the results for that entity type can become skewed. Compare the impact on precision of one error where the total number of correct entities = 1, and one error where the total = 100. Assuming the document length is the same, then the false positive score for each text, on the other hand, should be identical.

Common metrics for evaluation of IE systems are defined as follows:

             ----Correct-+--1∕2P-artial----
P recision =  Correct + Spurious  + P artial  (10.5)

             Correct  + 1∕2P artial
Recall = ------------------------------
         Correct  + M issing + P artial  (10.6)

                (β2 + 1)P * R
F - measure   = ----2---------
                  (β R ) + P  (10.7)

where β reflects the weighting of P vs. R. If β is set to 1, the two are weighted equally.

F alseP ositive = Spurious--
                     c  (10.8)

where c is some constant independent from document richness, e.g. the number of tokens or sentences in the document.

Note that we consider annotations to be partially correct if the entity type is correct and the spans are overlapping but not identical. Partially correct responses are normally allocated a half weight.

10.1.4 Macro and Micro Averaging [#]

Where precision, recall and f-measure are calculated over a corpus, there are options in terms of how document statistics are combined.

The method of choice depends on the priorities of the case in question. Macro averaging tends to increase the importance of shorter documents.

It is also possible to calculate a macro average across annotation types; that is to say, precision, recall and f-measure are calculated separately for each annotation type and the results then averaged.

10.2 The Annotation Diff Tool [#]

The Annotation Diff tool enables two sets of annotations in one or two documents to be compared, in order either to compare a system-annotated text with a reference (hand-annotated) text, or to compare the output of two different versions of the system (or two different systems). For each annotation type, figures are generated for precision, recall, F-measure. Each of these can be calculated according to 3 different criteria - strict, lenient and average. The reason for this is to deal with partially correct responses in different ways.

It can be accessed both from GATE Developer and from GATE Embedded. Annotation Diff compares sets of annotations with the same type. When performing the comparison, the annotation offsets and their features will be taken into consideration. and after that, the comparison process is triggered. Figure 10.1 shows the Annotation Diff window.


PIC


Figure 10.1: Annotation diff window with the parameters at the top, the comparison table in the center and the adjudication panel at the bottom.


All annotations from the key set are compared with the ones from the response set, and those found to have the same start and end offsets are displayed on the same line in the table. Then, the Annotation Diff evaluates if the features of each annotation from the response set subsume those features from the key set, as specified by the features names you provide.

In order to create a gold standard set from two sets you need to show the ‘Adjudication’ panel at the bottom. It will insert two checkboxes columns in the central table. Tick boxes in the ‘K(ey)’ and ‘R(esponse)’ then input a Target set in the text field and use the ‘Copy selection to target’ button to copy all annotations selected to the target annotation set. There is a context menu for the checkboxes to tick them quickly.

To use the annotation diff tool, see Section 10.2.1. To compare more than two annotation sets, see Section 3.4.3.

10.2.1 Performing Evaluation with the Annotation Diff Tool [#]


PIC


Figure 10.2: Annotation diff window with the parameters at the top, the comparison table in the center and the statistics panel at the bottom.


The Annotation Diff tool is activated by selecting it from the Tools menu at the top of the GATE Developer window. It will appear in a new window. Select the key and response documents to be used (note that both must have been previously loaded into the system), the annotation sets to be used for each, and the annotation type to be compared.

Note that the tool automatically intersects all the annotation types from the selected key annotation set with all types from the response set.

On a separate note, you can perform a diff on the same document, between two different annotation sets. One annotation set could contain the key type and another could contain the response one.

After the type has been selected, the user is required to decide how the features will be compared. It is important to know that the tool compares them by analysing if features from the key set are contained in the response set. It checks for both the feature name and feature value to be the same.

There are three basic options to select:

The weight for the F-Measure can also be changed - by default it is set to 1.0 (i.e. to give precision and recall equal weight). Finally, click on ‘Compare’ to display the results. Note that the window may need to be resized manually, by dragging the window edges as appropriate).

In the main window, the key and response annotations will be displayed. They can be sorted by any category by clicking on the central column header: ‘=?’. The key and response annotations will be aligned if their indices are identical, and are color coded according to the legend displayed at the bottom.

Precision, recall, F-measure are also displayed below the annotation tables, each according to 3 criteria - strict, lenient and average. See Sections 10.2 and 10.1 for more details about the evaluation metrics.

The results can be saves to an HTML file by using the ‘Export to HTML’ button. This creates an HTML snapshot of what the Annotation Diff table shows at that moment. The columns and rows in the table will be shown in the same order, and the hidden columns will not appear in the HTML file. The colours will also be the same.

If you need more details or context you can use the button ‘Show document’ to display the document and the annotations selected in the annotation diff drop down lists and table.

10.3 Corpus Quality Assurance [#]


PIC


Figure 10.3: Corpus Quality assurance showing the document statistics table


10.3.1 Description of the interface

A bottom tab in each corpus view is entitled ‘Corpus Quality Assurance’. This tab will allow you to calculate precision, recall and F-score between two annotation sets in a corpus without the need to load a plugin. It extends the Annotation Diff functionality to the entire corpus in a convenient interface.

The main part of the view consists of two tabs each containing a table. One tab is entitled ‘Corpus statistics’ and the other is entitled ‘Document statistics’.

To the right of the tabbed area is a configuration pane in which you can select the annotation sets you wish to compare, the annotation types you are interested in and the annotation features you wish to specify for use in the calculation if any.

You can also choose whether to calculate agreement on a strict or lenient basis or take the average of the two. (Recall that strict matching requires two annotations to have an identical span if they are to be considered a match, where lenient matching accepts a partial match; annotations are overlapping but not identical in span.)

At the top, four icons are for opening a document or Annotation Diff only when a row in the document statistics table is selected, exporting the two tables to an HTML file and reloading the list of sets, types and features when some documents have been modified in the corpus.

Corpus Quality Assurance works also with a corpus inside a datastore. Using a datastore is useful to minimise memory consumption when you have a big corpus.

10.3.2 Step by step usage

Begin by selecting the annotation sets you wish to compare in the top list in the configuration pane. Clicking on an annotation set labels it annotation set A (an ‘(A)’ will appear beside it to indicate that this is your selection for annotation set A). Now click on another annotation set. This will be labelled annotation set B.

To change your selection, deselect an annotation set by clicking on it a second time. You can now choose another annotation set. Note that you do not need to hold the control key down to select the second annotation set. This list is configured to accept two (and no more than two) selections. If you wish, you may check the box ‘present in every document’ to reduce the annotation sets list to only those sets present in every document.

You may now choose the annotation types you are interested in. If you don’t choose any then all will be used. If you wish, you may check the box ‘present in every selected set’ to reduce the annotation types list to only those present in every selected annotation set.

Optionally you can choose the annotation features you wish to include in the calculation. If you choose features, then for an annotation to be considered a match to another, their feature values must also match. If you select the box ‘present in every selected type’ the features list will be reduced to only those present in every type you selected.

The ‘Measures’ list allows you to choose whether to calculate strict or lenient figures or average the two. You may choose as many as you wish, and they will be included as columns in the table to the left.

Finally, click on the ‘Compare’ button to recalculate the tables. The figures that appear in the two tables (one per tab) are described below.

10.3.3 Details of the Corpus statistics table

In this table you will see that one row appears for every annotation type you chose. Columns give total counts for matching annotations (‘Match’), annotations only present in annotation set A (‘Only A’), annotations only present in annotation set B (‘Only B’) and annotations that overlapped (‘Overlap’).

Depending on whether one of your annotation sets is considered a gold standard, you might prefer to think of ‘Only A’ as missing and ‘Only B’ as spurious, or vice versa, but the Corpus Quality Assurance tool makes no assumptions about which if any annotation set is the gold standard. Where it is being used to calculate Inter Annotator Agreement there is no concept of a ‘correct’ set. However, in ‘MUC’ terms, ‘Match’ would be correct and ‘Overlap’ would be partial.

After these columns, three columns appear for every measure you chose to calculate. If you chose to calculate a strict F1, a recall, precision and F1 column will appear for the strict counts. If you chose to calculate a lenient F1, precision, recall and F1 columns will also appear for lenient counts.

In the corpus statistics table, calculations are done on a per type basis and include all documents in the calculation. Final rows in the table provide summaries; total counts are given along with a micro and a macro average.

Micro averaging treats the entire corpus as one big document where macro averaging, on this table, is the arithmetic mean of the per-type figures. See Section 10.1.4 for more detail on the distinction between a micro and a macro average.

10.3.4 Details of the Document statistics table

In this table you will see that one row appears for every document in the corpus. Columns give counts as in the corpus statistics table, but this time on a per-document basis.

As before, for every measure you choose to calculate, precision, recall and F1 columns will appear in the table.

Summary rows, again, give a macro average (arithmetic mean of the per-document measures) and micro average (identical to the figure in the corpus statistics table).

10.4 Corpus Benchmark Tool [#]

Like the Corpus Quality Assurance functionality, the corpus benchmark tool enables evaluation to be carried out over a whole corpus rather than a single document. Unlike Corpus QA, it uses matched corpora to achieve this, rather than comparing annotation sets within a corpus. It enables tracking of the system’s performance over time. It provides more detailed information regarding the annotations that differ between versions of the corpus (e.g. annotations created by different versions of an application) than the Corpus QA tool does.

The basic idea with the tool is to evaluate an application with respect to a ‘gold standard’. You have a ‘marked’ corpus containing the gold standard reference annotations; you have a ‘clean’ copy of the corpus that does not contain the annotations in question, and you have an application that creates the annotations in question. Now you can see how you are getting on, by comparing the result of running your application on ‘clean’ to the ‘marked’ annotations.

10.4.1 Preparing the Corpora for Use [#]

You will need to prepare the following directory structure:

main directory (can have any name)  
|  
|__"clean" (directory containing unannotated documents in XML form)  
|  
|__"marked" (directory containing annotated documents in XML form)  
|  
|__"processed" (directory containing the datastore which is generated  
               when you ‘store corpus for future evaluation’)

10.4.2 Defining Properties [#]

The properties of the corpus benchmark tool are defined in the file ‘corpus_tool.properties’, which should be located in the GATE home directory. GATE will tell you where it’s looking for the properties file in the ‘message’ panel when you run the Corpus Benchmark Tool. It is important to prepare this file before attempting to run the tool because there is no file present by default, so unless you prepare this file, the corpus benchmark tool will not work!

The following properties should be set:

The default annotation set has to be represented by an empty string. The outputSetName and annotSetName must be different, and cannot both be the default annotation set. (If they are the same, then use the Annotation Set Transfer PR to change one of them.) If you omit any line (or just leave the value blank), that property reverts to default. For example, ‘annotSetName=’ is the same as leaving that line out.

An example file is shown below:

threshold=0.7  
annotSetName=Key  
outputSetName=ANNIE  
annotTypes=Person;Organization;Location;Date;Address;Money  
annotFeatures=type;gender

Here is another example:

threshold=0.6  
annotSetName=Filtered  
outputSetName=  
annotTypes=Mention  
annotFeatures=class

10.4.3 Running the Tool [#]

To use the tool, first make sure the properties of the tool have been set correctly (see Section 10.4.2 for how to do this) and that the corpora and directory structure have been prepared as outlined in Section 10.4.1. Also, make sure that your application is saved to file (see Section 3.8.3). Then, from the ‘Tools’ menu, select ‘Corpus Benchmark’. You have four options:

  1. Default Mode