This documentation is for the latest snapshot of GATE Developer/Embedded.
If you are using a release version you should refer to the documentation for that
release instead, which is accessible via the Help menu in GATE Developer.
The latest release is version 8.6.1
Developing Language Processing
Components with GATE
Version 9 (a User Guide)
For GATE version 9.1-SNAPSHOT (development builds)
(built August 16, 2023)
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj
Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic,
Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim
Peters, Leon Derczynski, et al
©The University of Sheffield, Department of Computer Science 2001-2023
https://gate.ac.uk/
Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Ontotext Matrixware, the Information Retrieval Facility and several EU-funded projects (TrendMiner, uComp, Arcomem, SEKT, TAO, NeOn, MediaCampaign, Musing, KnowledgeWeb, PrestoSpace, h-TechSight, and enIRaF).
Contents
1 Introduction
1.1 How to Use this Text
1.2 Context
1.3 Overview
1.3.1 Developing and Deploying Language Processing Facilities
1.3.2 Built-In Components
1.3.3 Additional Facilities in GATE Developer/Embedded
1.3.4 An Example
1.4 Some Evaluations
1.5 Recent Changes
1.5.1 Version 9.0.1 (March 2021)
1.5.2 Version 9.0 (February 2021)
1.5.3 Version 8.6.1 (January 2020)
1.5.4 Version 8.6 (June 2019)
1.5.5 Version 8.5.1 (June 2018)
1.5.6 Version 8.5 (May 2018)
1.6 Further Reading
2 Installing and Running GATE
2.1 Downloading GATE
2.2 Installing and Running GATE
2.2.1 The Easy Way
2.2.2 The Hard Way (1)
2.2.3 The Hard Way (2): Git
2.2.4 Running GATE Developer on Unix/Linux
2.3 Using System Properties with GATE
2.4 Changing GATE’s launch configuration
2.5 Configuring GATE
2.6 Building GATE
2.6.1 Using GATE with Maven/Ivy
2.7 Uninstalling GATE
2.8 Troubleshooting
3 Using GATE Developer
3.1 The GATE Developer Main Window
3.2 Loading and Viewing Documents
3.3 Creating and Viewing Corpora
3.4 Working with Annotations
3.4.1 The Annotation Sets View
3.4.2 The Annotations List View
3.4.3 The Annotations Stack View
3.4.4 The Co-reference Editor
3.4.5 Creating and Editing Annotations
3.4.6 Schema-Driven Editing
3.4.7 Printing Text with Annotations
3.5 Using CREOLE Plugins
3.6 Installing and updating CREOLE Plugins
3.7 Loading and Using Processing Resources
3.8 Creating and Running an Application
3.8.1 Running an Application on a Datastore
3.8.2 Running PRs Conditionally on Document Features
3.8.3 Doing Information Extraction with ANNIE
3.8.4 Modifying ANNIE
3.9 Saving Applications and Language Resources
3.9.1 Saving Documents to File
3.9.2 Saving and Restoring LRs in Datastores
3.9.3 Saving Application States to a File
3.9.4 Saving an Application with its Resources (e.g. GATE Cloud)
3.9.5 Upgrade An Application to use Newer Versions of Plugins
3.10 Keyboard Shortcuts
3.11 Miscellaneous
3.11.1 Stopping GATE from Restoring Developer Sessions/Options
3.11.2 Working with Unicode
4 CREOLE: the GATE Component Model
4.1 The Web and CREOLE
4.2 The GATE Framework
4.3 The Lifecycle of a CREOLE Resource
4.4 Processing Resources and Applications
4.5 Language Resources and Datastores
4.6 Built-in CREOLE Resources
4.7 CREOLE Resource Configuration
4.7.1 Configuring Resources using Annotations
4.7.2 Loading Third-Party Libraries in a Maven plugin
4.8 Tools: How to Add Utilities to GATE Developer
4.8.1 Putting Your Tools in a Sub-Menu
4.8.2 Adding Tools To Existing Resource Types
5 Language Resources: Corpora, Documents and Annotations
5.1 Features: Simple Attribute/Value Data
5.2 Corpora: Sets of Documents plus Features
5.3 Documents: Content plus Annotations plus Features
5.4 Annotations: Directed Acyclic Graphs
5.4.1 Annotation Schemas
5.4.2 Examples of Annotated Documents
5.4.3 Creating, Viewing and Editing Diverse Annotation Types
5.5 Document Formats
5.5.1 Detecting the Right Reader
5.5.2 XML
5.5.3 HTML
5.5.4 SGML
5.5.5 Plain text
5.5.6 RTF
5.5.7 Email
5.5.8 PDF Files and Office Documents
5.5.9 UIMA CAS Documents
5.5.10 CoNLL/IOB Documents
5.6 XML Input/Output
6 ANNIE: a Nearly-New Information Extraction System
6.1 Document Reset
6.2 Tokeniser
6.2.1 Tokeniser Rules
6.2.2 Token Types
6.2.3 English Tokeniser
6.3 Gazetteer
6.4 Sentence Splitter
6.5 RegEx Sentence Splitter
6.6 Part of Speech Tagger
6.7 Semantic Tagger
6.8 Orthographic Coreference (OrthoMatcher)
6.8.1 GATE Interface
6.8.2 Resources
6.8.3 Processing
6.9 Pronominal Coreference
6.9.1 Quoted Speech Submodule
6.9.2 Pleonastic It Submodule
6.9.3 Pronominal Resolution Submodule
6.9.4 Detailed Description of the Algorithm
6.10 A Walk-Through Example
6.10.1 Step 1 - Tokenisation
6.10.2 Step 2 - List Lookup
6.10.3 Step 3 - Grammar Rules
II GATE for Advanced Users
7 GATE Embedded
7.1 Quick Start with GATE Embedded
7.2 Resource Management in GATE Embedded
7.3 Using CREOLE Plugins
7.4 Language Resources
7.4.1 GATE Documents
7.4.2 Feature Maps
7.4.3 Annotation Sets
7.4.4 Annotations
7.4.5 GATE Corpora
7.5 Processing Resources
7.6 Controllers
7.7 Modelling Relations between Annotations
7.8 Duplicating a Resource
7.8.1 Sharable properties
7.9 Persistent Applications
7.10 Ontologies
7.11 Loading Annotation Schemas
7.12 Creating a New CREOLE Resource
7.12.1 Dependencies
7.13 Adding Support for a New Document Format
7.14 Using GATE Embedded in a Multithreaded Environment
7.15 Using GATE Embedded within a Spring Application
7.15.1 Duplication in Spring
7.15.2 Spring pooling
7.15.3 Further reading
7.16 Groovy for GATE
7.16.1 Groovy Scripting Console for GATE
7.16.2 Groovy scripting PR
7.16.3 The Scriptable Controller
7.16.4 Utility methods
7.17 Saving Config Data to gate.xml
7.18 Annotation merging through the API
7.19 Using Resource Helpers to Extend the API
7.20 Converting a Directory Plugin to a Maven Plugin
8 JAPE: Regular Expressions over Annotations
8.1 The Left-Hand Side
8.1.1 Matching Entire Annotation Types
8.1.2 Using Features and Values
8.1.3 Using Meta-Properties
8.1.4 Building complex patterns from simple patterns
8.1.5 Matching a Simple Text String
8.1.6 Using Templates
8.1.7 Multiple Pattern/Action Pairs
8.1.8 LHS Macros
8.1.9 Multi-Constraint Statements
8.1.10 Using Context
8.1.11 Negation
8.1.12 Escaping Special Characters
8.2 LHS Operators in Detail
8.2.1 Equality Operators
8.2.2 Comparison Operators
8.2.3 Regular Expression Operators
8.2.4 Contextual Operators
8.2.5 Custom Operators
8.3 The Right-Hand Side
8.3.1 A Simple Example
8.3.2 Copying Feature Values from the LHS to the RHS
8.3.3 Optional or Empty Labels
8.3.4 RHS Macros
8.4 Use of Priority
8.5 Using Phases Sequentially
8.6 Using Java Code on the RHS
8.6.1 A More Complex Example
8.6.2 Adding a Feature to the Document
8.6.3 Finding the Tokens of a Matched Annotation
8.6.4 Using Named Blocks
8.6.5 Java RHS Overview
8.7 Optimising for Speed
8.8 Ontology Aware Grammar Transduction
8.9 Serializing JAPE Transducer
8.9.1 How to Serialize?
8.9.2 How to Use the Serialized Grammar File?
8.10 Notes for Montreal Transducer Users
8.11 JAPE Plus
9 ANNIC: ANNotations-In-Context
9.1 Instantiating SSD
9.2 Search GUI
9.2.1 Overview
9.2.2 Syntax of Queries
9.2.3 Top Section
9.2.4 Central Section
9.2.5 Bottom Section
9.3 Using SSD from GATE Embedded
9.3.1 How to instantiate a searchabledatastore
9.3.2 How to search in this datastore
10 Performance Evaluation of Language Analysers
10.1 Metrics for Evaluation in Information Extraction
10.1.1 Annotation Relations
10.1.2 Cohen’s Kappa
10.1.3 Precision, Recall, F-Measure
10.1.4 Macro and Micro Averaging
10.2 The Annotation Diff Tool
10.2.1 Performing Evaluation with the Annotation Diff Tool
10.2.2 Creating a Gold Standard with the Annotation Diff Tool
10.2.3 A warning about feature values
10.3 Corpus Quality Assurance
10.3.1 Description of the interface
10.3.2 Step by step usage
10.3.3 Details of the Corpus statistics table
10.3.4 Details of the Document statistics table
10.3.5 GATE Embedded API for the measures
10.3.6 A warning about feature values
10.3.7 Quality Assurance PR
10.4 Corpus Benchmark Tool
10.4.1 Preparing the Corpora for Use
10.4.2 Defining Properties
10.4.3 Running the Tool
10.4.4 The Results
10.5 A Plugin Computing Inter-Annotator Agreement (IAA)
10.5.1 IAA for Classification
10.5.2 IAA For Named Entity Annotation
10.5.3 The BDM-Based IAA Scores
10.6 A Plugin Computing the BDM Scores for an Ontology
10.6.1 Computing BDM from embedded code
10.7 Quality Assurance Summariser for Teamware
11 Profiling Processing Resources
11.1 Overview
11.1.1 Features
11.1.2 Limitations
11.2 Graphical User Interface
11.3 Command Line Interface
11.4 Application Programming Interface
11.4.1 Log4j.properties
11.4.2 Benchmark log format
11.4.3 Enabling profiling
11.4.4 Reporting tool
12 Developing GATE
12.1 Reporting Bugs and Requesting Features
12.2 Contributing Patches
12.3 Creating New Plugins
12.3.1 What to Call your Plugin
12.3.2 Writing a New PR
12.3.3 Writing a New VR
12.3.4 Writing a ‘Ready Made’ Application
12.3.5 Distributing Your New Plugins
12.4 Adding your plugin to the default list
12.5 Updating this User Guide
12.5.1 Building the User Guide
12.5.2 Making Changes to the User Guide
III CREOLE Plugins
13 Gazetteers
13.1 Introduction to Gazetteers
13.2 ANNIE Gazetteer
13.2.1 Creating and Modifying Gazetteer Lists
13.2.2 ANNIE Gazetteer Editor
13.3 OntoGazetteer
13.4 Gaze Ontology Gazetteer Editor
13.4.1 The Gaze Gazetteer List and Mapping Editor
13.4.2 The Gaze Ontology Editor
13.5 Hash Gazetteer
13.5.1 Prerequisites
13.5.2 Parameters
13.6 Flexible Gazetteer
13.7 Gazetteer List Collector
13.8 OntoRoot Gazetteer
13.8.1 How Does it Work?
13.8.2 Initialisation of OntoRoot Gazetteer
13.8.3 Simple steps to run OntoRoot Gazetteer
13.9 Large KB Gazetteer
13.9.1 Quick usage overview
13.9.2 Dictionary setup
13.9.3 Additional dictionary configuration
13.9.4 Dictionary for Gazetteer List Files
13.9.5 Processing Resource Configuration
13.9.6 Runtime configuration
13.9.7 Semantic Enrichment PR
13.10 The Shared Gazetteer for multithreaded processing
13.11 Extended Gazetteer
13.12 Feature Gazetteer
14 Working with Ontologies
14.1 Data Model for Ontologies
14.1.1 Hierarchies of Classes and Restrictions
14.1.2 Instances
14.1.3 Hierarchies of Properties
14.1.4 URIs
14.2 Ontology Event Model
14.2.1 What Happens when a Resource is Deleted?
14.3 The Ontology Plugin
14.3.1 Upgrading from previous versions of GATE
14.3.2 The OWLIMOntology Language Resource
14.3.3 The ConnectSesameOntology Language Resource
14.3.4 The CreateSesameOntology Language Resource
14.3.5 The OWLIM2 Backwards-Compatible Language Resource
14.3.6 Using Ontology Import Mappings
14.3.7 Using BigOWLIM
14.3.8 The sesameCLI command line interface
14.4 GATE Ontology Editor
14.5 Ontology Annotation Tool
14.5.1 Viewing Annotated Text
14.5.2 Editing Existing Annotations
14.5.3 Adding New Annotations
14.5.4 Options
14.6 Relation Annotation Tool
14.6.1 Description of the two views
14.6.2 Create new annotation and instance from text selection
14.6.3 Create new annotation and add label to existing instance from text selection
14.6.4 Create and set properties for annotation relation
14.6.5 Delete instance, label or property
14.6.6 Differences with OAT and Ontology Editor
14.7 Using the ontology API
14.8 Ontology-Aware JAPE Transducer
14.9 Annotating Text with Ontological Information
14.10 Populating Ontologies
15 Non-English Language Support
15.1 Language Identification
15.1.1 The Optimaize Language Detector
15.1.2 Language Identification with TextCat
15.1.3 Fingerprint Generation
15.2 French Plugin
15.3 German Plugin
15.4 Romanian Plugin
15.5 Arabic Plugin
15.6 Chinese Plugin
15.6.1 Chinese Word Segmentation
15.7 Hindi Plugin
15.8 Russian Plugin
15.9 Bulgarian Plugin
15.10 Danish Plugin
15.11 Welsh Plugin
16 Domain Specific Resources
16.1 Biomedical Support
16.1.1 ABNER
16.1.2 MetaMap
16.1.3 GSpell biomedical spelling suggestion and correction
16.1.4 BADREX
16.1.5 MiniChem/Drug Tagger
16.1.6 AbGene
16.1.7 GENIA
16.1.8 Penn BioTagger
16.1.9 MutationFinder
17 Tools for Social Media Data
17.1 Tools for Twitter
17.2 Twitter JSON format
17.2.1 Entity annotations in JSON
17.3 Exporting GATE documents as JSON
17.4 Low-level PRs for Tweets
17.5 Handling multi-word hashtags
17.6 The TwitIE Pipeline
18 Parsers
18.1 SUPPLE Parser
18.1.1 Requirements
18.1.2 Building SUPPLE
18.1.3 Running the Parser in GATE
18.1.4 Viewing the Parse Tree
18.1.5 System Properties
18.1.6 Configuration Files
18.1.7 Parser and Grammar
18.1.8 Mapping Named Entities
18.2 Stanford Parser
18.2.1 Input Requirements
18.2.2 Initialization Parameters
18.2.3 Runtime Parameters
19 Machine Learning
19.1 Brief introduction to machine learning in GATE
20 Tools for Alignment Tasks
20.1 Introduction
20.2 The Tools
20.2.1 Compound Document
20.2.2 CompoundDocumentFromXml
20.2.3 Compound Document Editor
20.2.4 Composite Document
20.2.5 DeleteMembersPR
20.2.6 SwitchMembersPR
20.2.7 Saving as XML
20.2.8 Alignment Editor
20.2.9 Saving Files and Alignments
20.2.10 Section-by-Section Processing
21 Crowdsourcing Data with GATE
21.1 The Basics
21.2 Entity classification
21.2.1 Creating a classification job
21.2.2 Loading data into a job
21.2.3 Importing the results
21.2.4 Automatic adjudication
21.3 Entity annotation
21.3.1 Creating an annotation job
21.3.2 Loading data into a job
21.3.3 Importing the results
21.3.4 Automatic adjudication
22 Combining GATE and UIMA
22.1 Embedding a UIMA AE in GATE
22.1.1 Mapping File Format
22.1.2 The UIMA Component Descriptor
22.1.3 Using the AnalysisEnginePR
22.2 Embedding a GATE CorpusController in UIMA
22.2.1 Mapping File Format
22.2.2 The GATE Application Definition
22.2.3 Configuring the GATEApplicationAnnotator
23 More (CREOLE) Plugins
23.1 Verb Group Chunker
23.2 Noun Phrase Chunker
23.2.1 Differences from the Original
23.2.2 Using the Chunker
23.3 TaggerFramework
23.3.1 TreeTagger—Multilingual POS Tagger
23.3.2 GENIA and Double Quotes
23.4 Chemistry Tagger
23.4.1 Using the Tagger
23.5 TextRazor Annotation Service
23.6 Annotating Numbers
23.6.1 Numbers in Words and Numbers
23.6.2 Roman Numerals
23.7 Annotating Measurements
23.8 Annotating and Normalizing Dates
23.9 Snowball Based Stemmers
23.9.1 Algorithms
23.10 GATE Morphological Analyzer
23.10.1 Rule File
23.11 Flexible Exporter
23.12 Configurable Exporter
23.13 Annotation Set Transfer
23.14 Schema Enforcer
23.15 Information Retrieval in GATE
23.15.1 Using the IR Functionality in GATE
23.15.2 Using the IR API
23.16 WordNet in GATE
23.16.1 The WordNet API
23.17 Kea - Automatic Keyphrase Detection
23.17.1 Using the ‘KEA Keyphrase Extractor’ PR
23.17.2 Using Kea Corpora
23.18 Annotation Merging Plugin
23.19 Copying Annotations between Documents
23.20 LingPipe Plugin
23.20.1 LingPipe Tokenizer PR
23.20.2 LingPipe Sentence Splitter PR
23.20.3 LingPipe POS Tagger PR
23.20.4 LingPipe NER PR
23.20.5 LingPipe Language Identifier PR
23.21 OpenNLP Plugin
23.21.1 Init parameters and models
23.21.2 OpenNLP PRs
23.21.3 Obtaining and generating models
23.22 Stanford CoreNLP
23.22.1 Stanford Tagger
23.22.2 Stanford Parser
23.22.3 Stanford Named Entity Recognition
23.23 Content Detection Using Boilerpipe
23.24 Inter Annotator Agreement
23.25 Schema Annotation Editor
23.26 Coref Tools Plugin
23.27 Pubmed Format
23.28 MediaWiki Format
23.29 Fast Infoset Document Format
23.30 GATE JSON Document Format
23.31 Bdoc Format (JSON, YAML, MsgPack)
23.32 DataSift Document Format
23.33 CSV Document Support
23.34 TermRaider term extraction tools
23.34.1 Termbank language resources
23.34.2 Termbank Score Copier
23.34.3 The PMI bank language resource
23.35 Document Normalizer
23.36 Developer Tools
23.37 Linguistic Simplifier
23.38 GATE-Time
23.38.1 DCTParser
23.38.2 HeidelTime
23.38.3 TimeML Event Detection
23.39 StringAnnotation Plugin
23.40 CorpusStats Plugin
23.41 ModularPipelines Plugin
23.42 Java Plugin
23.43 Python Plugin
IV The GATE Family: Cloud, MIMIR, Teamware
24 GATE Cloud
24.1 GATE Cloud services: an overview
24.2 Using GATE Cloud services
24.3 Annotation Jobs on GATE Cloud
24.3.1 The Annotation Service Charges Explained
24.3.2 Where to find more details
24.4 GATE Cloud Pipeline URLs
25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
25.1 Introduction
25.2 Requirements for Multi-Role Collaborative Annotation Environments
25.2.1 Typical Division of Labour
25.2.2 Remote, Scalable Data Storage
25.2.3 Automatic annotation services
25.2.4 Workflow Support
25.3 Teamware: Architecture, Implementation, and Examples
25.3.1 Data Storage Service
25.3.2 Annotation Services
25.3.3 The Executive Layer
25.3.4 The User Interfaces
25.4 Practical Applications
26 GATE Mímir
Appendices
A Change Log
A.1 Version 9.0.1 (March 2021)
A.2 Version 9.0 (February 2021)
A.3 Version 8.6.1 (January 2020)
A.4 Version 8.6 (June 2019)
A.5 Version 8.5.1 (June 2018)
A.6 Version 8.5 (May 2018)
A.6.1 For developers
A.7 Version 8.4.1 (June 2017)
A.8 Version 8.4 (February 2017)
A.8.1 Java compatibility
A.9 Version 8.3 (January 2017)
A.9.1 Java compatibility
A.10 Version 8.2 (May 2016)
A.10.1 Java compatibility
A.11 Version 8.1 (June 2015)
A.11.1 New plugins and significant new features
A.11.2 Library updates and bugfixes
A.11.3 Tools for developers
A.12 Version 8.0 (May 2014)
A.12.1 Major changes
A.12.2 Other new and improved plugins
A.12.3 Bug fixes and other improvements
A.12.4 For developers
A.13 Version 7.1 (November 2012)
A.13.1 New plugins
A.13.2 Library updates
A.13.3 GATE Embedded API changes
A.14 Version 7.0 (February 2012)
A.14.1 Major new features
A.14.2 Removal of deprecated functionality
A.14.3 Other enhancements and bug fixes
A.15 Version 6.1 (April 2011)
A.15.1 New CREOLE Plugins
A.15.2 Other new features and improvements
A.16 Version 6.0 (November 2010)
A.16.1 Major new features
A.16.2 Breaking changes
A.16.3 Other new features and bugfixes
A.17 Version 5.2.1 (May 2010)
A.18 Version 5.2 (April 2010)
A.18.1 JAPE and JAPE-related
A.18.2 Other Changes
A.19 Version 5.1 (December 2009)
A.19.1 New Features
A.19.2 JAPE improvements
A.19.3 Other improvements and bug fixes
A.20 Version 5.0 (May 2009)
A.20.1 Major New Features
A.20.2 Other New Features and Improvements
A.20.3 Specific Bug Fixes
A.21 Version 4.0 (July 2007)
A.21.1 Major New Features
A.21.2 Other New Features and Improvements
A.21.3 Bug Fixes and Optimizations
A.22 Version 3.1 (April 2006)
A.22.1 Major New Features
A.22.2 Other New Features and Improvements
A.22.3 Bug Fixes
A.23 January 2005
A.24 December 2004
A.25 September 2004
A.26 Version 3 Beta 1 (August 2004)
A.27 July 2004
A.28 June 2004
A.29 April 2004
A.30 March 2004
A.31 Version 2.2 – August 2003
A.32 Version 2.1 – February 2003
A.33 June 2002
B Version 5.1 Plugins Name Map
C Obsolete CREOLE Plugins
C.1 Ontotext JapeC Compiler
C.2 Google Plugin
C.3 Yahoo Plugin
C.3.1 Using the YahooPR
C.4 Gazetteer Visual Resource - GAZE
C.4.1 Display Modes
C.4.2 Linear Definition Pane
C.4.3 Linear Definition Toolbar
C.4.4 Operations on Linear Definition Nodes
C.4.5 Gazetteer List Pane
C.4.6 Mapping Definition Pane
C.5 Google Translator PR
D Design Notes
D.1 Patterns
D.1.1 Components
D.1.2 Model, view, controller
D.1.3 Interfaces
D.2 Exception Handling
E Ant Tasks for GATE
E.1 Declaring the Tasks
E.2 The packagegapp task - bundling an application with its dependencies
E.2.1 Introduction
E.2.2 Basic Usage
E.2.3 Handling Non-Plugin Resources
E.2.4 Streamlining your Plugins
E.2.5 Bundling Extra Resources
E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml
F Named-Entity State Machine Patterns
F.1 Main.jape
F.2 first.jape
F.3 firstname.jape
F.4 name.jape
F.4.1 Person
F.4.2 Location
F.4.3 Organization
F.4.4 Ambiguities
F.4.5 Contextual information
F.5 name_post.jape
F.6 date_pre.jape
F.7 date.jape
F.8 reldate.jape
F.9 number.jape
F.10 address.jape
F.11 url.jape
F.12 identifier.jape
F.13 jobtitle.jape
F.14 final.jape
F.15 unknown.jape
F.16 name_context.jape
F.17 org_context.jape
F.18 loc_context.jape
F.19 clean.jape
G Part-of-Speech Tags used in the Hepple Tagger
H Copyright and Licence
Part I
GATE Basics
Chapter 1
Introduction
GATE is an infrastructure for developing and deploying software components that process human language. It is nearly 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from multi-million-euro research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents.
GATE is open source free software; users can obtain free support from the user and developer community via GATE.ac.uk or on a commercial basis from our industrial partners. We are the biggest open source language processing project, with a development team more than double the size of the largest comparable projects (many of which are integrated with GATE). More than €5 million has been invested in GATE development; our objective is to make sure that this continues to be money well spent for all GATE’s users.
The GATE family of tools has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE is:
- an IDE, GATE Developer: an integrated development environment for language processing components bundled with a very widely used Information Extraction system and a comprehensive set of other plugins
- a cloud computing solution for hosted large-scale text processing, GATE Cloud (https://cloud.gate.ac.uk/). See also Chapter 24.
- a web app, GATE Teamware: a collaborative annotation environment for factory-style semantic annotation projects built around a workflow engine and a heavily-optimised backend service infrastructure. See also Chapter 25.
- a multi-paradigm search repository, GATE Mímir, which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic meta-data (instance data). It allows queries that arbitrarily mix full-text, structural, linguistic and semantic queries and that can scale to terabytes of text. See also Chapter 26.
- a framework, GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more.
- an architecture: a high-level organisational picture of language processing software composition.
- a process for the creation of robust and maintainable services.
We also develop:
- a wiki/CMS, GATE Wiki (http://gatewiki.sf.net/), mainly to host our own websites and as a testbed for some of our experiments
For more information on the GATE family see http://gate.ac.uk/family/ and also Part IV of this book.
One of our original motivations was to remove the necessity for solving common engineering problems before doing useful research, or re-engineering before deploying research results into applications. Core functions of GATE take care of the lion’s share of the engineering:
- modelling and persistence of specialised data structures
- measurement, evaluation, benchmarking (never believe a computing researcher who hasn’t measured their results in a repeatable and open setting!)
- visualisation and editing of annotations, ontologies, parse trees, etc.
- a finite state transduction language for rapid prototyping and efficient implementation of shallow analysis methods (JAPE)
- extraction of training instances for machine learning
- pluggable machine learning implementations (Weka, SVM Light, ...)
On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).
GATE version 1 was written in the mid-1990s; at the turn of the new millennium we completely rewrote the system in Java; version 5 was released in June 2009 and version 6 in November 2010. We believe that GATE is the leading system of its type, but as scientists we have to advise you not to take our word for it; that’s why we’ve measured our software in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE, DUC and more; see Section 1.4 for details). We invite you to give it a try, to get involved with the GATE community, and to contribute to human language science, engineering and development.
This book describes how to use GATE to develop language processing components, test their performance and deploy them as parts of other applications. In the rest of this chapter:
- Section 1.1 describes the best way to use this book;
- Section 1.2 briefly notes that the context of GATE is applied language processing, or Language Engineering;
- Section 1.3 gives an overview of developing using GATE;
- Section 1.4 lists publications describing GATE performance in evaluations;
- Section 1.5 outlines what is new in the current version of GATE;
- Section 1.6 lists other publications about GATE.
Note: if you don’t see the component you need in this document, or if we mention a component that you can’t see in the software, contact gate-users@lists.sourceforge.net. Various components are developed by our collaborators, with whom we will be happy to put you in contact. (Often the process of getting a new component is as simple as typing the URL into GATE Developer; the system will do the rest.)
1.1 How to Use this Text
The material presented in this book ranges from the conceptual (e.g. ‘what is software architecture?’) to practical instructions for programmers (e.g. how to deal with GATE exceptions) and linguists (e.g. how to write a pattern grammar). Furthermore, GATE’s highly extensible nature means that new functionality is constantly being added in the form of new plugins. Important functionality is as likely to be located in a plugin as it is to be integrated into the GATE core. This presents something of an organisational challenge. Our (no doubt imperfect) solution is to divide this book into three parts.
Part I covers installation, using the GATE Developer GUI and using ANNIE, as well as providing some background and theory. We recommend that new users begin with Part I. Part II covers the more advanced of the core GATE functionality: the GATE Embedded API and the JAPE pattern language, among other things. Part III provides a reference for the numerous plugins that have been created for GATE. Although ANNIE provides a good starting point, the user will soon wish to explore other resources, and so will need to consult this part of the text. We recommend that Part III be used as a reference, to be dipped into as necessary. In Part III, plugins are grouped into broad areas of functionality.
1.2 Context
GATE can be thought of as a Software Architecture for Language Engineering [Cunningham 00].
‘Software Architecture’ is used rather loosely here to mean computer infrastructure for software development, including development environments and frameworks, as well as the more usual use of the term to denote a macro-level organisational structure for software systems [Shaw & Garlan 96].
Language Engineering (LE) may be defined as:
…the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice. [Cunningham 99a]
The relevant scientific results in this case are the outputs of Computational Linguistics, Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines, LE, as an engineering discipline, entails predictability, both of the process of constructing LE-based software and of the performance of that software after its completion and deployment in applications.
Some working definitions:
- Computational Linguistics (CL): science of language that uses computation as an investigative tool.
- Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.
- Language Engineering (LE): building NLP systems whose cost and outputs are measurable and predictable.
- Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to mean infrastructure.
- Software Architecture for Language Engineering (SALE): software infrastructure, architecture and development tools for applied CL, NLP and LE.
(Of course the practice of these fields is broader and more complex than these definitions.)
In the scientific endeavours of NLP and CL, GATE’s role is to support experimentation. In this context GATE’s significant features include support for automated measurement (see Chapter 10), providing a ‘level playing field’ where results can easily be repeated across different sites and environments, and reducing research overheads in various ways.
1.3 Overview
1.3.1 Developing and Deploying Language Processing Facilities
GATE as an architecture suggests that the elements of software systems that process natural language can usefully be broken down into various types of component, known as resources. Components are reusable software chunks with well-defined interfaces, and are a popular architectural form, used in Sun’s Java Beans and Microsoft’s .Net, for example. GATE components are specialised types of Java Bean, and come in three flavours:
- LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;
- ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;
- VisualResources (VRs) represent visualisation and editing components that participate in GUIs.
These definitions can be blurred in practice as necessary.
Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering. All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data. The JAR and XML files are made available to GATE by putting them on a web server, or simply placing them in the local file space. Section 1.3.2 introduces GATE’s built-in resource set.
When using GATE to develop language processing functionality for an application, the developer uses GATE Developer and GATE Embedded to construct resources of the three types. This may involve programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both. GATE Developer is used for visualisation of the data structures produced and consumed during processing, and for debugging, performance measurement and so on. For example, figure 1.1 is a screenshot of one of the visualisation tools.
GATE Developer is analogous to systems like Mathematica for Mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.
When an appropriate set of resources have been developed, they can then be embedded in the target client application using GATE Embedded. GATE Embedded is supplied as a series of JAR files.8 To embed GATE-based language processing facilities in an application, these JAR files are all that is needed, along with JAR files and XML configuration files for the various resources that make up the new facilities.
1.3.2 Built-In Components
GATE includes resources for common LE data structures and algorithms, including documents, corpora and various annotation types, a set of language analysis components for Information Extraction and a range of data visualisation and editing components.
GATE supports documents in a variety of formats including XML, RTF, email, HTML, SGML and plain text. In all cases the format is analysed and converted into a single unified model of annotation. The annotation format is a modified form of the TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format [Bird & Liberman 99], and uses the now standard mechanism of ‘stand-off markup’. GATE documents, corpora and annotations are stored in databases of various sorts, visualised via the development environment, and accessed at code level via the framework. See Chapter 5 for more details of corpora etc.
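To make the idea of stand-off markup concrete, here is a minimal sketch in plain Java. It does not use the GATE API: the `StandOffSketch` class and its nested `Span` type are hypothetical names introduced only for illustration. The point it tries to show is that the text itself is never modified; each annotation simply records a type and a pair of character offsets into that text.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical stand-off model, not a GATE class. */
final class StandOffSketch {

    /** One annotation: a type plus offsets into an unmodified text. */
    static final class Span {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the text covered by every annotation of the given type. */
    static List<String> covered(String text, List<Span> spans, String type) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            if (s.type.equals(type)) {
                out.add(text.substring(s.start, s.end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Fatima works at Cyberdyne Systems.";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Person", 0, 6));
        spans.add(new Span("Organization", 16, 33));
        System.out.println(covered(text, spans, "Person"));       // [Fatima]
        System.out.println(covered(text, spans, "Organization")); // [Cyberdyne Systems]
    }
}
```

Because the annotations live beside the text rather than inside it, several overlapping layers of markup can describe the same document without interfering with one another.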
A family of Processing Resources for language analysis is included in the shape of ANNIE, A Nearly-New Information Extraction system. These components use finite state techniques to implement various tasks from tokenisation to semantic tagging or verb phrase chunking. All ANNIE components communicate exclusively via GATE’s document and annotation resources. See Chapter 6 for more details. Other CREOLE resources are described in Part III.
1.3.3 Additional Facilities in GATE Developer/Embedded
Three other facilities in GATE deserve special mention:
-
JAPE, a Java Annotation Patterns Engine, provides regular-expression based pattern/action rules over annotations – see Chapter 8.
-
The ‘annotation diff’ tool in the development environment implements performance metrics such as precision and recall for comparing annotations. Typically a language analysis component developer will mark up some documents by hand and then use these along with the diff tool to automatically measure the performance of the components. See Chapter 10.
-
GUK, the GATE Unicode Kit, fills in some of the gaps in the JDK’s9 support for Unicode, e.g. by adding input methods for various languages from Urdu to Chinese. See Section 3.11.2 for more details.
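The idea behind JAPE in the first item above, regular-expression patterns matched over annotations rather than over characters, can be sketched in plain Java. This is neither JAPE syntax nor the GATE API. The sketch assumes a hypothetical encoding in which each token is reduced to a one-letter type code (`T` for a title, `U` for a capitalised token, `L` for anything else), and then lets `java.util.regex` find the pattern "a title followed by one or more capitalised tokens". The "action" part simply collects the matched tokens as a person name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a regex over token types, not the JAPE engine. */
final class PatternOverAnnotations {

    // Hypothetical one-letter codes standing in for annotation types.
    static char code(String token) {
        if (token.equals("Mr") || token.equals("Ms") || token.equals("Dr")) {
            return 'T'; // title
        }
        if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
            return 'U'; // capitalised token
        }
        return 'L';     // anything else
    }

    /** Pattern: Title Upper+ ; action: emit the Upper tokens as one name. */
    static List<String> findPersons(List<String> tokens) {
        StringBuilder types = new StringBuilder();
        for (String t : tokens) {
            types.append(code(t));
        }
        List<String> persons = new ArrayList<>();
        Matcher m = Pattern.compile("T(U+)").matcher(types);
        while (m.find()) {
            // One character per token, so match offsets are token indices.
            persons.add(String.join(" ", tokens.subList(m.start(1), m.end(1))));
        }
        return persons;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "Yesterday", "Dr", "Sarah", "Connor", "met", "Mr", "Dyson", "again");
        System.out.println(findPersons(tokens)); // [Sarah Connor, Dyson]
    }
}
```

Real JAPE rules match on annotation types and feature values and can run arbitrary Java on the right-hand side; Chapter 8 describes the actual language.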
1.3.4 An Example
This section gives a very brief example of a typical use of GATE to develop and deploy language processing capabilities in an application, and to generate quantitative results for scientific publication.
Let’s imagine that a developer called Fatima is building an email client10 for Cyberdyne Systems’ large corporate Intranet. In this application she would like to have a language processing system that automatically spots the names of people in the corporation and transforms them into mailto hyperlinks.
A little investigation shows that GATE’s existing components can be tailored to this purpose. Fatima starts up GATE Developer, and creates a new document containing some example emails. She then loads some processing resources that will do named-entity recognition (a tokeniser, gazetteer and semantic tagger), and creates an application to run these components on the document in sequence. Having processed the emails, she can see the results in one of several viewers for annotations.
The GATE components are a decent start, but they need to be altered to deal specially with people from Cyberdyne’s personnel database. Therefore Fatima creates new ‘cyber-’ versions of the gazetteer and semantic tagger resources, using the ‘bootstrap’ tool. This tool creates a directory structure on disk that has some Java stub code, a Makefile and an XML configuration file. After several hours struggling with badly written documentation, Fatima manages to compile the stubs and create a JAR file containing the new resources. She tells GATE Developer the URL of these files11, and the system then allows her to load them in the same way that she loaded the built-in resources earlier on.
Fatima then creates a second copy of the email document, and uses the annotation editing facilities to mark up the results that she would like to see her system producing. She saves this and the version that she ran GATE on into her serial datastore. From now on she can follow this routine:
-
Run her application on the email test corpus.
-
Check the performance of the system by running the ‘annotation diff’ tool to compare her manual results with the system’s results. This gives her both percentage accuracy figures and a graphical display of the differences between the machine and human outputs.
-
Make edits to the code, pattern grammars or gazetteer lists in her resources, and recompile where necessary.
-
Tell GATE Developer to re-initialise the resources.
-
Go to 1.
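The figures that step 2 of this routine produces can be illustrated with a small sketch. This is not the annotation diff tool's implementation. It assumes strict matching, in which a system annotation counts as correct only if its type and both offsets equal those of a manual annotation, and it encodes each annotation as a hypothetical `type:start:end` string so that plain sets can be compared.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: exact-match precision, recall and F1 over two sets. */
final class DiffSketch {

    /** Returns {precision, recall, F1}; an empty denominator yields 0. */
    static double[] score(Set<String> manual, Set<String> system) {
        Set<String> correct = new HashSet<>(system);
        correct.retainAll(manual); // annotations found in both sets

        double p = system.isEmpty() ? 0.0 : (double) correct.size() / system.size();
        double r = manual.isEmpty() ? 0.0 : (double) correct.size() / manual.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        // Hypothetical annotations encoded as "type:start:end".
        Set<String> manual = new HashSet<>(Arrays.asList(
            "Person:0:6", "Person:20:32", "Person:50:58", "Person:70:75"));
        Set<String> system = new HashSet<>(Arrays.asList(
            "Person:0:6", "Person:20:32", "Person:40:45"));

        double[] s = score(manual, system);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

In this example two of the three system annotations are correct (precision 2/3) and two of the four manual annotations are found (recall 1/2). Chapter 10 describes the measures that GATE actually reports, including more lenient treatments of partially overlapping annotations.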
To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that it regenerates itself from the local personnel data. She then alters the pattern grammar in the semantic tagger to prioritise recognition of names from that source. This latter job involves learning the JAPE language (see Chapter 8), but as this is based on regular expressions it isn’t too difficult.
Eventually the system is running nicely, and her accuracy is 93% (there are still some problem cases, e.g. when people use nicknames, but the performance is good enough for production use). Now Fatima stops using GATE Developer and works instead on embedding the new components in her email application using GATE Embedded. This application is written in Java, so embedding is very easy12: the GATE JAR files are added to the project CLASSPATH, the new components are placed on a web server, and with a little code to do initialisation, loading of components and so on, the job is finished in half a day – the code to talk to GATE takes up only around 150 lines of the eventual application, most of which is just copied from the example in the sheffield.examples.StandAloneAnnie class.
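The last step of Fatima's application, turning recognised names into mailto hyperlinks, might look roughly like the following once the person annotations are available. The sketch deliberately avoids the GATE API: the `PersonSpan` class, the `linkify` method and the example address are all hypothetical, and a real application would read the offsets from the annotations produced by the pipeline. It also assumes that spans do not overlap and that the text needs no HTML escaping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: rewrites annotated names as mailto links. */
final class MailtoSketch {

    /** A recognised person: offsets into the text plus an address. */
    static final class PersonSpan {
        final int start; // inclusive
        final int end;   // exclusive
        final String email;

        PersonSpan(int start, int end, String email) {
            this.start = start;
            this.end = end;
            this.email = email;
        }
    }

    /** Copies the text, wrapping each span in an anchor element. */
    static String linkify(String text, List<PersonSpan> spans) {
        List<PersonSpan> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((PersonSpan s) -> s.start));

        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (PersonSpan s : sorted) {
            out.append(text, pos, s.start); // text before the name
            out.append("<a href=\"mailto:").append(s.email).append("\">")
               .append(text, s.start, s.end)
               .append("</a>");
            pos = s.end;
        }
        out.append(text.substring(pos));    // text after the last name
        return out.toString();
    }

    public static void main(String[] args) {
        List<PersonSpan> spans = new ArrayList<>();
        spans.add(new PersonSpan(4, 10, "fatima@example.com"));
        System.out.println(linkify("Ask Fatima about the report.", spans));
        // Ask <a href="mailto:fatima@example.com">Fatima</a> about the report.
    }
}
```

Working from offsets rather than searching for the name strings again keeps the output consistent with whatever the language processing components actually recognised.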
Because Fatima is worried about Cyberdyne’s unethical policy of developing Skynet to help the large corporates of the West strengthen their strangle-hold over the World, she wants to get a job as an academic instead (so that her conscience will only have to cope with the torture of students, as opposed to humanity). She takes the accuracy measures that she has attained for her system and writes a paper for the Journal of Nasturtium Logarithm Incitement describing the approach used and the results obtained. Because she used GATE for development, she can cite the repeatability of her experiments and offer access to example binary versions of her software by putting them on an external web server.
And everybody lived happily ever after.
1.4 Some Evaluations
This section contains an incomplete list of publications describing systems that used GATE in competitive quantitative evaluation programmes. These programmes have had a significant impact on the language processing field and the widespread presence of GATE is some measure of the maturity of the system and of our understanding of its likely performance on diverse text processing tasks.
-
describes the performance of an SVM-based learning system in the NTCIR-6 Patent Retrieval Task. The system achieved the best result on two of three measures used in the task evaluation, namely the R-Precision and F-measure. The system obtained close to the best result on the remaining measure (A-Precision).
-
describes a cross-source coreference resolution system based on semantic clustering. It uses GATE for information extraction and the SUMMA system to create summaries and semantic representations of documents. One system configuration ranked 4th in the Web People Search 2007 evaluation.
-
describes a cross-lingual summarization system which uses SUMMA components and the Arabic plugin available in GATE to produce summaries in English from a mixture of English and Arabic documents.
-
Open-Domain Question Answering:
-
The University of Sheffield has a long history of research into open-domain question answering. GATE has formed the basis of much of this research resulting in systems which have ranked highly during independent evaluations since 1999. The first successful question answering system developed at the University of Sheffield was evaluated as part of TREC 8 and used the LaSIE information extraction system (the forerunner of ANNIE) which was distributed with GATE [Humphreys et al. 99]. Further research was reported in [Scott & Gaizauskas. 00], [Greenwood et al. 02], [Gaizauskas et al. 03], [Gaizauskas et al. 04] and [Gaizauskas et al. 05]. In 2004 the system was ranked 9th out of 28 participating groups.
-
describes techniques for answering definition questions. The system uses definition patterns manually implemented in GATE as well as learned JAPE patterns induced from a corpus. In 2004, the system was ranked 4th in the TREC/QA evaluations.
-
describes a multidocument summarization system implemented using summarization components compatible with GATE (the SUMMA system). The system was ranked 2nd in the Document Understanding Evaluation programmes.
-
[Maynard et al. 03e] and [Maynard et al. 03d]
-
describe participation in the TIDES surprise language program. ANNIE was adapted to Cebuano with four person-days of effort, and achieved an F-measure of 77.5%. Unfortunately, ours was the only system participating!
-
[Maynard et al. 02b] and [Maynard et al. 03b]
-
describe results obtained on systems designed for the ACE task (Automatic Content Extraction). Although a comparison to other participating systems cannot be revealed due to the stipulations of ACE, results show 82%-86% precision and recall.
-
describes the LaSIE-II system used in MUC-7.
-
describes the LaSIE-II system used in MUC-6.
1.5 Recent Changes
This section details recent changes made to GATE. Appendix A provides a complete change log.
It was brought to our attention that in versions 9.0.1 and below there was a very small chance that the GUI action “Export for GATE Cloud” could be compromised. This would have required malicious code to be running locally on the machine, either run by another user on a multi-user machine or present because the computer had already been compromised. This issue only occurred within the GUI action and did not affect API use of the gate-core Maven artifact. Note that no known exploits exist for this issue, and we do not know for certain that the code could be exploited. If, however, you are at all concerned then we suggest you regenerate any packaged applications using a recent version of GATE Developer; at minimum 9.2-SNAPSHOT built on or after the 10th of August 2022.
1.5.1 Version 9.0.1 (March 2021)
GATE Developer 9.0.1 is a bugfix release – the only change is to the way URL redirects are handled when loading a document. Support for following redirects from http to https was added in 9.0 which, while correct, broke the way URLs were used within GCP. This release fixes that bug and adds some additional security checking to the redirect handling.
1.5.2 Version 9.0 (February 2021)
Whilst the majority of changes in GATE Developer 9.0 are small, a number of them change default behaviour (in the UI or API), hence the change in version number. These changes include:
-
We now recommend that users install a 64-bit version of Java whenever possible. This seems to be especially important on Windows.
-
We now default to assuming documents are UTF-8 encoded unless you specify otherwise. In previous versions if no encoding was specified GATE would use the default platform encoding, but this seemed to cause more problems than it solved (especially for Windows users). If you want the old behaviour then ensure the encoding parameter is set to the empty string when creating a document.
-
GATE uses a library called XStream for saving and loading GATE XML documents and applications. This allows us to store features of any Java type, but that can be abused by maliciously crafted files. In general use this is unlikely to be a problem, but in situations where GATE may be used as part of a service with no way of vetting input files it could present a serious security threat. XStream now offers a security framework to restrict the types of objects that can be loaded/saved. This can work either by allowing only specific types or by preventing specific types from being used. As we often do not know in advance what features might be used we have opted to use a minimal blacklist as the default security setting. This blocks the Java classes known to be exploitable. This can be further configured via calls to Gate.setXStreamSecurity() and we strongly encourage developers who depend on gate-core within larger applications to configure this based on their specific use cases.
-
Developers wishing to build GATE from source need to use Maven v3.6.0 or above.
-
Previous versions of GATE used Log4J for some of the logging. This was problematic when using gate-core as a dependency in larger projects and was awkward to configure properly. In this release we’ve switched to using SLF4J allowing the actual logging back-end to be configured independently. Plugins and code compiled against previous versions of GATE should work with the new release without change (we include the log4j-over-slf4j bridge as a dependency), although Log4J specific methods within gate-core have been deprecated and may be removed in a future release.
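The switch to a UTF-8 default described in the list above is easier to appreciate with a small, GATE-independent sketch of what goes wrong when UTF-8 bytes are decoded with a different charset. ISO-8859-1 stands in here for a hypothetical non-UTF-8 platform default; no GATE classes are involved, and the `EncodingSketch` name is made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: the same bytes decoded with two charsets. */
final class EncodingSketch {

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source encoding is irrelevant.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, the four characters come back intact.
        String right = decode(utf8, StandardCharsets.UTF_8);

        // Decoded as ISO-8859-1, each of the two bytes of "é" becomes
        // its own character, giving five characters instead of four.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5
    }
}
```

Offsets computed over the wrongly decoded text no longer line up with the intended characters, which is one reason a mismatched default encoding is troublesome for annotation-based processing.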
Many bugs have been fixed and documentation improved, in particular:
-
the Twitter plugin has been improved to make better use of the information provided by Twitter within a JSON Tweet object. The Hashtag tokenizer has been updated to provide a tokenized feature to make grouping semantically similar hashtags easier. Lots of other minor improvements and efficiency changes have been made throughout the rest of the TwitIE pipelines.
-
the ANNIE gazetteers have been updated to better support different ways of referring to countries, and now include a blacklist option to prevent things being wrongly annotated.
-
A new addition to the JAPE syntax allows you to copy all features from a matched annotation to the new annotation being created.
-
the Format_CSV plugin now allows the document cell to be interpreted as being a URL pointing to the document to load rather than the contents of the document. See Section 23.33 for more details.
1.5.3 Version 8.6.1 (January 2020)
GATE Developer 8.6.1 is a bugfix release – the only change is to adjust for the fact that the Central Maven repository has been switched from http to https.
1.5.4 Version 8.6 (June 2019)
GATE Developer 8.6 is mainly a maintenance and stability release, but there are some important new features, in particular around the processing of Twitter data:
-
The Format_Twitter plugin can now correctly handle extended 280 character tweets and the latest Twitter JSON format. See Section 17.2 for full details.
-
The new Format_JSON plugin provides import/export support for GATE JSON. This is essentially the old style Twitter format, but it no longer needs to track changes to the Twitter JSON format so should be more suitable for long term storage of GATE documents as JSON files. See Section 23.30 for more details. This plugin makes use of a new mechanism whereby document format parsers can take parameters via the document MIME type, which may be useful to third party formats too.
Many bugs have been fixed and documentation improved, in particular:
-
The plugin loading mechanism now properly respects the user’s Maven settings.xml:
-
HTTP proxy and “mirror” repository settings now work properly, including authentication. Also plugin resolution will now use the system proxy (if there is one) by default if there is no proxy specified in the Maven settings.
-
The “offline” setting is respected, and will prevent GATE from trying to fetch plugins from remote repositories altogether – for this to work, all the plugins you want to use must already be cached locally, or you can use “Export for GATE Cloud” to make a self-contained copy of an application including all its plugins.
-
Upgraded many dependencies including Tika and Jackson to avoid known security bugs in the previous versions.
-
Documentation improvements for the Kea plugin, the Corpus QA and annotation diff tools, and the default GATE XML and inline XML formats (section 3.9.1).
-
For plugin developers, the standard plugin testing framework generates a report detailing all the plugin-to-plugin dependencies, including those that are only expressed in the plugin’s example saved applications (section 7.12.1).
Some obsolete plugins have been removed (Websphinx web crawler, which depends on an unmaintained library, and the RASP parser, whose external binary is no longer available for modern operating systems), and there are many smaller bug fixes and improvements.
Note: following changes to Oracle’s JDK licensing scheme, we now recommend running GATE using the freely-available OpenJDK. The AdoptOpenJDK project offers simple installers for all major platforms, and major Linux distributions such as Ubuntu and CentOS offer OpenJDK packages as standard. See section 2.2 for full installation instructions.
1.5.5 Version 8.5.1 (June 2018)
Version 8.5.1 is a minor release to fix a few critical bugs in 8.5:
-
Fixed an exception that prevented the ANNIC search GUI from opening.
-
Fixed a problem with “Export for GATE Cloud” that meant some resources were not getting included in the output ZIP file.
-
Fixed the XML schema in the gate-spring library.
1.5.6 Version 8.5 (May 2018)
GATE Developer and Embedded 8.5 introduces a number of significant internal changes to the way plugins are managed, but with the exception of the plugin manager most users will not see significant changes in the way they use GATE.
-
The GATE plugins are no longer bundled with the GATE Developer distribution; instead, each plugin is downloaded from a repository at runtime, the first time it is used. This means the distribution is much smaller than previous versions.
-
Most plugins are now distributed as a single JAR file through the Java-standard “Central Repository”, and resource files such as gazetteers and JAPE grammars are bundled inside the plugin JAR rather than being separate files on disk. If you want to modify the resources of a plugin then GATE provides a tool to extract an editable copy of the files from a plugin onto your disk – it is no longer possible to edit plugin grammars in place.
-
This makes dependencies between plugins much easier to manage – a plugin can specify its dependencies declaratively by name and version number rather than by fragile relative paths between plugin directories.
GATE 8.5 remains backwards compatible with existing third-party plugins, though we encourage you to convert your plugins to the new style where possible.
Further details on these changes can be found in sections 3.5 (the plugin manager in GATE Developer), 7.3 (loading plugins via the GATE Embedded API), 7.12 (creating a new plugin from scratch), and 7.20 (converting an existing plugin to the new style).
If you have an existing saved application from GATE version 8.4.1 or earlier it will be necessary to “upgrade” it to use the new core plugins. An upgrade tool is provided on the “Tools” menu of GATE Developer, and is described in Section 3.9.5.
For developers
As part of this release, GATE development has moved from SourceForge to GitHub – bug reports, patches and feature requests should now use the GitHub issue tracker as described in section 12.1.
1.6 Further Reading
Lots of documentation lives on the GATE web site, including:
For more details about Sheffield University’s work in human language processing see the NLP group pages or A Definition and Short History of Language Engineering ([Cunningham 99a]). For more details about Information Extraction see IE, a User Guide or the GATE IE pages.
A list of publications on GATE and projects that use it (some of which are available on-line from http://gate.ac.uk/gate/doc/papers.html):
2010
-
describes the Teamware web-based collaborative annotation environment, emphasising the different roles that users play in the corpus annotation process.
-
presents the use of GATE in the development of controlled natural language interfaces. There is other related work by Damljanovic, Agatonovic, and Cunningham on using GATE to build natural language interfaces for querying ontologies.
-
discusses the use of GATE to process South Asian languages (Hindi and Gujarati).
2009
-
focuses in detail on the use of GATE for mining opinions and facts for business intelligence gathering from web content.
-
presents in more detail the text alignment component of GATE.
-
is the ‘Human Language Technologies’ chapter of ‘Semantic Knowledge Management’ (John Davies, Marko Grobelnik and Dunja Mladenić eds.)
-
discusses the use of semantic annotation for software engineering, as part of the TAO research project.
-
reviews the current state of the art in email processing and communication research, focusing on the roles played by email in information management, and commercial and research efforts to integrate a semantic-based approach to email.
-
investigates two techniques for making SVMs more suitable for language learning tasks. Firstly, an SVM with uneven margins (SVMUM) is proposed to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks.
2008
-
presents our approach to automatic patent enrichment, tested in large-scale, parallel experiments on USPTO and EPO documents.
-
presents Question-based Interface to Ontologies (QuestIO) - a tool for querying ontologies using unconstrained language-based queries.
-
presents a semantic-based prototype that is made for an open-source software engineering project with the goal of exploring methods for assisting open-source developers and software users to learn and maintain the system without major effort.
-
presents ServiceFinder.
-
describes our SVM-based system and several techniques we developed successfully to adapt SVM for the specific features of the F-term patent classification task.
-
reviews the recent developments in applying geometric and quantum mechanics methods for information retrieval and natural language processing.
-
investigates the state of the art in automatic textual annotation tools, and examines the extent to which they are ready for use in the real world.
-
discusses methods of measuring the performance of ontology-based information extraction systems, focusing particularly on the Balanced Distance Metric (BDM), a new metric we have proposed which aims to take into account the more flexible nature of ontologically-based applications.
-
investigates NLP techniques for ontology population, using a combination of rule-based approaches and machine learning.
-
presents the QuestIO system – a natural language interface for accessing structured information, that is domain independent and easy to use without training.
2007
-
describes an ontologically based approach to multi-source, multilingual information extraction.
-
presents a controlled language for ontology editing and a software implementation, based partly on standard NLP tools, for processing that language and manipulating an ontology.
-
proposes a methodology to capture (1) the evolution of metadata induced by changes to the ontologies, and (2) the evolution of the ontology induced by changes to the underlying metadata.
-
describes the development of a system for content mining using domain ontologies, which enables the extraction of relevant information to be fed into models for analysis of financial and operational risk and other business intelligence applications such as company intelligence, by means of the XBRL standard.
-
describes experiments for the cross-document coreference task in SemEval 2007. Our cross-document coreference system uses an in-house agglomerative clustering implementation to group documents referring to the same entity.
-
describes the application of ontology-based extraction and merging in the context of a practical e-business application for the EU MUSING Project where the goal is to gather international company intelligence and country/region information.
-
introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process, by taking into account the relations between concepts.
-
proposes some new evaluation measures based on relations among classification labels, which can be seen as the label relation sensitive version of important measures such as averaged precision and F-measure, and presents the results of applying the new evaluation measures to all submitted runs for the NTCIR-6 F-term patent classification task.
-
describes the algorithms and linguistic features used in our participating system for the opinion analysis pilot task at NTCIR-6.
-
describes our SVM-based system and the techniques we used to adapt the approach for the specifics of the F-term patent classification subtask at NTCIR-6 Patent Retrieval Task.
-
studies Japanese-English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method of correlating linear relationships between two variables in kernel defined feature spaces.
2006
-
(Proceedings of the 5th International Semantic Web Conference (ISWC2006)) In this paper the problem of disambiguating author instances in ontology is addressed. We describe a web-based approach that uses various features such as publication titles, abstract, initials and co-authorship information.
-
‘Semantic Annotation and Human Language Technology’, contribution to ‘Semantic Web Technology: Trends and Research’ (Davies, Studer and Warren, eds.)
-
‘Semantic Information Access’, contribution to ‘Semantic Web Technology: Trends and Research’ (Davies, Studer and Warren, eds.)
-
presents an ontology learning approach that 1) exploits a range of information sources associated with software projects and 2) relies on techniques that are portable across application domains.
-
describes work in progress concerning the application of Controlled Language Information Extraction (CLIE) to a Personal Semantic Wiki (SemperWiki), the goal being to permit users who have no specialist knowledge in ontology tools or languages to semi-automatically annotate their respective personal Wiki pages.
-
studies a machine learning algorithm based on KCCA for cross-language information retrieval. The algorithm is applied to Japanese-English cross-language information retrieval.
-
discusses existing evaluation metrics, and proposes a new method for evaluating the ontology population task, which is general enough to be used in a variety of situations, yet more precise than many current metrics.
-
describes an approach that allows users to create and edit ontologies simply by using a restricted version of the English language. The controlled language described is based on an open vocabulary and a restricted set of grammatical constructs.
-
describes the creation of linguistic analysis and corpus search tools for Sumerian, as part of the development of the ETCSL.
-
proposes an SVM based approach to hierarchical relation extraction, using features derived automatically from a number of GATE-based open-source language processing tools.
2005
-
(Proceedings of Fifth International Conference on Recent Advances in Natural Language Processing (RANLP2005)) It is a full-featured annotation indexing and search engine, developed as a part of GATE. It is powered by Apache Lucene technology and indexes a variety of documents supported by GATE.
-
presents the ONTOSUM system which uses Natural Language Generation (NLG) techniques to produce textual summaries from Semantic Web ontologies.
-
is an overview of the field of Information Extraction for the 2nd Edition of the Encyclopaedia of Language and Linguistics.
-
is an overview of the field of Software Architecture for Language Engineering for the 2nd Edition of the Encyclopaedia of Language and Linguistics.
-
(Euro Interactive Television Conference Paper) A system which can use material from the Internet to augment television news broadcasts.
-
(World Wide Web Conference Paper) The Web is used to assist the annotation and indexing of broadcast news.
-
(Second European Semantic Web Conference Paper) A system that semantically annotates television news broadcasts using news websites as a resource to aid in the annotation process.
-
(Proceedings of Sheffield Machine Learning Workshop) describe an SVM-based IE system which uses the SVM with uneven margins as its learning component and GATE as its NLP processing module.
-
(Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)) uses the uneven margins versions of two popular learning algorithms SVM and Perceptron for IE to deal with the imbalanced classification problems derived from IE.
-
(Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05)) a system for Chinese word segmentation based on Perceptron learning, a simple, fast and effective learning algorithm.
-
(University of Sheffield-Research Memorandum CS-05-10) User-Friendly Ontology Authoring Using a Controlled Language.
-
describes experiments on content selection for producing biographical summaries from multiple documents.
-
(Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005)) Digital Media Preservation and Access through Semantically Enhanced Web-Annotation.
-
(Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005)) Extracting a Domain Ontology from Linguistic Resource Based on Relatedness Measurements.
2004
-
(LREC 2004) describes lexical and ontological resources in GATE used for Natural Language Generation.
-
(JNLE) discusses developments in GATE in the early noughties.
-
(JNLE) is the introduction to the above collection.
-
(JNLE) is a collection of papers covering many important areas of Software Architecture for Language Engineering.
-
(Anaphora Processing) gives a lightweight method for named entity coreference resolution.
-
(Machine Learning Workshop 2004) describes an SVM based learning algorithm for IE using GATE.
-
(LREC 2004) presents algorithms for the automatic induction of gazetteer lists from multi-language data.
-
(ESWS 2004) discusses ontology-based IE in the hTechSight project.
-
(AIMSA 2004) presents automatic creation and monitoring of semantic metadata in a dynamic knowledge portal.
-
describes an approach to mining definitions.
-
describes a sentence extraction system that produces two sorts of multi-document summaries; a general-purpose summary of a cluster of related documents and an entity-based summary of documents related to a particular person.
-
(NLDB 2004) looks at ontology-based IE from parallel texts.
2003
-
(NLPXML-2003) looks at GATE for the semantic web.
-
(Corpus Linguistics 2003) describes GATE as a tool for collaborative corpus annotation.
-
(Technical Report) discusses semantic web technology in the context of multimedia indexing and search.
-
(HLT-NAACL 2003) describes experiments with geographic knowledge for IE.
-
(EACL 2003) looks at the distinction between information and content extraction.
-
(Recent Advances in Natural Language Processing 2003) looks at semantics and named-entity extraction.
-
(ACL Workshop 2003) describes NE extraction without training data on a language you don’t speak (!).
-
(EACL 2003) discusses robust, generic and query-based summarisation.
-
(Data and Knowledge Engineering) discusses multimedia indexing and search from multisource multilingual data.
-
(EACL 2003) discusses event co-reference in the MUMIS project.
-
(HLT-NAACL 2003) presents the OLLIE on-line learning for IE system.
-
(Recent Advances in Natural Language Processing 2003) discusses using parallel texts to improve IE recall.
2002
-
(LREC 2002) report results from the EMILLE Indic languages corpus collection and processing project.
-
(ACL 2002 Workshop) describes how GATE can be used as an environment for teaching NLP, with examples of and ideas for future student projects developed within GATE.
-
(NLIS 2002) discusses how GATE can be used to create HLT modules for use in information systems.
-
[Bontcheva et al. 02c], [Dimitrov 02a] and [Dimitrov 02b]
-
(TALN 2002, DAARC 2002, MSc thesis) describe the shallow named entity coreference modules in GATE: the orthomatcher, which resolves orthographic coreference between names, and the pronoun resolution module.
-
(Computers and the Humanities) describes the philosophy and motivation behind the system, describes GATE version 1 and how well it lived up to its design brief.
-
(ACL 2002) describes the GATE framework and graphical development environment as a tool for robust NLP applications.
-
(DAARC 2002, MSc thesis) discuss lightweight coreference methods.
-
[Lal 02]
-
(Master’s thesis) looks at text summarisation using GATE.
-
(ACL 2002) looks at text summarisation using GATE.
-
(ACL 2002 Summarisation Workshop) describes using GATE to build a portable IE-based summarisation system in the domain of health and safety.
-
(AIMSA 2002) describes the adaptation of the core ANNIE modules within GATE to the ACE (Automatic Content Extraction) tasks.
-
(Nordic Language Technology) describes various Named Entity recognition projects developed at Sheffield using GATE.
-
(JNLE) describes robustness and predictability in LE systems, and presents GATE as an example of a system which contributes to robustness and to low overhead systems development.
-
(LREC 2002) discusses the feasibility of grammar reuse in applications using ANNIE modules.
-
[Saggion et al. 02b] and [Saggion et al. 02a]
-
(LREC 2002, SPLPT 2002) describes how ANNIE modules have been adapted to extract information for indexing multimedia material.
-
(LREC 2002) describes GATE’s enhanced Unicode support.
Older than 2002
-
(RANLP 2001) discusses a project using ANNIE for named-entity recognition across wide varieties of text type and genre.
-
[Bontcheva et al. 00] and [Brugman et al. 99]
-
(COLING 2000, technical report) describe a prototype of GATE version 2 that integrated with the EUDICO multimedia markup tool from the Max Planck Institute.
-
(PhD thesis) defines the field of Software Architecture for Language Engineering, reviews previous work in the area, presents a requirements analysis for such systems (which was used as the basis for designing GATE versions 2 and 3), and evaluates the strengths and weaknesses of GATE version 1.
-
[Cunningham et al. 00a], [Cunningham et al. 98a] and [Peters et al. 98]
-
(OntoLex 2000, LREC 1998) presents GATE’s model of Language Resources, their access and distribution.
-
(LREC 2000) taxonomises Language Engineering components and discusses the requirements analysis for GATE version 2.
-
(COLING 2000, AISB 1999) summarise experiences with GATE version 1.
-
[Cunningham et al. 00d] and [Cunningham 99b]
-
(technical reports) document early versions of JAPE (superseded by the present document).
-
(LREC 2000) discusses experiences in the Svensk project, which used GATE version 1 to develop a reusable toolbox of Swedish language processing components.
-
(technical report) surveys users of GATE up to mid-2000.
-
(Vivek) presents the EMILLE project in the context of which GATE’s Unicode support for Indic languages has been developed.
-
(JNLE) reviewed and synthesised definitions of Language Engineering.
-
(ECAI 1998, NeMLaP 1998) report work on implementing a word sense tagger in GATE version 1.
-
(ANLP 1997) presents motivation for GATE and GATE-like infrastructural systems for Language Engineering.
-
(manual) was the guide to developing CREOLE components for GATE version 1.
-
(TIPSTER) discusses a selection of projects in Sheffield using GATE version 1 and the TIPSTER architecture it implemented.
-
[Cunningham et al. 96c, Cunningham et al. 96d, Cunningham et al. 95]
-
(COLING 1996, AISB Workshop 1996, technical report) report early work on GATE version 1.
-
(manual) was the user guide for GATE version 1.
-
[Gaizauskas et al. 96b, Cunningham et al. 97a, Cunningham et al. 96e]
-
(ICTAI 1996, TIPSTER 1997, NeMLaP 1996) report work on GATE version 1.
-
(manual) describes the language processing components distributed with GATE version 1.
-
(NeMLaP 1994, technical report) argue that software engineering issues such as reuse, and framework construction, are important for language processing R&D.
Chapter 2
Installing and Running GATE
2.1 Downloading GATE
To download GATE point your web browser at http://gate.ac.uk/download/.
2.2 Installing and Running GATE
GATE will run anywhere that supports Java 8 or later, including Linux, Mac OS X and Windows platforms. We don’t run tests on other platforms, but have had reports of successful installs elsewhere.
We recommend using OpenJDK 1.8 (or higher). This is widely available from GNU/Linux package repositories. The AdoptOpenJDK website provides packages for various operating systems, and is particularly suitable for Windows users. Mac users should install the JDK (not just the JRE).
Note that wherever possible you should install the 64 bit version of Java as 32 bit versions can have issues with the amount of memory available for GATE to use.
2.2.1 The Easy Way
The easy way to install is to use the installer (created using the excellent IzPack). Download the installer (.exe for Windows, .jar for other platforms) and follow the instructions it gives you. Once the installation is complete, you can start GATE Developer using gate.exe (Windows) or GATE.app (Mac) in the top-level installation directory; on Linux and other platforms, use gate.sh in the bin directory (see section 2.2.4).
2.2.2 The Hard Way (1)
-
Download and unpack the ZIP distribution, creating a directory containing jar files and scripts.
-
To run GATE Developer:
-
on Windows, use the ‘gate.exe’ file;
-
on UNIX/Linux, use ‘bin/gate.sh’;
-
on Mac use ‘GATE.app’ – if running from a terminal you can keep GATE in the foreground using GATE.app/Contents/MacOS/GATE or bin/gate.sh
-
-
To embed GATE as a library (GATE Embedded), put the JAR files in the lib folder onto your application’s classpath. Alternatively you can use a dependency manager to download GATE and its dependencies from the Central Repository by declaring a dependency on the appropriate version of group ID uk.ac.gate and artifact ID gate-core (see section 2.6.1).
2.2.3 The Hard Way (2): Git
The GATE code is maintained in a set of repositories on GitHub. The main repository for GATE Developer and Embedded is gate-core, and each plugin has its own repository (typically with a name beginning gateplugin-).
All the modules (gate-core and the plugins) are built using Apache Maven version 3.5.2 or later. Clone the appropriate repository, check out the relevant branch (“master” is the latest snapshot version), and build the code using mvn install.
See section 2.6 for more details.
2.2.4 Running GATE Developer on Unix/Linux
The script gate.sh in the directory bin of your installation (or distro/bin if you are building from source) can be used to start GATE Developer. You can run this script by entering its full path in a terminal or by adding the bin directory to your binary path. In addition you can also add a symbolic link to this script in any directory that already is in your binary path.
If gate.sh is invoked without parameters, GATE Developer will use the files ~/.gate.xml and ~/.gate.session to store configuration and session data. Alternatively, you can run gate.sh with the following parameters:
-
-h
-
show usage information
-
-ld
-
create or use the files .gate.session and .gate.xml in the current directory as the session and configuration files. If option -dc DIR occurs before this option, the file .gate.session is created from DIR/default.session if it does not already exist and the file .gate.xml is created from DIR/default.xml if it does not already exist.
-
-ln NAME
-
create or use NAME.session and NAME.xml in the current directory as the session and configuration files. If option -dc DIR occurs before this option, the file NAME.session is created from DIR/default.session if it does not already exist and the file NAME.xml is created from DIR/default.xml if it does not already exist.
-
-ll FILE
-
use the file specified to configure the logback logger of GATE Developer. Note that if this is not an absolute path and the name is identical to logback.xml then the default file on the classpath, ${GATE_HOME}/bin/logback.xml, is still used.
-
-rh LOCATION
-
set the resources home directory to the LOCATION provided. If a resources home location is provided, the URLs in a saved application are saved relative to this location instead of relative to the application state file (see section 3.9.3). This is equivalent to setting the property gate.user.resourceshome to this location.
-
-d URL
-
loads the CREOLE plugin at the given URL during the start-up process.
-
-i FILE
-
uses the specified file as the site configuration.
-
-dc DIR
-
copy default.xml and/or default.session from the directory DIR when creating a new config or session file. This option works only together with either the -ln, -ld or -tmp option and must occur before -ln, -ld or -tmp. An existing config or session file is used, but if it does not exist, the file from the given directory is copied to create the file instead of using an empty/default file.
-
-tmp
-
creates temporary configuration and session files in the current directory, optionally copying default.xml and default.session from the directory specified with a -dc DIR option that occurs before it. After GATE exits, those session and config files are removed.
-
all other parameters
-
are passed on to the java command. This can be used to e.g. set properties using the java option -D. For example to set the maximum amount of heap memory to be used when running GATE to 6000M, you can add -Xmx6000m as a parameter. In order to change the default encoding used by GATE to UTF-8 add -Dfile.encoding=utf-8 as a parameter. To specify a log4j configuration file add something like
-Dlog4j.configuration=file:///home/myuser/log4jconfig.properties.
Running GATE Developer with either the -ld or the -ln option from different directories is useful for keeping several projects separate. It can also be used to run multiple instances of GATE Developer (or even different versions of GATE Developer) in succession or even simultaneously without the configuration files getting mixed up between them.
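As a rough illustration of how the options above combine, the following sketch assembles a gate.sh command line from Java, for instance in a small launcher tool. The class name GateLaunchCommand and the paths in main are hypothetical placeholders; only the option names (-dc, -ld, -ln) and their required ordering come from the list above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds a gate.sh command line from the documented options. */
final class GateLaunchCommand {

    /**
     * @param gateHome    installation directory that contains bin/gate.sh
     * @param defaultsDir directory holding default.xml/default.session, or null for none
     * @param projectName NAME for the -ln option, or null to use -ld instead
     * @param javaArgs    remaining arguments, which gate.sh passes on to the java command
     */
    static List<String> build(Path gateHome, Path defaultsDir,
                              String projectName, List<String> javaArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(gateHome.resolve("bin").resolve("gate.sh").toString());
        // -dc must occur before -ld, -ln or -tmp, so it is always emitted first.
        if (defaultsDir != null) {
            cmd.add("-dc");
            cmd.add(defaultsDir.toString());
        }
        if (projectName != null) {
            cmd.add("-ln");
            cmd.add(projectName);
        } else {
            cmd.add("-ld");
        }
        // Everything else (for example -Xmx6000m or -Dfile.encoding=utf-8) goes to the JVM.
        cmd.addAll(javaArgs);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder paths; adjust them to your own installation.
        List<String> cmd = build(Paths.get("/opt/gate"), Paths.get("/opt/gate-defaults"),
                "projectA", Arrays.asList("-Xmx6000m", "-Dfile.encoding=utf-8"));
        System.out.println(String.join(" ", cmd));
        // To start GATE Developer in a given project directory one could then run:
        // new ProcessBuilder(cmd).directory(projectDir.toFile()).inheritIO().start();
    }
}
```

Because -ld and -ln create their files in the current directory, the working directory of the launched process decides which project's configuration is used.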
2.3 Using System Properties with GATE
During initialisation, GATE reads several Java system properties in order to decide where to find its configuration files.
Here is a list of the properties used, their default values and their meanings:
-
gate.site.config
-
points to the location of the configuration file containing the site-wide options. If not set no site config will be used.
-
gate.user.config
-
points to the file containing the user’s options. If not specified, or if the specified file does not exist at startup time, the default value of gate.xml (.gate.xml on Unix platforms) in the user’s home directory is used.
-
gate.user.session
-
points to the file containing the user’s saved session. If not specified, the default value of gate.session (.gate.session on Unix) in the user’s home directory is used. When starting up GATE Developer, the session is reloaded from this file if it exists, and when exiting GATE Developer the session is saved to this file (unless the user has disabled ‘save session on exit’ in the configuration dialog). The session is not used when using GATE Embedded.
-
gate.user.filechooser.defaultdir
-
sets the default directory to be shown in the file chooser of GATE Developer to the specified directory instead of the user’s operating-system specific default directory.
-
gate.builtin.creole.dir
-
is a URL pointing to the location of GATE’s built-in CREOLE directory. This is the location of the creole.xml file that defines the fundamental GATE resource types, such as documents, document format handlers, controllers and the basic visual resources that make up GATE. The default points to a location inside gate.jar and should not generally need to be overridden.
When using GATE Embedded, you can set the values for these properties before you call Gate.init(). Alternatively, you can set the values programmatically using the static methods setUserConfigFile(), etc. before calling Gate.init(). Note that from version 8.5 onwards, the user config file is ignored by default unless you also call runInSandbox(false) before init. See the Javadoc documentation for details.
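The fallback rule described for gate.user.config can be sketched in plain Java as follows. This is an illustrative model of the documented behaviour, not GATE's own code: the class UserConfigLocator and its method are hypothetical, and the operating system check in main is a simplifying assumption.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical model of the documented lookup rule for the user config file. */
final class UserConfigLocator {

    /**
     * Returns the file named by gate.user.config if that property is set and the
     * file exists; otherwise the default in the home directory
     * (.gate.xml on Unix platforms, gate.xml elsewhere).
     */
    static Path resolve(Properties props, Path home, boolean unix) {
        String configured = props.getProperty("gate.user.config");
        if (configured != null) {
            Path candidate = Paths.get(configured);
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        return home.resolve(unix ? ".gate.xml" : "gate.xml");
    }

    public static void main(String[] args) {
        // Properties must be set before Gate.init() is called in a real application.
        System.setProperty("gate.user.config", "path/to/other-gate.xml");

        Path home = Paths.get(System.getProperty("user.home"));
        // Assumption of this sketch: anything that is not Windows counts as Unix.
        boolean unix = !System.getProperty("os.name").toLowerCase().startsWith("windows");
        System.out.println(resolve(System.getProperties(), home, unix));
    }
}
```

Running main with a path that does not exist prints the default location, which mirrors the "does not exist at startup time" clause above.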
To set these properties when running GATE Developer see the next section.
2.4 Changing GATE’s launch configuration
JVM options for GATE Developer are supplied in the gate.l4j.ini file on all platforms. The gate.l4j.ini file supplied by default with GATE simply sets two standard JVM memory options:
-Xmx1G
-Xms200m
-Xmx specifies the maximum heap size in megabytes (m) or gigabytes (g), and -Xms specifies the initial size.
Note that the format consists of one option per line. All the properties listed in Section 2.3 can be configured here by prefixing them with -D, e.g., -Dgate.user.config=path/to/other-gate.xml.
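For readers who generate this file from a script or installer, here is a minimal sketch of the one-option-per-line layout. The class LaunchOptions is hypothetical; the heap flags and the -D prefix are the ones described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: renders JVM options in the one-option-per-line format. */
final class LaunchOptions {

    /** Builds the lines of an options file: heap settings first, then -D properties. */
    static List<String> lines(String maxHeap, String initialHeap, Map<String, String> props) {
        List<String> out = new ArrayList<>();
        out.add("-Xmx" + maxHeap);       // maximum heap size, e.g. 1G
        out.add("-Xms" + initialHeap);   // initial heap size, e.g. 200m
        for (Map.Entry<String, String> e : props.entrySet()) {
            // Each system property becomes its own -Dname=value line.
            out.add("-D" + e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("gate.user.config", "path/to/other-gate.xml");
        // Prints the content that would be written to the options file.
        System.out.println(String.join(System.lineSeparator(), lines("1G", "200m", props)));
    }
}
```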
Proxy configuration can be set in this file – by default GATE uses the system-wide proxy settings (-Djava.net.useSystemProxies=true) but a specific proxy can be configured by deleting that line and replacing it with settings such as:
-Dhttp.proxyHost=proxy.example.com
-Dhttp.proxyPort=8080
-Dhttp.nonProxyHosts=*.example.com
Consult the Oracle Java Networking and Proxies documentation for further details of proxy configuration in Java, and see section 2.3.
For GATE Embedded, note that Java does not use the system proxy settings by default; you must set java.net.useSystemProxies=true explicitly to enable this.
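A minimal sketch of enabling system proxies in an embedding application is shown below. The class name SystemProxyExample is hypothetical. Since Java reads this property early, the most reliable route is to pass -Djava.net.useSystemProxies=true on the command line; if it is set programmatically, as here, it should happen at the very start of main, before any networking code has run.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

/** Hypothetical example: opt in to the operating system's proxy settings. */
final class SystemProxyExample {

    /** Sets the standard Java property that enables system-wide proxy settings. */
    static void enableSystemProxies() {
        System.setProperty("java.net.useSystemProxies", "true");
    }

    public static void main(String[] args) {
        // Do this first, before GATE is initialised or any connection is opened.
        enableSystemProxies();

        // Ask the JVM which proxies it would use for a given address.
        List<Proxy> proxies = ProxySelector.getDefault().select(URI.create("https://gate.ac.uk/"));
        for (Proxy proxy : proxies) {
            System.out.println(proxy);
        }
    }
}
```

If no system proxy is configured, the list typically contains a single direct-connection entry.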
2.5 Configuring GATE
When GATE Developer is started, or when Gate.init() is called from GATE Embedded (if you have disabled the default “sandbox” mode), GATE loads various sorts of configuration data stored as XML in a file generally called something like gate.xml or .gate.xml in your home directory. This data holds information such as:
-
whether to save settings on exit;
-
whether to save session on exit;
-
what fonts GATE Developer should use;
-
plugins to load at start;
-
colours of the annotations;
-
locations of files for the file chooser;
-
and a lot of other GUI related options.
Configuration data can be set from the GATE Developer GUI via the ‘Options’ menu then ‘Configuration’. The user can change the appearance of the GUI in the ‘Appearance’ tab, which includes the options of font and the ‘look and feel’. The ‘Advanced’ tab enables the user to include annotation features when saving the document and preserving its format, to save the selected Options automatically on exit, and to save the session automatically on exit. These options are all stored in the user’s .gate.xml file.
2.6 Building GATE
Note that you don’t need to build GATE unless you’re doing development on the system itself.
Prerequisites:
-
A conforming Java environment as above.
-
A clone of the relevant Git repository or repositories (see Section 2.2.3).
-
A working installation of Apache Maven version 3.5.2 or newer. It is advisable that you also set your JAVA_HOME environment variable to point to the top-level directory of your Java installation.
-
An appreciation of natural beauty.
To build gate-core, cd to where you cloned gate-core and:
-
Type:
mvn install
-
[optional] To make the Javadoc documentation:
mvn site
In order to be able to run the GATE Developer you just built, you will also need to cd into the distro folder and run mvn compile in there, in order to create the classpath file that the GATE Developer launcher uses to find the JARs.
To build plugins cd into the plugin you just cloned and run mvn install. This will build the plugin and place it in your local Maven cache, from where GATE Developer will be able to resolve it at runtime.
Note if you are building a version of a plugin that depends on a SNAPSHOT version of gate-core then you will need to add some configuration to your Maven settings.xml file, as described in the gate-core README file.
2.6.1 Using GATE with Maven/Ivy
This section is based on contributions by Marin Nozhchev (Ontotext) and Benson Margulies (Basis Technology Corp).
Stable releases of GATE (since 5.2.1) are available in the standard central Maven repository, with group ID “uk.ac.gate” and artifact ID “gate-core”. To use GATE in a Maven-based project you can simply add a dependency:
<dependency>
  <groupId>uk.ac.gate</groupId>
  <artifactId>gate-core</artifactId>
  <version>8.5</version>
</dependency>
Similarly, with a project that uses Ivy for dependency management:
<dependency org="uk.ac.gate" name="gate-core" rev="8.5"/>
You do not need to do anything to allow GATE to access its plugins; it will fetch them at runtime from the internet when they are loaded.
Nightly snapshot builds of gate-core are available from our own Maven repository at http://repo.gate.ac.uk/content/groups/public.
2.7 Uninstalling GATE
If you have used the installer, run:
java -jar uninstaller.jar
or just delete the whole of the installation directory (the one containing bin, lib, Uninstaller, etc.). The installer doesn’t install anything outside this directory, but for completeness you might also want to delete the settings files GATE creates in your home directory (.gate.xml and .gate.session).
2.8 Troubleshooting
See the FAQ on the GATE Wiki for frequent questions about running and using GATE.
Chapter 3
Using GATE Developer
This chapter introduces GATE Developer, which is the GATE graphical user interface. It is analogous to systems like Mathematica for mathematicians, or Eclipse for Java programmers, providing a convenient graphical environment for research and development of language processing software. As well as being a powerful research tool in its own right, it is also very useful in conjunction with GATE Embedded (the GATE API by which GATE functionality can be included in your own applications); for example, GATE Developer can be used to create applications that can then be embedded via the API. This chapter describes how to complete common tasks using GATE Developer. It is intended to provide a good entry point to GATE functionality, and so explanations are given assuming only basic knowledge of GATE. However, probably the best way to learn how to use GATE Developer is to use this chapter in conjunction with the demonstration and tutorial movies. There are specific links to them throughout the chapter. There is also a complete new set of video tutorials here.
The basic business of GATE is annotating documents, and all the functionality we will introduce relates to that. Core concepts are:
-
the documents to be annotated,
-
corpora comprising sets of documents, grouping documents for the purpose of running uniform processes across them,
-
annotations that are created on documents,
-
annotation types such as ‘Name’ or ‘Date’,
-
annotation sets comprising groups of annotations,
-
processing resources that manipulate and create annotations on documents, and
-
applications, comprising sequences of processing resources, that can be applied to a document or corpus.
What is considered to be the end result of the process varies depending on the task, but for the purposes of this chapter, output takes the form of the annotated document/corpus. Researchers might be more interested in figures demonstrating how successfully their application compares to a ‘gold standard’ annotation set; Chapter 10 in Part II will cover ways of comparing annotation sets to each other and obtaining measures such as F1. Implementers might be more interested in using the annotations programmatically; Chapter 7, also in Part II, talks about working with annotations from GATE Embedded. For the purposes of this chapter, however, we will focus only on creating the annotated documents themselves, and creating GATE applications for future use.
GATE includes a complete information extraction system that you are free to use, called ANNIE (a Nearly-New Information Extraction System). Many users find this is a good starting point for their own application, and so we will cover it in this chapter. Chapter 6 talks in a lot more detail about the inner workings of ANNIE, but we aim to get you started using ANNIE from inside of GATE Developer in this chapter.
We start the chapter with an exploration of the GATE Developer GUI, in Section 3.1. We describe how to create documents (Section 3.2) and corpora (Section 3.3). We talk about viewing and manually creating annotations (Section 3.4).
We then talk about loading the plugins that contain the processing resources you will use to construct your application, in Section 3.5. We then talk about instantiating processing resources (Section 3.7). Section 3.8 covers applications, including using ANNIE (Section 3.8.3). Saving applications and language resources (documents and corpora) is covered in Section 3.9. We conclude with a few assorted topics that might be useful to the GATE Developer user, in Section 3.11.
3.1 The GATE Developer Main Window
Figure 3.1 shows the main window of GATE Developer, as you will see it when you first run it. There are five main areas:
-
at the top, the menu bar and toolbar, with the menus ‘File’, ‘Options’, ‘Tools’, ‘Help’ and icons for the most frequently used actions;
-
on the left side, a tree starting from ‘GATE’ and containing ‘Applications’, ‘Language Resources’ etc. – this is the resources tree;
-
in the bottom left corner, a rectangle, which is the small resource viewer;
-
in the center, containing tabs with ‘Messages’ or the name of a resource from the resources tree, the main resource viewer;
-
at the bottom, the messages bar.
The menu and the messages bar do the usual things. Longer messages are displayed in the messages tab in the main resource viewer area.
The resource tree and resource viewer areas work together to allow the system to display diverse resources in various ways. The many resources integrated with GATE can have either a small view, a large view, or both.
At any time, the main viewer can also be used to display other information, such as messages, by clicking on the appropriate tab at the top of the main window. If an error occurs in processing, the messages tab will flash red, and an additional popup error message may also occur.
In the options dialogue from the Options menu you can choose if you want to link the selection in the resources tree and the selected main view.
3.2 Loading and Viewing Documents
If you right-click on ‘Language Resources’ in the resources pane, select ‘New’ then ‘GATE Document’, the window ‘Parameters for the new GATE Document’ will appear as shown in figure 3.2. Here, you can specify the GATE document to be created. Required parameters are indicated with a tick. The name of the document will be created for you if you do not specify it. Enter the URL of your document or use the file browser to indicate the file you wish to use for your document source. For example, you might use ‘http://gate.ac.uk’, or browse to a text or XML file you have on disk. Click on ‘OK’ and a GATE document will be created from the source you specified.
See also the movie for creating documents.
The document editor is contained in the central tabbed pane in GATE Developer. Double-click on your document in the resources pane to view the document editor.
The document editor consists of a top panel with buttons and icons that control the display of different views and the search box. Initially, you will see just the text of your document, as shown in figure 3.3. Click on ‘Annotation Sets’ and ‘Annotations List’ to view the annotation sets to the right and the annotations list at the bottom.
You will see a view similar to figure 3.4. In place of the annotations list, you can also choose to see the annotations stack. In place of the annotation sets, you can also choose to view the co-reference editor. More information about this functionality is given in Section 3.4.
Several options can be set from the small triangle icon at the top right corner.
With ‘Save Current Layout’ you store the way the different views are shown and the annotation types highlighted in the document. Then if you set ‘Restore Layout Automatically’ you will get the same views and annotation types each time you open a document. The layout is saved to the user preferences file, gate.xml. This means that you can give this file to a new user, who will then have a preconfigured document editor.
Another setting makes the document editor ‘Read-only’. If enabled, you won’t be able to edit the text but you will still be able to edit annotations. This is useful for avoiding accidental changes to the original text.
The option ‘Right To Left Orientation’ is useful for changing the orientation of the text for languages such as Arabic and Urdu. Selecting this option changes the orientation of the text of the currently visible document.
Finally you can choose between ‘Insert Append’ and ‘Insert Prepend’. That setting is only relevant when you’re inserting text at the very border of an annotation.
If you place the cursor at the start of an annotation, in one case the newly entered text will become part of the annotation, in the other case it will stay outside. If you place the cursor at the end of an annotation, the opposite will happen.
Let us use this sentence: ‘This is an [annotation].’ with the square brackets [] denoting the boundaries of the annotation. If we insert an ‘x’ just before the ‘a’ or just after the final ‘n’ of ‘annotation’, here’s what we get:
Append
-
This is an x[annotation].
-
This is an [annotationx].
Prepend
-
This is an [xannotation].
-
This is an [annotation]x.
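The four cases listed above can be reproduced with a small offset model. This is only a sketch of the behaviour as described, not the document editor's implementation: the class InsertModeDemo is hypothetical, annotations are modelled as a non-empty range of character offsets, and the start offset is inclusive while the end offset is exclusive.

```java
/** Hypothetical model of the ‘Insert Append’ and ‘Insert Prepend’ settings. */
final class InsertModeDemo {

    enum Mode { APPEND, PREPEND }

    /** Returns {newStart, newEnd} for an annotation after len characters are inserted at pos. */
    static int[] adjust(int start, int end, int pos, int len, Mode mode) {
        if (pos < start || (pos == start && mode == Mode.APPEND)) {
            // The new text lands before the annotation, so the whole span shifts right.
            return new int[] { start + len, end + len };
        }
        if (pos < end || (pos == end && mode == Mode.APPEND)) {
            // The new text lands inside the annotation, so only the end moves.
            return new int[] { start, end + len };
        }
        // The new text lands after the annotation, which is left untouched.
        return new int[] { start, end };
    }

    /** Inserts ins at pos and renders the result with [ ] around the adjusted annotation. */
    static String insert(String text, int start, int end, int pos, String ins, Mode mode) {
        int[] span = adjust(start, end, pos, ins.length(), mode);
        String edited = text.substring(0, pos) + ins + text.substring(pos);
        return edited.substring(0, span[0]) + "["
                + edited.substring(span[0], span[1]) + "]"
                + edited.substring(span[1]);
    }

    public static void main(String[] args) {
        String text = "This is an annotation.";
        int start = 11, end = 21; // offsets of the word "annotation"
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " at start: " + insert(text, start, end, start, "x", mode));
            System.out.println(mode + " at end:   " + insert(text, start, end, end, "x", mode));
        }
    }
}
```

The two modes differ only when the cursor sits exactly on an annotation boundary; anywhere else the result is the same.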
Text in a loaded document can be edited in the document viewer. The usual platform specific cut, copy and paste keyboard shortcuts should also work, depending on your operating system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a magnifying glass, at the top of the document editor is for searching in the document. To prevent the new annotation windows popping up when a piece of text is selected, hold down the CTRL key. Alternatively, you can hide the annotation sets view by clicking on its button at the top of the document view; this will also cause the highlighted portions of the text to become un-highlighted.
See also Section 20.2.3 for the compound document editor.
3.3 Creating and Viewing Corpora
You can create a new corpus in a similar manner to creating a new document; simply right-click on ‘Language Resources’ in the resources pane, select ‘New’ then ‘GATE corpus’. A brief dialogue box will appear in which you can optionally give a name for your corpus (if you leave this blank, a corpus name will be created for you) and optionally add documents to the corpus from those already loaded into GATE.
There are three ways of adding documents to a corpus:
-
When creating the corpus, clicking on the icon next to the “documentsList” input field brings up a popup window with a list of the documents already loaded into GATE Developer. This enables the user to add any documents to the corpus.
-
Alternatively, the corpus can be loaded first, and documents added later by double clicking on the corpus and using the + and - icons to add or remove documents to the corpus. Note that the documents must have been loaded into GATE Developer before they can be added to the corpus.
-
Once loaded, the corpus can be populated by right clicking on the corpus and selecting ‘Populate’. With this method, documents do not have to have been previously loaded into GATE Developer, as they will be loaded during the population process. If you right-click on your corpus in the resources pane, you will see that you have the option to ‘Populate’ the corpus. If you select this option, you will see a dialogue box in which you can specify a directory in which GATE will search for documents. You can specify the extensions allowable; for example, XML or TXT. This will restrict the corpus population to only those documents with the extensions you wish to load. You can choose whether to recurse through the directories contained within the target directory or restrict the population to those documents contained in the top level directory. Click on ‘OK’ to populate your corpus. This option provides a quick way to create a GATE Corpus from a directory of documents.
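The selection rule behind ‘Populate’ (a target directory, a set of allowed extensions, and a choice of whether to recurse) can be sketched with the standard Java file API. The class PopulateSelection is hypothetical and only models which files would be picked up; treating extensions case-insensitively is an assumption of this sketch. GATE Embedded exposes a comparable operation through the populate method of Corpus; consult the Javadoc for its exact parameters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical model of choosing the files used to populate a corpus. */
final class PopulateSelection {

    /** Lower-cased extension of the file name, or "" if there is none. */
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Decides whether a path, given relative to the target directory, is selected. */
    static boolean accepts(Path relative, Set<String> extensions, boolean recurse) {
        boolean topLevel = relative.getNameCount() == 1;
        return (recurse || topLevel) && extensions.contains(extensionOf(relative));
    }

    /** Lists the matching regular files under dir, in sorted order. */
    static List<Path> select(Path dir, Set<String> extensions, boolean recurse) throws IOException {
        int depth = recurse ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(dir, depth)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> accepts(dir.relativize(p), extensions, recurse))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        Set<String> extensions = new HashSet<>(Arrays.asList("xml", "txt"));
        for (Path file : select(dir, extensions, true)) {
            System.out.println(file);
        }
    }
}
```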
Additionally, right-clicking on a loaded document in the tree and selecting the ‘New corpus with this document’ option creates a new transient corpus, named ‘Corpus for’ followed by the document name, containing just this document.
See also the movie for creating and populating corpora.
Double click on your corpus in the resources pane to see the corpus editor, shown in figure 3.5. You will see a list of the documents contained within the corpus.
In the top left of the corpus editor, plus and minus buttons allow you to add documents to the corpus from those already loaded into GATE and remove documents from the corpus (note that removing a document from a corpus does not remove it from GATE).
Up and down arrows at the top of the view allow you to reorder the documents in the corpus. The rightmost button in the view opens the currently selected document in a document editor.
At the bottom, you will see that tabs entitled ‘Initialisation Parameters’ and ‘Corpus Quality Assurance’ are also available in addition to the corpus editor tab you are currently looking at. Clicking on the ‘Initialisation Parameters’ tab allows you to view the initialisation parameters for the corpus. The ‘Corpus Quality Assurance’ tab allows you to calculate agreement measures between the annotations in your corpus. Agreement measures are discussed in depth in Chapter 10. The use of corpus quality assurance is discussed in Section 10.3.
3.4 Working with Annotations
In this section, we will talk in more detail about viewing annotations, as well as creating and editing them manually. As discussed at the start of the chapter, the main purpose of GATE is annotating documents. Whilst applications can be used to annotate the documents entirely automatically, annotation can also be done manually, e.g. by the user, or semi-automatically, by running an application over the corpus and then correcting/adding new annotations manually. Section 3.4.5 focuses on manual annotation. In Section 3.7 we talk about running processing resources on our documents. We begin by outlining the functionality around viewing annotations, organised by the GUI area to which the functionality pertains.
3.4.1 The Annotation Sets View
To view the annotation sets, click on the ‘Annotation Sets’ button at the top of the document editor, or use the F3 key (see Section 3.10 for more keyboard shortcuts). This will bring up the annotation sets viewer, which displays the annotation sets available and their corresponding annotation types.
The annotation sets view is displayed on the right part of the document editor. It’s a tree-like view with a root for each annotation set. The first annotation set in the list is always a nameless set. This is the default annotation set. You can see in figure 3.4 that there is a drop-down arrow with no name beside it. Other annotation sets on the document shown in figure 3.4 are ‘Key’ and ‘Original markups’. Because the document is an XML document, the original XML markup is retained in the form of an annotation set. This annotation set is expanded, and you can see that there are annotations for ‘TEXT’, ‘body’, ‘font’, ‘html’, ‘p’, ‘table’, ‘td’ and ‘tr’.
To display all the annotations of one type, tick its checkbox or use the space key. The text segments corresponding to these annotations will be highlighted in the main text window. To delete an annotation type, use the delete key. To change the color, use the enter key. There is a context menu for all these actions that you can display by right-clicking on one annotation type, a selection or an annotation set.
If you keep the shift key pressed when you open the annotation sets view, GATE Developer will try to select any annotations that were selected in the previous document viewed (if any); otherwise no annotation will be selected.
Having selected an annotation type in the annotation sets view, hovering over an annotation in the main resource viewer or right-clicking on it will bring up a popup box containing a list of the annotations associated with it, from which one can select an annotation to view in the annotation editor, or if there is only one, the annotation editor for that annotation. Figure 3.6 shows the annotation editor.
3.4.2 The Annotations List View [#]
To view the list of annotations and their features, click on the ‘Annotations list’ button at the top of the main window or use the F4 key. The annotation list view will appear below the main text. It will only contain the annotations selected from the annotation sets view. These lists can be sorted in ascending and descending order for any column, by clicking on the corresponding column heading. You can also hide a column using the context menu, which is displayed by right-clicking on the column headings. Selecting rows in the table will make the respective annotations blink in the document. Right-click on a row or selection in this view to delete or edit an annotation. The Delete key is a shortcut to delete selected annotations.
3.4.3 The Annotations Stack View [#]
This view is similar to the ANNIC view described in section 9.2. It displays annotations at the document caret position with some context before and after. The annotations are stacked from top to bottom, which gives a clear view when they are overlapping.
As the view is centred on the document caret, you can use the conventional keys to move it and update the view: notably the left and right keys to skip one letter; control + left/right to skip one word; and up and down to go one line up or down. To move further, use the document scrollbar and then click in the document.
There are two buttons at the top of the view that centre the view on the closest previous/next annotation boundary among all displayed. This is useful when you want to skip a region without annotation or when you want to reach the beginning or end of a very long annotation.
The annotation types displayed correspond to those selected in the annotation sets view. You can display feature values for an annotation rectangle by hovering the mouse on it or select only one feature to display by double-clicking on the annotation type in the first column.
Right-click on an annotation in the annotations stack view to edit it. Control-Shift-click to delete it. Double-click to copy it to another annotation set. Control-click on a feature value that contains a URL to display it in your browser.
All of these mouse shortcuts make it easier to create a gold standard annotation set.
3.4.4 The Co-reference Editor [#]
The co-reference editor allows co-reference chains (see Section 6.9) to be displayed and edited in GATE Developer. To display the co-reference editor, first open a document in GATE Developer, and then click on the Co-reference Editor button in the document viewer.
The combo box at the top of the co-reference editor allows you to choose which annotation set to display co-references for. If an annotation set contains no co-reference data, then the tree below the combo box will just show ‘Coreference Data’ and the name of the annotation set. However, when co-reference data does exist, a list of all the co-reference chains that are based on annotations in the currently selected set is displayed. The name of each co-reference chain in this list is the same as the text of whichever element in the chain is the longest. It is possible to highlight all the member annotations of any chain by selecting it in the list.
When a co-reference chain is selected, if the mouse is placed over one of its member annotations, then a pop-up box appears, giving the user the option of deleting the item from the chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it will be removed from the list of chains. If the name of the chain was derived from the item that was deleted, then the chain will be given a new name based on the next longest item in the chain.
A combo box near the top of the co-reference editor allows the user to select an annotation type from the current set. When the Show button is selected all the annotations of the selected type will be highlighted. Now when the mouse pointer is placed over one of those annotations, a pop-up box will appear giving the user the option of adding the annotation to a co-reference chain. The annotation can be added to an existing chain by typing the name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if the user presses the down cursor key, a list of all the existing annotations appears, together with the option [New Chain]. Selecting the [New Chain] option will cause a new chain to be created containing the selected annotation as its only element.
Each annotation can only be added to a single chain, but annotations of different types can be added to the same chain, and the same text can appear in more than one chain if it is referenced by two or more annotations.
The movie for inspecting results is also useful for learning about viewing annotations.
3.4.5 Creating and Editing Annotations [#]
To create annotations manually, select the text you want to annotate and hover the mouse over the selection or press control+E. A popup will appear, allowing you to create an annotation, as shown in figure 3.9.
The type of the annotation, by default, will be the same as the last annotation you created, unless there is none, in which case it will be ‘_New_’. You can enter any annotation type name you wish in the text box, unless you are using schema-driven annotation (see Section 3.4.6). You can add or change features and their values in the table below.
To delete an annotation, click on the red X icon at the top of the popup window. To grow or shrink the span of the annotation at its start, use the two arrow icons on the left, or the right and left keys. To change the annotation end, use the next two arrow icons to the right, or the alt+right and alt+left keys. Add the shift or control+shift keys to make the span increment bigger.
The pin icon is to pin the window so that it remains where it is. If you drag and drop the window, this automatically pins it too. Pinning it means that even if you select another annotation (by hovering over it in the main resource viewer) it will still stay in the same position.
The popup menu only contains annotation types present in the Annotation Schema and those already listed in the relevant Annotation Set. To create a new Annotation Schema, see Section 3.4.6. The popup menu can be edited to add a new annotation type, however.
The new annotation created will automatically be placed in the annotation set that has been selected (highlighted) by the user. To create a new annotation set, type the name of the new set to be created in the box below the list of annotation sets, and click on ‘New’.
Figure 3.10 demonstrates adding an ‘Organization’ annotation for the string ‘EPSRC’ (highlighted in green) to the default annotation set (blank name in the annotation set view on the right) and a feature name ‘type’ with a value about to be added.
To add a second annotation to a selected piece of text, or to add an overlapping annotation to an existing one, press the CTRL key to avoid the existing annotation popup appearing, and then select the text and create the new annotation. Again by default the last annotation type to have been used will be displayed; change this to the new annotation type. When a piece of text has more than one annotation associated with it, on mouseover all the annotations will be displayed. Selecting one of them will bring up the relevant annotation popup.
To search and annotate the document automatically, use the search and annotate function as shown in figure 3.11:
- Create and/or select an annotation to be used as a model to annotate.
- Open the panel at the bottom of the annotation editor window.
- Change the expression to search if necessary.
- Use the [First] button or Enter key to select the first expression to annotate.
- Use the [Annotate] button if the selection is correct, otherwise the [Next] button. After a few cycles of [Annotate] and [Next], use the [Ann. all next] button.
Note that after using the [First] button you can move the caret in the document and use the [Next] button to avoid continuing the search from the beginning of the document. The [?] button at the end of the search text field will help you to build powerful regular expressions to search.
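The behaviour of [Ann. all next] can be pictured with a small standalone sketch. This is not GATE's own implementation: the `Span` class and the `annotateAllNext` method are hypothetical names used only for illustration, and the sketch assumes that the search expression is an ordinary Java regular expression applied from the caret offset onwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an annotation: a type plus start/end character offsets. */
final class Span {
    final String type;
    final int start;
    final int end;

    Span(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

class SearchAndAnnotate {
    /** Collects one span per regex match that starts at or after the caret offset. */
    static List<Span> annotateAllNext(String text, String regex, String type, int caret) {
        List<Span> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        int from = caret;
        while (from <= text.length() && m.find(from)) {
            spans.add(new Span(type, m.start(), m.end()));
            // Step past empty matches so that the loop always advances.
            from = (m.end() > m.start()) ? m.end() : m.end() + 1;
        }
        return spans;
    }
}
```

Passing a caret offset other than 0 mirrors the note above: the search continues from the caret rather than from the beginning of the document.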
3.4.6 Schema-Driven Editing [#]
Annotation schemas allow annotation types and features to be pre-specified, so that during manual annotation, the relevant options appear on the drop-down lists in the annotation editor. You can see some example annotation schemas in Section 5.4.1. Annotation schemas provide a means to define types of annotations in GATE Developer. Basically this means that GATE Developer ‘knows about’ annotations defined in a schema. Annotation schemas are supported by the ‘Annotation schema’ language resource, which is one of the default LR types (along with corpus and document) available in GATE without the need to load any plugins.
To load an annotation schema into GATE Developer, right-click on ‘Language Resources’ in the resources pane. Select ‘New’ then ‘Annotation schema’. A popup box will appear in which you can browse to your annotation schema XML file. A default set of annotation schemas for common annotation types including Person, Organization and Location is provided in the ANNIE plugin, and can be loaded by creating an Annotation schema LR from the file plugins/ANNIE/resources/schema/ANNIE-Schemas.xml in the GATE distribution. You can also define your own schemas to tell GATE Developer about other kinds of annotations you frequently use. Each schema file can define only one annotation type, but you can have a master file which includes others, in order to load a group of schemas in one operation. The ANNIE schemas provide an example of this technique.
By default GATE Developer will allow you to create any annotations in a document, whether or not there is a schema to describe them. An alternative annotation editor component is available which constrains the available annotation types and features much more tightly, based on the annotation schemas that are currently loaded. This is particularly useful when annotating large quantities of data or for use by less skilled users.
To use this, you must load the Schema_Annotation_Editor plugin. With this plugin loaded, the annotation editor will only offer the annotation types permitted by the currently loaded set of schemas, and when you select an annotation type only the features permitted by the schema are available to edit. Where a feature is declared as having an enumerated type, the available enumeration values are presented as an array of buttons, making it easy to select the required value quickly.
3.4.7 Printing Text with Annotations [#]
We suggest you use your browser to print a document, as GATE does not currently provide a printing facility.
First save your document by right-clicking on it in the resources tree on the left and choosing ‘Save Preserving Format’. You will get an XML file with all the annotations highlighted as XML tags plus the ‘Original markups’ annotation set.
It’s possible that the output will not have an XML header and footer because the document was created from a plain text document. In that case you can use the XHTML example below.
Then add a stylesheet processing instruction at the beginning of the XML file, the second line in the following minimalist XHTML document:
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/css" href="gate.css"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Content of the document</p>
    ...
  </body>
</html>
And create a file ‘gate.css’ in the same directory:
BODY, body { margin: 2em } /* or any other first level tag */
P, p { display: block } /* or any other paragraph tag */

/* ANNIE tags but you can use whatever tags you want */
/* be careful that XML tags are case sensitive */
Date         { background-color: rgb(230, 150, 150) }
FirstPerson  { background-color: rgb(150, 230, 150) }
Identifier   { background-color: rgb(150, 150, 230) }
JobTitle     { background-color: rgb(150, 230, 230) }
Location     { background-color: rgb(230, 150, 230) }
Money        { background-color: rgb(230, 230, 150) }
Organization { background-color: rgb(230, 200, 200) }
Percent      { background-color: rgb(200, 230, 200) }
Person       { background-color: rgb(200, 200, 230) }
Title        { background-color: rgb(200, 230, 230) }
Unknown      { background-color: rgb(230, 200, 230) }
Etc          { background-color: rgb(230, 230, 200) }

/* The next block is an example for having a small tag
   with the name of the annotation type after each annotation */
Date:after {
  content: "Date";
  font-size: 50%;
  vertical-align: sub;
  color: rgb(100, 100, 100);
}
Finally open the XML file in your browser and print it.
Note that overlapping annotations cannot be expressed correctly with inline XML tags and thus won’t be displayed correctly.
3.5 Using CREOLE Plugins [#]
In GATE, processing resources are used to automatically create and manipulate annotations on documents. We will talk about processing resources in the next section. However, we must first introduce CREOLE plugins. In most cases, in order to use a particular processing resource (and certain language resources) you must first load the CREOLE plugin that contains it. This section talks about using CREOLE plugins. Then, in Section 3.7, we will talk about creating and using processing resources.
The definitions of CREOLE resources (e.g. processing resources such as taggers and parsers, see Chapter 4) are stored in the Maven Central repository.
Plugins can have one or more of the following states in relation to GATE:
- known plugins are those plugins that the system knows about. These include: (1) the default plugins provided by the GATE team; (2) plugins added manually by the user via their Maven artifact ID; and (3) plugins installed in the user’s own plugin directory.
- loaded plugins are the plugins currently loaded in the system. All CREOLE resource types from the loaded plugins are available for use. All known plugins can easily be loaded and unloaded using the user interface.
- auto-loadable plugins are the plugins that the system loads automatically during initialisation. This list can be configured via the load.plugin.path system property.
As hinted at above, plugins can be loaded from numerous sources:
- core plugins are distributed by the GATE team via the Maven Central repository.
- maven plugins are distributed by other parties via the Maven Central repository.
- user plugins are plugins that have been installed by the user into their personal plugins folder. The location of this folder can be set either through the configuration tab of the CREOLE manager interface or via the gate.user.plugins system property.
- remote plugins are plugins which are loaded via http from a remote machine.
Regular Maven users may have additional repositories or “mirror” settings configured in their .m2/settings.xml file. GATE will respect these settings when retrieving plugins, including authentication with encrypted passwords in the standard Maven way with a master password in .m2/settings-security.xml. In particular, if you have an organisational Maven repository configured as a <mirrorOf>external:*</mirrorOf> then this will be respected and GATE will not attempt to use the Central Repository directly.
The CREOLE plugins can be managed through the graphical user interface which can be activated by selecting ‘Manage CREOLE Plugins’ from the ‘File’ menu. This will bring up a window listing all the known plugins. For each plugin there are two check-boxes – one labelled ‘Load Now’, which will load the plugin, and the other labelled ‘Load Always’ which will add the plugin to the list of auto-loadable plugins. A ‘Delete’ button is also provided – which will remove the plugin from the list of known plugins. This operation does not delete the actual plugin directory. Installed plugins are found automatically when GATE is started; if an installed plugin is deleted from the list, it will re-appear next time GATE is launched.
If you select a plugin, you will see in the pane on the right the list of resources that plugin contains. For example, in figure 3.12, the ‘ANNIE’ plugin is selected, and you can see that it contains 17 resources. If you wish to use a particular resource you will have to ascertain which plugin contains it. This list can be useful for that.
Having loaded the plugins you need, the resources they define will be available for use. Typically, to the GATE Developer user, this means that they will appear on the ‘New’ menu when you right-click on ‘Processing Resources’ in the resources pane, although some special plugins have different effects; for example, the Schema_Annotation_Editor (see Section 3.4.6).
Some plugins also contain files which are used to configure the resources. For example, the ANNIE plugin contains the resources for the ANNIE Gazetteer and the ANNIE NE Transducer (amongst other things). While often these files can be used straight from within the plugin, it can be useful to edit them, either to add missing information or as a starting point for developing new resources. To extract a copy of these resource files from a plugin, simply select it in the plugin manager and then click the download resources button shown under the list of resources the plugin defines. This button will only be enabled for plugins which contain such files. After clicking the button you will be asked to select a directory into which to copy the files. You can then edit the files as needed before using them to configure a new instance of the appropriate processing resource.
3.6 Installing and updating CREOLE Plugins [#]
While GATE is distributed with a number of core plugins (see Part III) there are many more plugins developed and made available by other GATE users. Some of these additional plugins can easily be installed into your local copy of GATE through the CREOLE plugin manager.
Installing new plugins is simply a case of checking the box and clicking ‘Apply All’. Note that plugins are installed into the user plugins directory, which must have been correctly configured before you can try installing new plugins.
Once a plugin is installed it will appear in the list of ‘Installed Plugins’ and can be loaded in the same way as any other CREOLE plugin (see Section 3.7). If a new version of a plugin you have installed becomes available the new version will be offered as an update. These updates can be installed in the same way as a new plugin.
To register a new plugin, click the ‘+’ button located at the top right corner of the Plugin Manager. You can then either provide the Maven Group and Artifact ID for Maven plugins, or provide the directory URL for local or remote plugins.
3.7 Loading and Using Processing Resources [#]
This section describes how to load and run CREOLE resources not present in ANNIE. To load ANNIE, see Section 3.8.3. For technical descriptions of these resources, see the appropriate chapter in Part III (e.g. Chapter 23). First ensure that the necessary plugins have been loaded (see Section 3.5). If the resource you require does not appear in the list of Processing Resources, then you probably do not have the necessary plugin loaded. Processing resources are loaded by selecting them from the set of Processing Resources: right click on Processing Resources or select ‘New Processing Resource’ from the File menu.
For example, use the Plugin Console Manager to load the ‘Tools’ plugin. When you right click on ‘Processing Resources’ in the resources pane and select ‘New’ you have the option to create any of the processing resources that plugin provides. You may choose to create a ‘GATE Morphological Analyser’, with the default parameters. Having done this, an instance of the GATE Morphological Analyser appears under ‘Processing Resources’. This processing resource, or PR, is now available to use. Double-clicking on it in the resources pane reveals its initialisation parameters, see figure 3.14.
This processing resource is now available to be added to applications. It must be added to an application before it can be applied to documents. You may create as many of a particular processing resource as you wish, for example with different initialisation parameters. Section 3.8 talks about creating and running applications.
See also the movie for loading processing resources.
3.8 Creating and Running an Application [#]
Once all the resources you need have been loaded, an application can be created from them, and run on your corpus. Right click on ‘Applications’ and select ‘New’ and then either ‘Corpus Pipeline’ or ‘Pipeline’. A pipeline application can only be run over a single document, while a corpus pipeline can be run over a whole corpus.
To build the pipeline, double click on it, and select the resources needed to run the application (you may not necessarily wish to use all those which have been loaded).
Transfer the necessary components from the set of ‘loaded components’ displayed on the left hand side of the main window to the set of ‘selected components’ on the right, by selecting each component and clicking on the left and right arrows, or by double-clicking on each component.
Ensure that the components selected are listed in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the left side of the pane.
Ensure that any parameters necessary are set for each processing resource (by clicking on the resource from the list of selected resources and checking the relevant parameters from the pane below). For example, if you wish to use annotation sets other than the Default one, these must be defined for each processing resource.
Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-down menu beside the ‘corpus’ box. If a pipeline is used, the document must be selected for each processing resource used.
Finally, click on ‘Run’ to run the application on the document or corpus.
See also the movie for loading and running processing resources.
For how to use the conditional versions of the pipelines see Section 3.8.2 and for saving/restoring the configuration of an application see Section 3.9.3.
3.8.1 Running an Application on a Datastore [#]
To avoid loading all your documents at the same time you can run an application on a datastore corpus.
To do this you need to load your datastore, see section 3.9.2, and to load the corpus from the datastore by double clicking on it in the datastore viewer.
Then, in the application viewer, you need to select this corpus in the drop down list of corpora.
When you run the application on the datastore corpus, each document will be loaded, processed, saved and then unloaded, so at any time only one document from the datastore corpus will be loaded. This prevents memory shortage but is a little slower than if all your documents were already loaded.
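The load, process, save, unload cycle can be sketched in plain Java as follows. This is only an illustration of the control flow, not GATE code: `ToyDatastore` and `DatastoreRun` are hypothetical classes, documents are reduced to strings, and the ‘application’ is any function from text to text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a datastore holding document texts by name. */
class ToyDatastore {
    private final Map<String, String> saved = new LinkedHashMap<>();
    private final Set<String> loaded = new HashSet<>();
    int peakLoaded = 0; // highest number of documents held in memory at once

    void put(String name, String text) { saved.put(name, text); }
    List<String> names() { return new ArrayList<>(saved.keySet()); }
    String savedText(String name) { return saved.get(name); }

    String load(String name) {
        loaded.add(name);
        peakLoaded = Math.max(peakLoaded, loaded.size());
        return saved.get(name);
    }

    void save(String name, String text) { saved.put(name, text); }
    void unload(String name) { loaded.remove(name); }
}

class DatastoreRun {
    /** Applies the application to each stored document in turn: load, process, save, unload. */
    static void run(ToyDatastore store, UnaryOperator<String> application) {
        for (String name : store.names()) {
            String text = store.load(name);
            try {
                store.save(name, application.apply(text));
            } finally {
                store.unload(name); // released even if processing fails
            }
        }
    }
}
```

Because each document is unloaded before the next one is loaded, `peakLoaded` never exceeds 1 in this sketch, which is the property that keeps memory use bounded for large corpora.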
The processed documents are automatically saved back to the datastore so you may want to use a copy of the datastore to experiment.
Be careful: if some documents from the datastore corpus are already loaded before you run the application, they will be neither unloaded nor saved. To save such a document, right click on it in the resources tree view and save it to the datastore.
3.8.2 Running PRs Conditionally on Document Features [#]
The ‘Conditional Pipeline’ and ‘Conditional Corpus Pipeline’ application types are conditional versions of the pipelines mentioned in Section 3.8 and allow processing resources to be run or not according to the value of a feature on the document. In terms of graphical interface, the only addition brought by the conditional versions of the applications is a box situated underneath the lists of available and selected resources which allows the user to choose whether the currently selected processing resource will run always, never or only on the documents that have a particular value for a named feature.
If the Yes option is selected then the corresponding resource will be run on all the documents processed by the application as in the case of non-conditional applications. If the No option is selected then the corresponding resource will never be run; the application will simply ignore its presence. This option can be used to temporarily and quickly disable an application component, for debugging purposes for example.
The If value of feature option permits running specific application components conditionally on document features. When selected, this option enables two text input fields that are used to enter the name of a feature and the value of that feature for which the corresponding processing resource will be run. When a conditional application is run over a document, for each component that has an associated condition, the value of the named feature is checked on the document and the component will only be used if the value entered by the user matches the one contained in the document features.
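The decision made for each component can be summarised in a few lines of Java. The sketch below is a simplified model rather than GATE's implementation: the `ConditionalRun` class, the `Mode` enum and the `shouldRun` method are hypothetical names, and comparing the feature value as a string (with a missing feature never matching) is an assumption.

```java
import java.util.Map;

class ConditionalRun {
    /** The three choices offered for each processing resource. */
    enum Mode { YES, NO, IF_VALUE_OF_FEATURE }

    /**
     * Decides whether a component runs on a document, given the document's features.
     * Assumption: values are compared via their string form.
     */
    static boolean shouldRun(Mode mode, String featureName, String featureValue,
                             Map<String, Object> documentFeatures) {
        switch (mode) {
            case YES:
                return true;   // run on every document
            case NO:
                return false;  // component is ignored
            default:
                Object actual = documentFeatures.get(featureName);
                return actual != null && actual.toString().equals(featureValue);
        }
    }
}
```

For example, a component configured with feature name `lang` and value `en` would run only on documents whose `lang` feature is `en`; both names here are purely illustrative.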
At first sight the conditional behaviour available with these controllers may seem limited, but in fact it is very powerful when used in conjunction with JAPE grammars (see chapter 8). Complex conditions can be encoded in JAPE rules which set the appropriate feature values on the document for use by the conditional controllers. Alternatively, the Groovy plugin provides a scriptable controller (see section 7.16.3) in which the execution strategy is defined by a Groovy script, allowing much richer conditional behaviour to be encoded directly in the controller’s configuration.
3.8.3 Doing Information Extraction with ANNIE [#]
This section describes how to load and run ANNIE (see Chapter 6) from GATE Developer. ANNIE is a good place to start because it provides a complete information extraction application, that you can run on any corpus. You can then view the effects.
From the File menu, select ‘Load ANNIE System’. To run it in its default state, choose ‘with Defaults’. This will automatically load all the ANNIE resources, and create a corpus pipeline called ANNIE with the correct resources selected in the right order, and the default input and output annotation sets.
If ‘without Defaults’ is selected, the same processing resources will be loaded, but a popup window will appear for each resource, which enables the user to specify a name, location and other parameters for the resource. This is exactly the same procedure as for loading a processing resource individually, the difference being that the system automatically selects those resources contained within ANNIE. When the resources have been loaded, a corpus pipeline called ANNIE will be created as before.
The next step is to add a corpus (see Section 3.3), and select this corpus from the drop-down corpus menu in the Serial Application editor. Finally, run the application either by clicking on ‘Run’ in the Serial Application editor, or by right clicking on the application name in the resources pane and selecting ‘Run’. (Many people prefer to switch to the messages tab, then run their application by right-clicking on it in the resources pane, because then it is possible to monitor any messages that appear whilst the application is running.)
To view the results, double click on one of the documents contained in the processed corpus in the left hand tree view. Neither annotation sets nor annotations will be shown until annotations are selected in the annotation sets; the ‘Default’ set is indicated only with an unlabelled right-arrowhead which must be selected in order to make visible the available annotations. Open the default annotation set and select some of the annotations to see what the ANNIE application has done.
See also the movie for loading and running ANNIE.
3.8.4 Modifying ANNIE [#]
You will need to first make a copy of ANNIE resources by extracting them from the ANNIE plugin via the plugin manager. Once you have a copy of the resources simply locate the file(s) you want to modify, edit them, and then use them to configure the appropriate ANNIE processing resources.
3.9 Saving Applications and Language Resources [#]
In this section, we will describe how applications and language resources can be saved for use outside of GATE and for use with GATE at a later time. Section 3.9.1 talks about saving documents to file. Section 3.9.2 outlines how to use datastores. Section 3.9.3 talks about saving application states (resource parameter states), and Section 3.9.4 talks about exporting applications together with referenced files and resources to a ZIP file.
3.9.1 Saving Documents to File [#]
There are three main ways to save annotated documents:
- in GATE’s own XML serialisation format (including all the annotations on the document);
- in an inline XML format that saves the original markup and selected annotations;
- by writing your own exporter algorithm as a processing resource.
This section describes how to use the first two options.
Both types of data export are available in the popup menu triggered by right-clicking on a document in the resources tree (see Section 3.1) and selecting the “Save As...” menu. In addition, all documents in a corpus can be saved as individual XML files into a directory by right-clicking on the corpus instead of individual documents.
Selecting to save as GATE XML leads to a file open dialogue; give the name of the file you want to create, and the whole document and all its data will be exported to that file. If you later create a document from that file, the state will be restored. (Note: because GATE’s annotation model is richer than that of XML, and because our XML dump implementation sometimes cuts corners, the state may not be identical after restoration. If your intention is to store the state for later use, use a DataStore instead.)
The ‘Inline XML’ option leads to a richer dialog than the ‘GATE XML’ option. This allows you to select the file to save to at the top of the dialog box, but also then allows you to configure exactly what is saved and how. By default the exporter is configured to save all the annotations from ‘Original markups’ (i.e. those extracted from the source document when it was loaded) as well as Person, Organization, and Location from the default set (i.e. the main output annotations from running ANNIE). Features of these annotations will also be saved.
The annotations are saved as normal XML document tags, using the annotation type as the tag name. If you choose to save features then they will be added as attributes to the relevant XML tags.
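To make the mapping concrete, here is a minimal sketch of how a single annotation could be rendered as an inline tag. It is not the exporter's actual code: the `InlineXml` class and its methods are hypothetical, and the alphabetical attribute order and the escaping rules are assumptions made for the sake of a deterministic example.

```java
import java.util.Map;
import java.util.TreeMap;

class InlineXml {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Wraps the covered text in a tag named after the annotation type, with features as attributes. */
    static String toTag(String type, Map<String, String> features, String coveredText) {
        StringBuilder sb = new StringBuilder("<").append(type);
        // A TreeMap gives a stable, alphabetical attribute order for this sketch.
        for (Map.Entry<String, String> f : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(f.getKey())
              .append("=\"").append(escape(f.getValue())).append('"');
        }
        return sb.append('>').append(escape(coveredText))
                 .append("</").append(type).append('>').toString();
    }
}
```

With a Person annotation over the text ‘John Smith’ carrying a hypothetical feature `gender` with value `male`, `toTag` yields `<Person gender="male">John Smith</Person>`.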
Note that GATE’s model of annotation allows graph structures, which are difficult to represent in XML (XML is a tree-structured representation format). During the dump process, annotations that cross each other in ways that cannot be represented in legal XML will be discarded, and a warning message printed.
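The condition that makes two annotations impossible to nest as XML elements can be stated precisely. The following sketch tests it for a pair of character-offset spans; the `Crossing` class is a hypothetical helper, and it shows only the well-formedness condition, not how GATE chooses which of the crossing annotations to discard.

```java
class Crossing {
    /**
     * True when two [start, end) spans overlap only partially,
     * i.e. they share some text but neither contains the other.
     */
    static boolean crosses(int start1, int end1, int start2, int end2) {
        boolean overlap = start1 < end2 && start2 < end1;
        boolean firstContainsSecond = start1 <= start2 && end2 <= end1;
        boolean secondContainsFirst = start2 <= start1 && end1 <= end2;
        return overlap && !firstContainsSecond && !secondContainsFirst;
    }
}
```

Spans 0–10 and 5–15 cross, so they cannot both become XML elements, whereas spans 0–10 and 2–5 nest cleanly and spans 0–5 and 5–10 are simply adjacent.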
Saving documents that were not created from an HTML or XML file using this ‘Inline XML’ format often results in a plain text file, with in-line tags for the saved annotations. In other words, if the set of annotations you are saving does not include an annotation which spans the entire document, the result will not be valid XML and may not load back into GATE. This format should really be considered a legacy format, and it may be removed in future versions of GATE.
3.9.2 Saving and Restoring LRs in Datastores [#]
Where corpora are large, the memory available may not be sufficient to have all documents open simultaneously. The datastore functionality provides the option to save documents to disk and open them only one at a time for processing. This means that much larger corpora can be used. A datastore can also be useful for saving documents in an efficient and lossless way.
To save a text in a datastore, a new datastore must first be created if one does not already exist. Create a datastore by right clicking on Datastore in the left hand pane, and select the option ‘Create Datastore’. Select the data store type you wish to use. Create a directory to be used as the datastore (note that the datastore is a directory and not a file).
You can either save a whole corpus to the datastore (in which case the structure of the corpus will be preserved) or you can save individual documents. The recommended method is to save the whole corpus. To save a corpus, right click on the corpus name and select the ‘Save to...’ option (giving the name of the datastore created earlier). To save individual documents to the datastore, right click on each document name and follow the same procedure.
To load a document from a datastore, do not try to load it as a language resource. Instead, open the datastore by right clicking on Datastore in the left hand pane, select ‘Open Datastore’ and choose the datastore to open. The datastore tree will appear in the main window. Double click on a corpus or document in this tree to open it. To save a corpus and document back to the same datastore, simply select the ‘Save’ option.
See also the movie for creating a datastore and the movie for loading corpus and documents from a datastore.
3.9.3 Saving Application States to a File
Resources, and applications that are made up of them, are created based on the settings of their parameters (see Section 3.7). It is possible to save the data used to create an application to a file and re-load it later. To save the application to a file, right click on it in the resources tree and select ‘Save application state’, which will give you a file creation dialogue. Choose a file name that ends in gapp, because this file dialog and the one for loading application states both display all files which have a name ending in gapp. A common convention is to use .gapp or .xgapp as a file extension.
To restore the application later, select ‘Restore application from file’ from the ‘File’ menu.
Note that the data that is saved represents how to recreate an application – not the resources that make up the application itself. So, for example, if your application has a resource that initialises itself from some file (e.g. a grammar, a document) then that file must still exist when you restore the application.
In case you don’t want to save the corpus configuration associated with the application then you must select ‘<none>’ in the corpus list of the application before saving the application.
The file resulting from saving the application state contains the values of the initialisation and runtime parameters for all the processing resources contained by the stored application as well as the values of the initialisation parameters for all the language resources referenced by those processing resources. Note that if you reference a document that has been created with an empty URL and empty string content parameter and subsequently been manually edited to add content, that content will not be saved. In order for document content to be preserved, load the document from an URL, specify the content as for the string content parameter or use a document from a datastore.
For the parameters of type URL or “ResourceReference” (which are typically used to select resources such as grammars or rules files either from inside the plugin or elsewhere on disk) a transformation is applied so that the paths are stored relative to either the location of the saved application state file or a special user resources home directory, according to the following rules:
-
If the property gate.user.resourceshome is set to the path of a directory and the resource is located inside that directory but the state file is saved to a location outside of this directory, the path is stored relative to this directory and the path marker $resourceshome$ is used.
-
In all other situations, the path is stored relative to the location of the application state file and the path marker $relpath$ is used.
References to resources inside GATE plugins are stored as a special type of URI of the form creole://group;artifact;version/path/inside/plugin. In this way, all resource files that are part of plugins are always used correctly, no matter where the plugins are stored. Resource files which are not part of a plugin and used by an application do not need to be in the same location as when the application was initially created but rather in the same location relative to the location of the application file. In addition, if your application uses a project-specific location for global resources or project-specific plugins, the Java property gate.user.resourceshome can be set to this location and the application will be stored so that this location will also always be used correctly, no matter where the application state file is copied to. To set the resources home directory, the -rh location option for the Linux script gate.sh to start GATE can be used. The combination of these features allows the creation and deployment of portable applications by keeping the application file and the resource files used by the application together.
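To illustrate the shape of these plugin-relative references, the sketch below splits a URI of the form creole://group;artifact;version/path/inside/plugin into its four parts using plain string handling. It is not GATE's own resolver; the class name is hypothetical, and the path resources/NE/main.jape used in the example is an assumed path chosen purely for illustration.

```java
class CreoleUriSketch {
  /**
   * Splits "creole://group;artifact;version/path" into
   * {group, artifact, version, path}.
   */
  static String[] parse(String uri) {
    String prefix = "creole://";
    if (!uri.startsWith(prefix)) {
      throw new IllegalArgumentException("Not a creole:// URI: " + uri);
    }
    String rest = uri.substring(prefix.length());
    int slash = rest.indexOf('/');
    if (slash < 0) {
      throw new IllegalArgumentException("Missing path inside plugin: " + uri);
    }
    // The part before the first slash holds the three plugin co-ordinates.
    String[] coords = rest.substring(0, slash).split(";");
    if (coords.length != 3) {
      throw new IllegalArgumentException("Expected group;artifact;version: " + uri);
    }
    return new String[] {coords[0], coords[1], coords[2], rest.substring(slash + 1)};
  }

  public static void main(String[] args) {
    // Hypothetical reference to a file inside a plugin.
    String[] parts = parse("creole://uk.ac.gate.plugins;annie;8.5/resources/NE/main.jape");
    System.out.println("group=" + parts[0] + " artifact=" + parts[1]
        + " version=" + parts[2] + " path=" + parts[3]);
  }
}
```

Because the reference names the plugin by its co-ordinates rather than by a file system location, the same saved application can find the file wherever the plugin happens to be stored.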
If your application uses resources from inside plugins then those resources may change if you upgrade your application to a newer version of the plugin. If you want to upgrade to a newer plugin but keep the same resources you should export a copy of the resource files from the plugin onto disk and load them from there instead of using the plugin-relative defaults.
When an application is restored from an application state file, GATE uses the keyword $relpath$ for paths relative to the location of the gapp file and $resourceshome$ for paths relative to the location that the property gate.user.resourceshome is set to. Other keywords exist that can be useful in some cases; to use them you will need to edit the gapp file manually. For example, you can use $sysprop:...$ to declare paths relative to any Java system property, such as $sysprop:user.home$.
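The effect of the path markers can be sketched as a simple prefix substitution. The code below is an illustration under stated assumptions, not GATE's restoration code: the class and method names are hypothetical, the stored value is assumed to consist of a marker followed directly by a relative path, and the example directories are invented.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class PathMarkerSketch {
  private static final String RELPATH = "$relpath$";
  private static final String RESHOME = "$resourceshome$";
  private static final String SYSPROP = "$sysprop:";

  /** Resolves a stored path against the state file directory or another base. */
  static Path resolve(String stored, Path gappDir, Path resourcesHome) {
    if (stored.startsWith(RELPATH)) {
      return under(gappDir, stored.substring(RELPATH.length()));
    }
    if (stored.startsWith(RESHOME)) {
      if (resourcesHome == null) {
        throw new IllegalStateException("resources home directory is not set");
      }
      return under(resourcesHome, stored.substring(RESHOME.length()));
    }
    if (stored.startsWith(SYSPROP)) {
      int end = stored.indexOf('$', SYSPROP.length());
      if (end < 0) {
        throw new IllegalArgumentException("Unterminated marker: " + stored);
      }
      String name = stored.substring(SYSPROP.length(), end);
      String value = System.getProperty(name);
      if (value == null) {
        throw new IllegalStateException("System property not set: " + name);
      }
      return under(Paths.get(value), stored.substring(end + 1));
    }
    // No marker: treat the value as an ordinary path.
    return Paths.get(stored);
  }

  private static Path under(Path base, String rest) {
    // Drop leading slashes so the remainder is always treated as relative.
    String relative = rest.replaceFirst("^/+", "");
    return base.resolve(relative).normalize();
  }

  public static void main(String[] args) {
    Path gappDir = Paths.get("/apps/demo"); // assumed location of the state file
    System.out.println(resolve("$relpath$grammars/main.jape", gappDir, null));
    System.out.println(resolve("$sysprop:user.home$/lists/names.lst", gappDir, null));
  }
}
```

The point of the indirection is that moving the state file together with its resources changes only the base directory, never the stored value.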
If you want to save your application along with all plugins and resources it requires you can use the ‘Export for GATE Cloud’ option (see Section 3.9.4).
See also the movie for saving and restoring applications.
3.9.4 Saving an Application with its Resources (e.g. GATE Cloud)
When you save an application using the ‘Save application state’ option (see Section 3.9.3), the saved file contains references to the plugins that were loaded when the application was saved, and to any resource files required by the application. To be able to reload the file, these plugins and other dependencies must exist at the same locations (relative to the saved state file). While this is fine for saving and loading applications on a single machine it means that if you want to package your application to run it elsewhere (e.g. deploy it to GATE Cloud) then you need to be careful to include all the resource files and plugins at the right locations in your package. The ‘Export for GATE Cloud’ option on the right-click menu for an application helps to automate this process.
When you export an application in this way, GATE Developer produces a ZIP file containing the saved application state (in the same format as ‘Save application state’). Any plugins and resource files that the application refers to are also included in the zip file, and the relative paths in the saved state are rewritten to point to the correct locations within the package. The resulting package is therefore self-contained and can be copied to another machine and unpacked there, or passed to GATE Cloud for deployment. Maven-style plugins will be resolved from within the package rather than being downloaded at runtime from the internet.
There are a few important points to note about the export process:
-
All plugins that are loaded at the point when you perform the export will be included in the resulting package. Use the plugin manager to unload any plugins your application is not using before you export it.
-
If your application refers to a resource file that is in a directory on disk rather than inside one of the loaded plugins, the entire contents of this directory will be recursively included in the package. If you have a number of unrelated resources in a single directory (e.g. many sets of large gazetteer lists) you may want to separate them into separate directories so that only the relevant ones are included in the package.
-
The packager only knows about resources that your application refers to directly in its parameters. For example, if your application includes a multi-phase JAPE grammar the packager will only consider the main grammar file, not any of its sub-phases. If the sub-phases are not contained in the same directory as the main grammar you may find they are not included. If indirect references of this kind are all to files under the same directory as the ‘master’ file it will work OK.
If you require more flexibility than this option provides you should read Section E.2, which describes the underlying Ant task that the exporter uses.
3.9.5 Upgrade An Application to use Newer Versions of Plugins
Some of the changes introduced in GATE 8.5 mean that applications saved with a previous version of GATE might not load without being updated. Loading such an application is likely to result in errors similar to those seen in Figure 3.15.
In order to load such applications into GATE 8.5 (or above), you first need to upgrade them to use compatible versions of the relevant plugins. In most cases this process can be automated and we provide a tool to walk you through the process. To start upgrading an application select ‘Upgrade XGapp’ from the ‘Tools’ menu. This will first ask you to choose an application file to upgrade and will then present the UI shown in Figure 3.16.
Once the application has been analysed the tool will show you a table in which each row signifies a plugin used by the app. In the leftmost column it lists the plugin currently referenced by the application. This is followed by details of the new plugin. While in most cases the tool can correctly determine the right plugin to offer in this column, you can correct any mistakes by double-clicking the incorrect plugin and then specifying the correct plugin location. The final two columns determine if the plugin is upgraded and to which version. The versions offered are all those which are available and known to be compatible with the version of GATE you are running. By default the latest available version will be selected, although -SNAPSHOT versions are only selected by default if you are also running a -SNAPSHOT version of GATE.
The ‘Upgrade’ column allows you to determine if and how a plugin will be upgraded. The three possible choices are Upgrade, Plugin Only, and Skip. Skip is fairly self explanatory but Upgrade and Plugin Only require a little more explanation. Upgrade means that not only will the plugin location be upgraded, but any resources that reside within the plugin will also be changed to reference those within the new plugin. This is the only upgrade option when considering a plugin which was originally part of the GATE distribution. The Plugin Only option allows you to change the application to load a new version of the plugin while leaving the resource locations untouched. This is useful for cases where you have edited the resources inside a plugin rather than having created a separate copy specific to the application.
After upgrade, the old version of the application file will still be available but will have been renamed by adding the ’.bak’ suffix.
In most cases this upgrade process will work without issue. If, however, you find you have an application which fails to open after the upgrade then it may be because one or more plugins couldn’t be correctly mapped to new versions. In these cases the best option is to revert the upgrade (replace the xgapp file with the generated backup), load the application into GATE 8.4.1 and then use the “Export for GATE Cloud” option to produce a self contained application (see Section 3.9.4). Then finally run the upgrade tool over this version of the application.
The two buttons at the top of the dialog allow you to save and restore the mappings defined in the table. This makes it easier to upgrade a set of related applications which should all be upgraded in a similar fashion.
Note that this process is not limited simply to upgrading applications saved prior to GATE 8.5 but can be used at any time to upgrade the version of a plugin used by an application.
3.10 Keyboard Shortcuts
You can use various keyboard shortcuts for common tasks in GATE Developer. These are listed in this section.
General (Section 3.1):
-
F1 Display a help page for the selected component
-
Alt+F4 Exit the application without confirmation
-
Tab Put the focus on the next component or frame
-
Shift+Tab Put the focus on the previous component or frame
-
F6 Put the focus on the next frame
-
Shift+F6 Put the focus on the previous frame
-
Alt+F Show the File menu
-
Alt+O Show the Options menu
-
Alt+T Show the Tools menu
-
Alt+H Show the Help menu
-
F10 Show the first menu
Resources tree (Section 3.1):
-
Enter Show the selected resources
-
Ctrl+H Hide the selected resource
-
Ctrl+Shift+H Hide all the resources
-
F2 Rename the selected resource
-
Ctrl+F4 Close the selected resource
Document editor (Section 3.2):
-
Ctrl+F Show the search dialog for the document
-
Ctrl+E Edit the annotation at the caret position
-
Ctrl+S Save the document in a file
-
F3 Show/Hide the annotation sets
-
Shift+F3 Show the annotation sets with preselection
-
F4 Show/Hide the annotations list
-
F5 Show/Hide the coreference editor
-
F7 Show/Hide the text
Annotation editor (Section 3.4):
-
Right/Left Grow/Shrink the annotation span at its start
-
Alt+Right/Alt+Left Grow/Shrink the annotation span at its end
-
+Shift/+Ctrl+Shift Use a span increment of 5/10 characters
-
Alt+Delete Delete the currently edited annotation
Annic/Lucene datastore (Chapter 9):
-
Alt+Enter Search the expression in the datastore
-
Alt+Backspace Delete the search expression
-
Alt+Right Display the next page of results
-
Alt+Left Display the row manager
-
Alt+E Export the results to a file
Annic/Lucene query text field (Chapter 9):
-
Ctrl+Enter Insert a new line
-
Enter Search the expression
-
Alt+Top Select the previous result
-
Alt+Bottom Select the next result
3.11 Miscellaneous
3.11.1 Stopping GATE from Restoring Developer Sessions/Options
GATE can remember Developer options and the state of the resource tree when it exits. The options are saved by default; the session state is not saved by default. This default behaviour can be changed from the ‘Advanced’ tab of the ‘Configuration’ choice on the ‘Options’ menu (or the ‘Preferences’ option on the ‘GATE’ application menu on Mac).
If a problem occurs and the saved data prevents GATE Developer from starting, you can fix this by deleting the configuration and session data files. These are stored in your home directory, and are called gate.xml and gate.session or .gate.xml and .gate.session depending on platform. On Windows your home is typically:
-
95, 98, NT:
-
Windows Directory/profiles/username
-
2000, XP:
-
Windows Drive/Documents and Settings/username
-
Windows 7 or later
-
Windows Drive/Users/username
though the directory name may be in your local language if your copy of Windows is not in English.
3.11.2 Working with Unicode
When you create a document from a URL pointing to textual data in GATE, you have to tell the system what character encoding the text is stored in. By default, GATE will set this parameter to be the empty string. This tells Java to use the default encoding for whatever platform it is running on at the time – e.g. on Western versions of Windows this will be ISO-8859-1, and Eastern ones ISO-8859-9. On Linux systems, the default encoding is influenced by the LANG environment variable, e.g. when this variable is set to en_US.utf-8 the default encoding used will be UTF-8. You can change the default encoding used by GATE to UTF-8 by adding -Dfile.encoding=UTF-8 to the gate.l4j.ini file.
A popular way to store Unicode documents is in UTF-8, which is a superset of ASCII (but can still store all Unicode data); if you get an error message about document I/O during reading, try setting the encoding to UTF-8, or some other locally popular encoding.
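The consequence of choosing the wrong encoding can be seen with the standard Java library alone. The sketch below (the class name is hypothetical, and nothing here is specific to GATE) encodes a short string as UTF-8 and then decodes the same bytes twice: once correctly and once as ISO-8859-1, which silently produces garbled characters rather than an error.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
  /** Decodes raw bytes using an explicitly named character encoding. */
  static String decode(byte[] data, Charset encoding) {
    return new String(data, encoding);
  }

  public static void main(String[] args) {
    // "café", written with a Unicode escape so the source file stays ASCII.
    byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

    // The encoding Java falls back to when none is specified.
    System.out.println("Platform default: " + Charset.defaultCharset());

    // Correct: the bytes are interpreted with the encoding they were written in.
    System.out.println(decode(utf8, StandardCharsets.UTF_8));

    // Wrong: each byte of the two-byte sequence becomes a separate character.
    System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
  }
}
```

Specifying the encoding explicitly when creating a document avoids depending on whatever the platform default happens to be.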
Chapter 4
CREOLE: the GATE Component Model
The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun, as described in section 4.2.
GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.
GATE components are one of three types:
-
LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;
-
ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;
-
VisualResources (VRs) represent visualisation and editing components that participate in GUIs.
The distinction between language resources and processing resources is explored more fully in section D.1.1. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.
In the rest of this chapter:
-
Section 4.3 describes the lifecycle of GATE components;
-
Section 4.4 describes how Processing Resources can be grouped into applications;
-
Section 4.5 describes the relationship between Language Resources and their datastores;
-
Section 4.6 summarises GATE’s set of built-in components;
-
Section 4.7 describes how configuration data for Resource types is supplied to GATE.
4.1 The Web and CREOLE
GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations for configuration of resources (and GATE itself).
Resource implementations are grouped together as ‘plugins’, stored either in a single JAR file published via the standard Maven repository mechanism, or at a URL (when the resources are in the local file system this would be a file:/ URL). When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL or inside the plugin JAR file and uses the contents of this file in combination with Java annotations on the source code to determine what resources this plugin declares and, in the case of directory-style plugins, where to find the classes that implement the resource types (typically a JAR file in the plugin directory). GATE retrieves the configuration information from the plugin’s resource classes and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.
Language resource data can be stored in binary serialised form in the local file system.
4.2 The GATE Framework
We can think of the GATE framework as a backplane into which users can plug CREOLE components. The user gives the system a list of plugins to search when it starts up, and components in those plugins are loaded by the system.
The backplane performs these functions:
-
component discovery, bootstrapping, loading and reloading;
-
management and visualisation of native data structures for common information types;
-
generalised data storage and process execution.
A set of components plus the framework is a deployment unit which can be embedded in another application.
At their most basic, all GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions:
-
beans must have no-argument constructors.
-
beans have properties, defined by pairs of methods named by the convention setProp and getProp.
GATE uses Java Beans conventions to construct and configure resources at runtime, and defines interfaces that different component types must implement.
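The two conventions above are what make it possible to build and configure an object knowing only its class and a property name. The following sketch shows the general technique with plain reflection; it is not GATE's Factory code, and the class names, the ExampleResource bean and its encoding property are all hypothetical.

```java
import java.lang.reflect.Method;

class BeanSketch {
  /** Hypothetical bean: a no-argument constructor plus a get/set pair. */
  public static class ExampleResource {
    private String encoding = "";

    public ExampleResource() {}

    public String getEncoding() { return encoding; }

    public void setEncoding(String encoding) { this.encoding = encoding; }
  }

  /** Creates an instance of the class and sets one property by name. */
  static Object create(Class<?> type, String property, Object value) {
    try {
      // Convention 1: a no-argument constructor is always available.
      Object bean = type.getDeclaredConstructor().newInstance();
      // Convention 2: property "prop" is written through "setProp".
      String setter = "set" + Character.toUpperCase(property.charAt(0))
          + property.substring(1);
      Method method = type.getMethod(setter, value.getClass());
      method.invoke(bean, value);
      return bean;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not configure " + type.getName(), e);
    }
  }

  public static void main(String[] args) {
    ExampleResource r =
        (ExampleResource) create(ExampleResource.class, "encoding", "UTF-8");
    System.out.println(r.getEncoding());
  }
}
```

Note that this simplified lookup requires the setter's parameter type to match the value's class exactly; a real framework would need to handle subtypes and primitive types as well.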
4.3 The Lifecycle of a CREOLE Resource
CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using GATE Developer, resources can be loaded and viewed via the resources tree (left pane) and the ‘create resource’ mechanism. When programming with GATE Embedded, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see Chapter 6) does; in this case you will use GATE Developer to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.
The various phases may be summarised as:
-
Creating a new resource from scratch (bootstrapping).
-
To create the binary image of a resource (a Java class in a JAR file), and the XML file that describes the resource to GATE, you need to create the appropriate .java file(s), compile them and package them as a .jar. GATE provides a Maven archetype to start this process – see Section 7.12. Alternatively you can simply copy code from an existing resource.
-
Instantiating a resource in GATE Embedded.
-
To create a resource in your own Java code, use GATE’s Factory class (this takes care of parameterising the resource, restoring it from a database where appropriate, etc. etc.). Section 7.2 describes how to do this.
-
Loading a resource into GATE Developer.
-
To load a resource into GATE Developer, use the various ‘New ... resource’ options from the File menu and elsewhere. See Section 3.1.
-
Resource configuration and implementation.
-
GATE’s Maven archetype will create an empty resource that does nothing. In order to achieve the behaviour you require, you’ll need to change the Java code and its configuration annotations. See section 4.7 for more details.
4.4 Processing Resources and Applications
PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In GATE, applications are called ‘controllers’ accordingly.
Currently the main application types provided by GATE implement sequential or “pipeline” control flow. There are two main types of pipeline:
-
Simple pipelines
-
simply group a set of PRs together in order and execute them in turn. The implementing class is called SerialController.
-
Corpus pipelines
-
are specific to LanguageAnalysers – PRs that are applied to documents and corpora. A corpus pipeline opens each document in the corpus in turn, sets that document as a runtime parameter on each PR, runs all the PRs on that document, then closes the document. The implementing class is called SerialAnalyserController.
Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.8.2 for how to use these. If more flexibility is required, the Groovy plugin provides a scriptable controller (see section 7.16.3) whose execution strategy is specified using the Groovy programming language.
Controllers are themselves PRs – in particular a simple pipeline is a standard PR and a corpus pipeline is a LanguageAnalyser – so one pipeline can be nested in another. This is particularly useful with conditional controllers to group together a set of PRs that can all be turned on or off as a group.
There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process are simply ignored, and the execution moves to the next document after the timeout interval has lapsed.
All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application’s execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.
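The control flow of a corpus pipeline, including the start and end callbacks, can be sketched without any GATE classes. Everything in the example below is a hypothetical stand-in: the Analyser and ExecutionAware interfaces are simplified models of the ideas described above (they are not the GATE interfaces), and documents are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CorpusPipelineSketch {
  /** Hypothetical stand-in for a PR that is applied to one document at a time. */
  interface Analyser {
    void setDocument(String document);
    void execute();
  }

  /** Hypothetical stand-in for callbacks at the start and end of a whole run. */
  interface ExecutionAware {
    void executionStarted();
    void executionFinished();
  }

  /** Runs every PR over every document, with callbacks around the whole run. */
  static void run(List<? extends Analyser> prs, List<String> corpus) {
    for (Analyser pr : prs) {
      if (pr instanceof ExecutionAware) ((ExecutionAware) pr).executionStarted();
    }
    for (String document : corpus) {
      for (Analyser pr : prs) {
        pr.setDocument(document); // the document is a runtime parameter
        pr.execute();
      }
    }
    for (Analyser pr : prs) {
      if (pr instanceof ExecutionAware) ((ExecutionAware) pr).executionFinished();
    }
  }

  /** Example PR that builds an aggregate statistic across the corpus. */
  static class CharCounter implements Analyser, ExecutionAware {
    final List<String> log = new ArrayList<>();
    private String document;
    private int total;

    public void setDocument(String document) { this.document = document; }

    public void execute() {
      total += document.length();
      log.add("doc:" + document);
    }

    public void executionStarted() {
      total = 0;
      log.add("start");
    }

    public void executionFinished() { log.add("total:" + total); }
  }

  public static void main(String[] args) {
    CharCounter counter = new CharCounter();
    run(Arrays.asList(counter), Arrays.asList("ab", "cde"));
    System.out.println(counter.log);
  }
}
```

The counter resets its state in the start callback and reports in the end callback, which is the pattern a PR would follow when it needs to aggregate over the whole corpus rather than over a single document.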
4.5 Language Resources and Datastores
Language Resources can be stored in Datastores. Datastores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:
-
Serial Datastores
-
are based on Java’s serialisation system, and store data directly into files and directories.
-
Lucene Datastores
-
are a full-featured annotation indexing and retrieval system. They are provided as part of an extension of the Serial Datastores. See Chapter 9 for more details.
4.6 Built-in CREOLE Resources
GATE comes with various built-in components:
-
Language Resources modelling Documents and Corpora, and various types of Annotation Schema – see Chapter 5.
-
Processing Resources that are part of the ANNIE system – see Chapter 6.
-
Gazetteers – see Chapter 13.
-
Ontologies – see Chapter 14.
-
Machine Learning resources – see Chapter 19.
-
Alignment tools – see Chapter 20.
-
Parsers and taggers – see Chapter 18.
-
Other miscellaneous resources – see Chapter 23.
4.7 CREOLE Resource Configuration
This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory or JAR file containing an XML configuration file called creole.xml at its root. The creole.xml file provides metadata about the plugin as a whole; the configuration for individual resource classes is given directly in the Java source file using Java annotations.
A creole.xml file has a root element <CREOLE-DIRECTORY> which supports several optional attributes:
-
NAME:
-
The name of the plugin. Used in the GUI to help identify the plugin in a nicer way than the directory or artifact name.
-
VERSION:
-
The version number of the plugin. For example, 3, 3.1, 3.11, 3.12-SNAPSHOT etc.
-
DESCRIPTION:
-
A short description of the resources provided by the plugin. Note that there is really only space for a single sentence in the GUI.
-
GATE-MIN:
-
The earliest version of GATE that this plugin is compatible with. This should be in the same format as the version shown in the GATE titlebar, i.e. 8.5 or 8.6-beta1. Do not include the build number information.
Currently all these attributes are optional, as in most cases the information can be pulled from other elements of the plugin metadata; for example, plugins distributed via Maven will use information from the pom.xml if not specified.
For many simple single-JAR plugins the creole.xml file need have no other content – just an empty <CREOLE-DIRECTORY /> element – but there are certain child elements that are used in some types of plugin.
Directory-style plugins need at least one <JAR> child element to tell GATE where to find the classes that implement the plugin’s resources. Each <JAR> element contains a path to a JAR file, which is resolved relative to the location of the creole.xml, for example:
<CREOLE-DIRECTORY>
  <JAR SCAN="true">myPlugin.jar</JAR>
  <JAR>lib/thirdPartyLib.jar</JAR>
</CREOLE-DIRECTORY>
JAR files that contain resource classes must be specified with SCAN="true", which tells GATE to scan the JAR contents to discover resource classes annotated with @CreoleResource (see below). Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".
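As a rough illustration of how the <JAR> entries might be read, the sketch below parses the content of a creole.xml file with the standard Java XML parser and returns the JAR paths marked with SCAN="true". It is not GATE's plugin loader; the class and method names are hypothetical, and the exact matching of the attribute value is an assumption made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CreoleXmlSketch {
  /** Returns the paths of all JAR elements that carry SCAN="true". */
  static List<String> scannedJars(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList jars = doc.getDocumentElement().getElementsByTagName("JAR");
      List<String> result = new ArrayList<>();
      for (int i = 0; i < jars.getLength(); i++) {
        Element jar = (Element) jars.item(i);
        // getAttribute returns an empty string when SCAN is absent.
        if ("true".equals(jar.getAttribute("SCAN"))) {
          result.add(jar.getTextContent().trim());
        }
      }
      return result;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse creole.xml content", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<CREOLE-DIRECTORY>"
        + "<JAR SCAN=\"true\">myPlugin.jar</JAR>"
        + "<JAR>lib/thirdPartyLib.jar</JAR>"
        + "</CREOLE-DIRECTORY>";
    System.out.println(scannedJars(xml));
  }
}
```

Only myPlugin.jar is returned for the example file: the third-party library is on the plugin's classpath but is not searched for resource classes.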
Plugins can depend on other plugins, for example if a plugin defines a PR which internally makes use of a JAPE transducer then that plugin would declare that it depends on ANNIE (the standard plugin that defines the JAPE transducer PR). This is done with a <REQUIRES> element. To depend on a single-JAR plugin from a Maven repository, use an empty element with attributes GROUP, ARTIFACT and VERSION, for example
<CREOLE-DIRECTORY>
  <REQUIRES GROUP="uk.ac.gate.plugins" ARTIFACT="annie" VERSION="8.5" />
</CREOLE-DIRECTORY>
Directory-style plugins can also depend on other directory-style plugins using a relative path (e.g. <REQUIRES>../other-plugin</REQUIRES>), but this is generally discouraged – if your plugin is likely to be required as a dependency of other plugins then it is better converted to the single JAR Maven style so the dependency can be handled via group/artifact/version co-ordinates.
You may see old plugins with other elements such as <RESOURCE>; this is the older style of configuration in XML, which is now deprecated in favour of the annotations described below.
4.7.1 Configuring Resources using Annotations
The configuration of the resources within a plugin is handled using Java annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and used to construct the resource data that is registered with the CREOLE register. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation.
For more detailed information, see the JavaDoc documentation for gate.creole.metadata.
Basic Resource-Level Data
To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:
import gate.creole.AbstractLanguageAnalyser;
import gate.creole.metadata.*;

@CreoleResource(name = "GATE Tokeniser",
    comment = "Splits text into tokens and spaces")
public class Tokeniser extends AbstractLanguageAnalyser {
  ...
}
The @CreoleResource annotation provides slots for various configuration values:
-
name
-
(String) the name of the resource, as it will appear in the ‘New’ menu in GATE Developer. If omitted, defaults to the bare name of the resource class (without a package name).
-
comment
-
(String) a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in GATE Developer. If omitted, no comment is used.
-
helpURL
-
(String) a URL to a help document on the web for this resource. It is used in the help browser inside GATE Developer.
-
isPrivate
-
(boolean) should this resource type be hidden from the GATE Developer GUI, so it does not appear in the ‘New’ menus? If omitted, defaults to false (i.e. not hidden).
-
icon
-
(String) the icon to use to represent the resource in GATE Developer. If omitted, a generic language resource or processing resource icon is used. The value of this element can be:
-
a plain name such as “Application”, which is prepended with the package name gate.resources.img.svg. and the suffix “Icon”, which is assumed to be a Java class implementing javax.swing.Icon. GATE provides a collection of these icon classes which are generated from SVG files and are fully scalable for high-DPI monitors.
-
a path to an image file inside the plugin’s JAR, starting with a forward slash, e.g. /myplugin/images/icon.png
-
-
interfaceName
-
(String) the interface type implemented by this resource, for example a new type of document would specify "gate.Document" here.
-
tool
-
(boolean) is this resource type a tool? The “tool” flag identifies things like resource helpers and resources that contribute items to the tools menu in GATE Developer.
-
autoInstances
-
(array of @AutoInstance annotations) definitions for any instances of this resource that should be created automatically when the plugin is loaded. If omitted, no auto-instances are created by default. Auto-instances are useful for things like document formats and tools which contribute behaviour to other GATE resources, and which should be available by default whenever the plugin is loaded.
For visual resources only, the following elements are also available:
-
guiType
-
(GuiType enum) the type of GUI this resource defines. The options are LARGE (the VR should appear in the main right-hand panel of the GUI) or SMALL (the VR should appear in the bottom left hand corner below the resources tree).
-
resourceDisplayed
-
(String) the class name of the resource type that this VR displays, e.g. "gate.Corpus". Any resource whose type is assignable to this type will be displayed with this viewer, so for example a VR that can display all types of document would specify gate.Document, whereas a VR that can only display the default GATE document implementation would specify gate.corpora.DocumentImpl.
-
mainViewer
-
(boolean) is this VR the ‘most important’ viewer for its displayed resource type? If there are several different viewers that are all applicable to a particular resource type, the mainViewer hint helps GATE Developer decide which one should be initially visible as the selected tab.
For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).
Resource Parameters
Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:
@CreoleParameter(comment = "The location of the list of abbreviations")
public void setAbbrListUrl(URL listUrl) {
  ...
GATE will infer the parameter’s name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter’s type is inferred from the type of the method parameter (java.net.URL in this case).
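The naming rule described above can be sketched in a few lines of plain Java. This is only an illustration of the rule as stated in the text (strip the leading set, lower-case the next character); the class and method names are hypothetical and this is not GATE's own implementation.

```java
final class ParameterNameSketch {
    /**
     * Returns the property name for a setter such as "setAbbrListUrl",
     * or null if the method name does not look like a setter.
     */
    static String fromSetter(String methodName) {
        if (methodName == null || !methodName.startsWith("set") || methodName.length() == 3) {
            return null;
        }
        String rest = methodName.substring(3);
        // Lower-case only the first character after "set"; the rest is kept as written.
        return Character.toLowerCase(rest.charAt(0)) + rest.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(fromSetter("setAbbrListUrl")); // prints "abbrListUrl"
    }
}
```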
The annotation elements of @CreoleParameter are as follows:
-
comment
-
(String) an optional descriptive comment about the parameter.
-
defaultValue
-
(String) the optional default value for this parameter. The value is specified as a string but is converted to the relevant type by GATE according to the conversions described below.
-
suffixes
-
(String) for parameters of type URL or ResourceReference, a semicolon-separated list of default file suffixes that this parameter accepts.
-
collectionElementType
-
(Class) for Collection-valued parameters, the type of the elements in the collection. This can usually be inferred from the generic type information, for example public void setIndices(List<Integer> indices), but must be specified if the set method’s parameter has a raw (non-parameterized) type.
Parameter default values must be specified as strings, but parameters can be of any type and GATE applies the following rules to convert the default string into an appropriate value for the parameter type:
-
String
-
if the parameter is of type String the default value is used directly
-
Primitive wrapper types e.g. Integer
-
the string is passed to the relevant valueOf method
-
enum types
-
the value is passed to Enum.valueOf
-
java.net.URL or gate.creole.ResourceReference
-
the string is parsed as a URI, and if the URI is relative then it is resolved against the plugin (for directory-style plugins this means against the location of creole.xml and for Maven plugins it is the root of the plugin JAR file)
-
collection types (Set, List, etc.)
-
the string is treated as a semicolon-separated list of values, and each value is converted to the collection’s element type following these same rules.
-
gate.FeatureMap
-
the string is parsed as “feature1=value1;feature2=value2” etc. (a semicolon-separated list of “name=value” pairs)
-
any other java.* type
-
if the type has a constructor taking a String then that constructor is called with the default string as its parameter.
If there is no default specified, the default value is null.
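To make the conversion rules above more concrete, here is a minimal sketch in plain Java covering strings, a few wrapper types, enums, semicolon-separated collections and the String-constructor fallback. The class and method names are hypothetical, URL/ResourceReference and FeatureMap handling are deliberately left out, and GATE's real conversion code may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

final class DefaultValueSketch {
    /** Converts a default-value string to the given target type, following the rules listed above. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Object convert(String value, Class<?> type) {
        if (value == null) {
            return null; // no default specified: the default value is null
        }
        if (type == String.class) {
            return value; // used directly
        }
        // Wrapper types: pass the string to the relevant valueOf method.
        if (type == Integer.class) return Integer.valueOf(value);
        if (type == Long.class) return Long.valueOf(value);
        if (type == Double.class) return Double.valueOf(value);
        if (type == Boolean.class) return Boolean.valueOf(value);
        if (type.isEnum()) {
            return Enum.valueOf((Class<Enum>) type, value);
        }
        try {
            // Fallback: a constructor taking a single String.
            return type.getConstructor(String.class).newInstance(value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot convert '" + value + "' to " + type, e);
        }
    }

    /** Treats the string as a semicolon-separated list and converts each item to the element type. */
    static List<Object> convertList(String value, Class<?> elementType) {
        List<Object> result = new ArrayList<>();
        for (String item : value.split(";")) {
            result.add(convert(item, elementType));
        }
        return result;
    }
}
```

For instance, under these assumptions convertList("1;2;3", Integer.class) yields a list of three Integer values, which corresponds to a List<Integer> parameter declared with defaultValue = "1;2;3".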
Mutually-exclusive parameters are handled by adding a disjunction="label" and priority=n to the @CreoleParameter annotation – all parameters that share the same label are grouped in the same disjunction, and will be offered in order of priority. The parameter with the smallest priority value will be the one listed first, and thus the one that is offered initially when creating a resource of this type in GATE Developer. For example, the following is a simplified extract from gate.corpora.DocumentImpl:
@CreoleParameter(disjunction="src", priority=1)
public void setSourceUrl(URL src) { /* */ }

@CreoleParameter(disjunction="src", priority=2)
public void setStringContent(String content) { /* */ }
This declares the parameters “stringContent” and “sourceUrl” as mutually-exclusive, and when creating an instance of this resource in GATE Developer the parameter that will be shown initially is sourceUrl. To set stringContent instead the user must select it from the drop-down list. Parameters with the same declared priority value will appear next to each other in the list, but their relative ordering is not specified. Parameters with no explicit priority are always listed after those that do specify a priority.
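The ordering rule for a disjunction can be sketched as a simple sort: smallest priority first, with parameters that declare no priority placed last. The Param class below is a hypothetical stand-in for GATE's internal parameter data, and the third parameter name in the usage note is invented purely for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DisjunctionOrderSketch {
    /** Hypothetical holder for a parameter name and its optional priority. */
    static final class Param {
        final String name;
        final Integer priority; // null means no explicit priority was declared

        Param(String name, Integer priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    /** Returns the parameter names in the order they would be offered. */
    static List<String> order(List<Param> params) {
        List<Param> sorted = new ArrayList<>(params);
        // Smallest priority first; parameters without a priority sort after all others.
        sorted.sort(Comparator.comparing((Param p) -> p.priority,
                Comparator.nullsLast(Comparator.<Integer>naturalOrder())));
        List<String> names = new ArrayList<>();
        for (Param p : sorted) {
            names.add(p.name);
        }
        return names;
    }
}
```

Given stringContent with priority 2, sourceUrl with priority 1 and a further parameter with no priority, this sketch lists sourceUrl first, then stringContent, then the unprioritised parameter.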
Optional and runtime parameters are marked using the extra annotations @Optional and @RunTime, placed on the set method alongside @CreoleParameter.
Runtime parameters apply only to Processing Resources, and are parameters that are not used when the resource is initialised but instead only when it is executed. An “optional” parameter is one that does not have to be set before creating or executing the resource.
Inheritance
A resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class.
The resource name and the isPrivate and mainViewer flags are not inherited.
Parameter definitions are inherited in a similar way. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.
Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:
  @HiddenCreoleParameter
  public void setSomeParam(String someParam) {
    super.setSomeParam(someParam);
  }
The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.
Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:
  // superclass
  @CreoleParameter(comment = "Location of the grammar file",
                   suffixes = "jape")
  public void setGrammarUrl(URL grammarLocation) {
    ...
  }

  @Optional
  @RunTime
  @CreoleParameter(comment = "Feature to set on success")
  public void setSuccessFeature(String name) {
    ...
  }

  // subclass

  // override the default value, inherit everything else
  @CreoleParameter(defaultValue = "resources/defaultGrammar.jape")
  public void setGrammarUrl(URL url) {
    super.setGrammarUrl(url);
  }

  // we want the parameter to be required in the subclass
  @Optional(false)
  @CreoleParameter
  public void setSuccessFeature(String name) {
    super.setSuccessFeature(name);
  }
Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource.
4.7.2 Loading Third-Party Libraries in a Maven plugin [#]
A Maven plugin is distributed as a single JAR file, but if the plugin depends on any third-party libraries these can be specified as dependencies in the corresponding POM file in the usual Maven way as compile or runtime scoped dependencies.
If one plugin has a compile-time dependency on another (as opposed to simply a runtime dependency when one plugin creates resources defined in another) then you should specify the dependency in your POM as <scope>provided</scope> as well as declaring it in creole.xml with group/artifact/version.
4.8 Tools: How to Add Utilities to GATE Developer [#]
Visual Resources allow a developer to provide a GUI to interact with a particular resource type (PR or LR), but sometimes it is useful to provide general utilities for use in the GATE Developer GUI that are not tied to any specific resource type. Examples include the annotation diff tool and the Groovy console (provided by the Groovy plugin), both of which are self-contained tools that display in their own top-level window. To support this, the CREOLE model has the concept of a tool.
A resource type is marked as a tool by setting tool = true in the @CreoleResource annotation. If a resource is declared to be a tool, and written to implement the gate.gui.ActionsPublisher interface, then whenever an instance of the resource is created its published actions will be added to the “Tools” menu in GATE Developer.
Since the published actions of every instance of the resource will be added to the tools menu, it is best not to use this mechanism on resource types that can be instantiated by the user. The “tool” marker is best used in combination with the “private” flag (to hide the resource from the list of available types in the GUI) and one or more hidden autoinstance definitions to create a limited number of instances of the resource when its defining plugin is loaded. See the GroovySupport resource in the Groovy plugin for an example of this.
4.8.1 Putting Your Tools in a Sub-Menu [#]
If your plugin provides a number of tools (or a number of actions from the same tool) you may wish to organise your actions into one or more sub-menus, rather than placing them all on the single top-level tools menu. To do this, you need to put a special value into the actions returned by the tool’s getActions() method:
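A minimal sketch of such an action follows, using only javax.swing so that it compiles without GATE on the classpath. The MENU_PATH_KEY constant here is a local placeholder standing in for GateConstants.MENU_PATH_KEY (real code should use the GATE constant), and the action name "Word count" is invented for illustration.

```java
import java.awt.event.ActionEvent;
import javax.swing.AbstractAction;
import javax.swing.Action;

final class MenuPathSketch {
    // Placeholder only: in a real tool use GateConstants.MENU_PATH_KEY instead.
    static final String MENU_PATH_KEY = "menu-path-placeholder";

    /** Builds an action of the kind a tool's getActions() method might return. */
    static Action statisticsAction() {
        Action action = new AbstractAction("Word count") {
            @Override
            public void actionPerformed(ActionEvent e) {
                // The tool's behaviour would go here.
            }
        };
        // One array element per sub-menu level under the Tools menu.
        action.putValue(MENU_PATH_KEY, new String[] {"Acme toolkit", "Statistics"});
        return action;
    }
}
```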
The key must be GateConstants.MENU_PATH_KEY and the value must be an array of strings. Each string in the array represents the name of one level of sub-menus. Thus in the example above the action would be placed under “Tools → Acme toolkit → Statistics”. If no MENU_PATH_KEY value is provided the action will be placed directly on the Tools menu.
4.8.2 Adding Tools To Existing Resource Types [#]
While Visual Resources (VRs) allow you to add new features to a particular resource, they have a number of shortcomings. Firstly, not every new feature will require a full VR; often a new entry on the resource’s right-click menu will suffice. More importantly, new features added via a VR are only available while the VR is open. A Resource Helper is a form of Tool, as above, which can add new menu options to any existing resource type without requiring a VR.
A Resource Helper is defined in the same way as a Tool (by setting the tool = true feature of the @CreoleResource annotation and loaded via an autoinstance definition) but must also extend the gate.gui.ResourceHelper class. A Resource Helper can then return a set of actions for a given resource which will be added to its right-click menu. See the FastInfosetExporter resource in the “Format: FastInfoset” plugin for an example of how this works.
A Resource Helper may also make new API calls accessible, allowing similar functionality to be made available to GATE Embedded. See Section 7.19 for more details on how this works.
Chapter 5
Language Resources: Corpora, Documents and Annotations [#]
This chapter documents GATE’s model of corpora, documents and annotations on documents. Section 5.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 5.2, Section 5.3 and Section 5.4 describe corpora, documents and annotations on documents respectively. Section 5.5 describes GATE’s support for diverse document formats, and Section 5.5.2 describes facilities for XML input/output.
5.1 Features: Simple Attribute/Value Data [#]
GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java’s Map interface (part of the Collections API).
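Because the feature API is just the Map interface, ordinary Map operations are all that is needed to read and write features. The sketch below uses a plain HashMap as a stand-in for a GATE feature map so that it runs without GATE; the feature names and values are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

final class FeaturesSketch {
    /** Builds a small attribute/value map of the kind attached to an annotation. */
    static Map<String, Object> tokenFeatures() {
        Map<String, Object> features = new HashMap<>();
        features.put("pos", "NP");   // attribute names are strings
        features.put("length", 5);   // values can be any Java object
        return features;
    }

    public static void main(String[] args) {
        Map<String, Object> features = tokenFeatures();
        System.out.println(features.get("pos"));         // prints "NP"
        System.out.println(features.containsKey("kind")); // prints "false"
    }
}
```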
5.2 Corpora: Sets of Documents plus Features [#]
A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).
Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.
5.3 Documents: Content plus Annotations plus Features [#]
Documents are modelled as content plus annotations (see Section 5.4) plus features (see Section 5.1). The content of a document can be any subclass of DocumentContent.
5.4 Annotations: Directed Acyclic Graphs [#]
Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g. character offsets.
5.4.1 Annotation Schemas [#]
Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using GATE Developer to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)
Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.4.6 describes how to create new schemas. Note that each schema file defines a single annotation type; however, it is possible to use include definitions in a schema to refer to other schemas in order to load a whole set of schemas as a group. The default schemas for ANNIE annotation types (defined in resources/schema in the ANNIE plugin) give an example of this technique.
Date Schema
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Date-->
  <element name="Date">
    <complexType>
      <attribute name="kind" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="date"/>
            <enumeration value="time"/>
            <enumeration value="dateTime"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>
Person Schema
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Person-->
  <element name="Person" />
</schema>
Address Schema
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Address-->
  <element name="Address">
    <complexType>
      <attribute name="kind" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="email"/>
            <enumeration value="url"/>
            <enumeration value="phone"/>
            <enumeration value="ip"/>
            <enumeration value="street"/>
            <enumeration value="postcode"/>
            <enumeration value="country"/>
            <enumeration value="complete"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>
5.4.2 Examples of Annotated Documents [#]
This section shows some simple examples of annotated documents.
This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML community.
Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character (see TIPSTER Architecture Design Document).
Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.
The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.
Text
Cyndi savored the soup.
^0...^5...^10..^15..^20
Annotations
Id | Type | Span Start | Span End | Features |
1 | token | 0 | 5 | pos=NP |
2 | token | 6 | 13 | pos=VBD |
3 | token | 14 | 17 | pos=DT |
4 | token | 18 | 22 | pos=NN |
5 | token | 22 | 23 | |
6 | name | 0 | 5 | name_type=person |
7 | sentence | 0 | 23 | |
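To see how the start and end offsets in this table relate to the text, the following plain-Java sketch stores each row as a small object and recovers the covered text with substring. The Ann class is a hypothetical stand-in for a GATE annotation (it keeps offsets directly rather than Node objects, and features as a simple string), intended only to illustrate the span arithmetic.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanSketch {
    /** Hypothetical, simplified annotation: id, type, start/end offsets and a feature string. */
    static final class Ann {
        final int id;
        final String type;
        final int start;
        final int end;
        final String features;

        Ann(int id, String type, int start, int end, String features) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
            this.features = features;
        }
    }

    static final String TEXT = "Cyndi savored the soup.";

    /** The rows of the first example table. */
    static List<Ann> annotations() {
        List<Ann> anns = new ArrayList<>();
        anns.add(new Ann(1, "token", 0, 5, "pos=NP"));
        anns.add(new Ann(2, "token", 6, 13, "pos=VBD"));
        anns.add(new Ann(3, "token", 14, 17, "pos=DT"));
        anns.add(new Ann(4, "token", 18, 22, "pos=NN"));
        anns.add(new Ann(5, "token", 22, 23, ""));
        anns.add(new Ann(6, "name", 0, 5, "name_type=person"));
        anns.add(new Ann(7, "sentence", 0, 23, ""));
        return anns;
    }

    /** Returns the text covered by an annotation; the end offset is exclusive. */
    static String covered(Ann a) {
        return TEXT.substring(a.start, a.end);
    }

    public static void main(String[] args) {
        for (Ann a : annotations()) {
            System.out.println(a.id + " " + a.type + " \"" + covered(a) + "\" " + a.features);
        }
    }
}
```

Note that the name annotation (Id 6) and the first token (Id 1) cover exactly the same span, which is perfectly legal: annotations are independent arcs over the same text.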
Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.
Text
Cyndi savored the soup.
^0...^5...^10..^15..^20
Annotations
Id | Type | Span Start | Span End | Features |
1 | token | 0 | 5 | pos=NP |
2 | token | 6 | 13 | pos=VBD |
3 | token | 14 | 17 | pos=DT |
4 | token | 18 | 22 | pos=NN |
5 | token | 22 | 23 | |
6 | name | 0 | 5 | name_type=person |
7 | sentence | 0 | 23 | constituents=[1],[2],[3],[4],[5] |
In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of a feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents. At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 5.3).
Text
To: All Barnyard Animals
^0...^5...^10..^15..^20.
From: Chicken Little
^25..^30..^35..^40..
Date: November 10,1194
...^50..^55..^60..^65.
Subject: Descending Firmament
.^70..^75..^80..^85..^90..^95
Priority: Urgent
.^100.^105.^110.
The sky is falling. The sky is falling.
....^120.^125.^130.^135.^140.^145.^150.
Annotations
Id | Type | Span Start | Span End | Features |
1 | Addressee | 4 | 24 | |
2 | Source | 31 | 45 | |
3 | Date | 53 | 69 | ddmmyy=101194 |
4 | Subject | 78 | 98 | |
5 | Priority | 109 | 115 | |
6 | Body | 116 | 155 | |
7 | Sentence | 116 | 135 | |
8 | Sentence | 136 | 155 | |
If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields. Our final example (Table 5.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:
Text
Topster tackles 2 terrorbytes.
^0...^5...^10..^15..^20..^25..
Annotations
Id | Type | Span Start | Span End | Features |
1 | token | 0 | 7 | pos=NP correction=TIPSTER |
2 | token | 8 | 15 | pos=VBZ |
3 | token | 16 | 17 | pos=CD |
4 | token | 18 | 29 | pos=NNS correction=terabytes |
5 | token | 29 | 30 | |
5.4.3 Creating, Viewing and Editing Diverse Annotation Types [#]
Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved as inline (‘Save Preserving Format’ in the context menu).
To view and edit annotation types, see Section 3.4. To add annotations of a new type, see Section 3.4.5. To add a new annotation schema, see Section 3.4.6.
5.5 Document Formats [#]
The following document formats are supported by GATE by default:
-
Plain Text
-
HTML
-
SGML
-
XML
-
RTF
-
Email
-
PDF (some documents)
-
Microsoft Office (some formats)
-
OpenOffice (some formats)
-
UIMA CAS XML format
-
CoNLL/IOB
Additional formats are provided by plugins. You must load the relevant plugin before attempting to parse these document types:
-
Twitter JSON (in the Twitter plugin, see section 17.2)
-
GATE JSON (in the Format_JSON plugin, see section 23.30)
-
DataSift JSON, a common format for social media data from http://datasift.com (in the Format_DataSift plugin, see section 23.32)
-
FastInfoset, a compressed binary encoding of GATE XML (in the Format_FastInfoset plugin, see section 23.29)
-
MediaWiki markup, as used by Wikipedia and many other public wiki sites (in the Format_MediaWiki plugin, see section 23.28)
-
The formats used by PubMed and the Cochrane collaboration for biomedical literature (in the Format_PubMed plugin, see section 23.27)
-
CSV files containing one column of text data and optionally additional columns of metadata (in the Format_CSV plugin, see section 23.33)
By default GATE will try and identify the type of the document, then strip and convert any markup into GATE’s annotation format. To disable this process, set the markupAware parameter on the document to false.
When reading a document of one of these types, GATE extracts the text between tags (where such exist) and creates a GATE annotation filled as follows:
The name of the tag will constitute the annotation’s type, all the tag’s attributes will materialize in the annotation’s features and the annotation will span over the text covered by the tag. A few exceptions to this rule apply for the RTF, Email and Plain Text formats, which will be described later in the input section of these formats.
The text between tags is extracted and appended to the GATE document’s content and all annotations created from tags will be placed into a GATE annotation set named ‘Original markups’.
Example:
If the markup is like this:
<aTagName attrib1="value1" attrib2="value2" attrib3="value3"> A piece of text</aTagName>
then the annotation created by GATE will look like:
annotation.type = "aTagName";
annotation.fm = {attrib1=value1;attrib2=value2;attrib3=value3};
annotation.start = startNode;
annotation.end = endNode;
The startNode and endNode are created from offsets referring to the beginning and the end of ‘A piece of text’ in the document’s content.
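The unpacking described above can be imitated with a standard SAX handler: text is appended to a content buffer, and each element becomes an annotation whose offsets are the buffer lengths at its start and end tags. This is a simplified sketch using only the JDK's XML parser, not GATE's actual reader; the Ann class and all method names are hypothetical, and it ignores details such as the ‘Original markups’ annotation set.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class UnpackSketch extends DefaultHandler {
    /** Hypothetical, simplified annotation created from one element. */
    static final class Ann {
        final String type;
        final Map<String, String> features;
        final int start;
        int end;

        Ann(String type, Map<String, String> features, int start) {
            this.type = type;
            this.features = features;
            this.start = start;
        }
    }

    final StringBuilder content = new StringBuilder();
    final List<Ann> annotations = new ArrayList<>();
    private final Deque<Ann> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Tag name becomes the annotation type; attributes become features.
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            features.put(atts.getQName(i), atts.getValue(i));
        }
        open.push(new Ann(qName, features, content.length()));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text between tags is appended to the document content.
        content.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Ann ann = open.pop();
        ann.end = content.length();
        annotations.add(ann);
    }

    /** Parses a well-formed XML string and returns the handler holding content and annotations. */
    static UnpackSketch unpack(String xml) {
        try {
            UnpackSketch handler = new UnpackSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }
}
```

Applied to the markup shown above, this sketch produces a single annotation of type aTagName with three features, spanning the whole extracted text (including the leading space that follows the opening tag).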
The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is the ‘UTF-8’ encoding, which is usually also the most storage-efficient Unicode encoding for predominantly ASCII text. If, when loading a document in GATE, the encoding parameter is set to ‘’ (the empty string), then the default encoding of the platform will be used.
5.5.1 Detecting the Right Reader [#]
In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:
-
Document’s extension
-
The web server’s content type
-
Magic numbers detection
The first is the extension of the file (xml, htm, html, txt, sgm, rtf, etc.), the second is the HTTP information sent by a web server regarding the content type of the document being sent (text/html, text/xml, etc.), and the third is certain sequences of chars which are ultimately number sequences. GATE is capable of supporting multimedia documents, if the right reader is added to the framework. Sometimes, multimedia documents are identified by a signature consisting of a sequence of numbers. Inside GATE they are called magic numbers. For textual documents, certain char sequences form such magic numbers. Examples of magic number sequences will be provided in the Input section of each format supported by GATE.
All of these tests are applied to each document read, and then a voting mechanism decides which reader is best to associate with the document. The tests are ranked by priority. The document’s extension test has the highest priority: if the system is in doubt which reader to choose, then the one associated with the document’s extension will be selected. The next highest priority is given to the web server’s content type and the third to the magic numbers detection. However, any two tests that identify the same MIME type will take precedence in deciding the reader that will be used. The web server test is not always successful as there might be documents that are loaded from a local file system, and the magic number detection test is not always applicable. In the next paragraphs we will see how those tests are performed and what the general mechanism behind reader detection is.
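One way to read the voting rule above is sketched below: if any two tests agree, their MIME type wins; otherwise the answer falls back on the priority order extension, web server, magic numbers. This is an interpretation of the description written in plain Java with hypothetical names, not the code used inside gate.DocumentFormat.

```java
final class ReaderVoteSketch {
    /**
     * Each argument is the MIME type suggested by one test, or null if that
     * test could not be applied or gave no answer.
     */
    static String choose(String byExtension, String byServer, String byMagic) {
        // Any two tests that agree on a MIME type decide the outcome.
        if (byExtension != null
                && (byExtension.equals(byServer) || byExtension.equals(byMagic))) {
            return byExtension;
        }
        if (byServer != null && byServer.equals(byMagic)) {
            return byServer;
        }
        // No agreement: fall back on priority (extension, then server, then magic numbers).
        if (byExtension != null) {
            return byExtension;
        }
        if (byServer != null) {
            return byServer;
        }
        return byMagic; // may be null if no test produced an answer
    }
}
```

For example, a file named page.txt that is served as text/html and also contains an HTML signature would be treated as text/html under this reading, because two tests outvote the extension.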
The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps filled by the init() method of each reader. This method comes with three signatures:
static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, URL url)

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, String fileSuffix)

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, MimeType mimeType)
The first two methods try to detect the right MimeType for the GATE document, and after that, they call the third one to return the reader associated with a MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from ‘http://jigsaw.w3.org’ for mime types.
The magic numbers test is performed using the information from the magic2mimeTypeMap map. Each key from this map is searched for in the first bufferSize (the default value is 2048) chars of text. The method that does this is called runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class. More details about it can be found in the GATE API documentation.
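The magic numbers search can be pictured as a substring lookup restricted to the first bufferSize characters. The sketch below is a plain-Java approximation with hypothetical names; the map is supplied by the caller, and the real runMagicNumbers method works on an InputStreamReader and may differ in how it handles ties and encodings.

```java
import java.util.Map;

final class MagicNumberSketch {
    static final int BUFFER_SIZE = 2048; // default value mentioned in the text

    /**
     * Returns the MIME type of the first signature found within the first
     * BUFFER_SIZE chars of the text, or null if none of the signatures occurs there.
     */
    static String detect(String text, Map<String, String> magicToMimeType) {
        String head = text.substring(0, Math.min(text.length(), BUFFER_SIZE));
        for (Map.Entry<String, String> entry : magicToMimeType.entrySet()) {
            if (head.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

With a map containing the illustrative entry "<?xml" → "text/xml", a document starting with an XML declaration is reported as text/xml, while a signature that first appears beyond the 2048-character window is ignored.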
In order to activate a reader to perform the unpacking, the CREOLE definition of a GATE document defines a parameter called ‘markupAware’ initialized with a default value of true. This parameter forces GATE to detect a proper reader for the document being read. If no reader is found, the document’s content is loaded and presented to the user, just as in any other text editor (this applies to textual documents).
You can also use Tika format auto-detection by setting the mimeType of a document to "application/tika". Then the document will be parsed only by Tika.
The next subsections investigate the particularities of each format and describe the file extensions registered with each document format.
5.5.2 XML [#]
Input [#]
GATE permits the processing of any XML document and offers support for XML namespaces. It benefits from the power of Apache’s Xerces parser and also makes use of Sun’s JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property (‘javax.xml.parsers.SAXParserFactory’).
GATE will accept any well formed XML document as input. Although it has the possibility to validate XML documents against DTDs it does not do so because the validating procedure is time consuming and in many cases it issues messages that are annoying for the user.
There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we previously said, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases such as ‘end.</P><P>Start’ it might happen that the ending word of the previous annotation is concatenated with the beginning phrase of the annotation currently being created, resulting in a garbage input for GATE processing resources that operate at the text surface.
Let’s take another example in order to better understand the problem:
<title>This is a title</title><p>This is a paragraph</p><a href="#link">Here is an useful link</a>
When the markup is transformed to annotations, it is likely that the text from the document’s content will be as follows:
This is a titleThis is a paragraphHere is an useful link
The annotations created will refer to the right parts of the text, but for GATE processing resources (tokeniser, gazetteer, etc.) that work on this text, this will be a major disaster. Therefore, in order to prevent this problem, GATE checks whether words are likely to be joined and, if so, inserts a space between them. So the text will look like this after being loaded in GATE Developer:
This is a title This is a paragraph Here is an useful link
There are cases when these words are meant to be joined, but they are rare. This is why it’s an open problem. If you need to disable these spaces in GATE Developer, select Options, Configuration, and then the Advanced tab in the configuration dialog; untick the box beside Add space on markup unpack if needed. You can re-enable the spaces later if you wish. This option will persist between sessions if Save options on exit (in the same dialog) is turned on.
Programmatically, this can be controlled with the following code:
Gate.getUserConfig().put(GateConstants.DOCUMENT_ADD_SPACE_ON_UNPACK_FEATURE_NAME, enabled);
where enabled is a boolean or Boolean.
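The effect of this option can be illustrated without GATE. The following is a minimal sketch, not GATE's actual unpacking code: the class name MarkupUnpackSketch and its unpack method are hypothetical, tags are removed with a naive scan, and the rule "insert a space when two non-whitespace characters meet across a tag boundary" is only an assumption about what "likely to join words" means.

```java
final class MarkupUnpackSketch {
    /**
     * Strips tags from the markup and, if addSpace is true, inserts one space
     * wherever two non-whitespace characters would otherwise be joined across
     * a tag boundary (assumed heuristic, see the text above).
     */
    static String unpack(String markup, boolean addSpace) {
        StringBuilder out = new StringBuilder();
        boolean afterTag = false;
        int i = 0;
        while (i < markup.length()) {
            char c = markup.charAt(i);
            if (c == '<') {
                int close = markup.indexOf('>', i);
                if (close < 0) {
                    break; // unterminated tag: stop scanning
                }
                i = close + 1;
                afterTag = true;
                continue;
            }
            if (afterTag && addSpace && out.length() > 0
                    && !Character.isWhitespace(c)
                    && !Character.isWhitespace(out.charAt(out.length() - 1))) {
                out.append(' ');
            }
            afterTag = false;
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Calling unpack on the example markup above with addSpace set to true yields the spaced text, and with false yields the joined text. Inline markup inside a word would also receive a space, which is the rare case mentioned above where the words were meant to be joined.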
The extensions associated with the XML reader are:
-
xml
-
xhtm
-
xhtml
The web server content type associated with XML documents is: text/xml.
The magic numbers test searches inside the document for the XML (<?xml version="1.0") signature. It is also able to detect whether the XML document uses the semantics described in the GATE document format DTD (see the ‘Save as XML’ option below) or uses other semantics.
Namespace handling
By default, GATE will retain the namespace prefix and namespace URIs of XML elements when creating annotations and features within the Original markups annotation set. For example, the element
<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>
will create the following annotation
dc:title(xmlns:dc=http://purl.org/dc/elements/1.1/)
However, as the colon character ’:’ is a reserved meta-character in JAPE, it is not possible to write a JAPE rule that will match the dc:title element or its namespace URI.
If you need to match namespace-prefixed elements in the Original markups AS, you can alter the default namespace deserialization behaviour to remove the namespace prefix and add it as a feature (along with the namespace URI), by specifying the following attributes in the <GATECONFIG> element of gate.xml or local configuration file:
-
addNamespaceFeatures - set to "true" to deserialize namespace prefix and uri information as features.
-
namespaceURI - The feature name to use that will hold the namespace URI of the element, e.g. "namespace"
-
namespacePrefix - The feature name to use that will hold the namespace prefix of the element, e.g. "prefix"
i.e.
<GATECONFIG addNamespaceFeatures="true" namespaceURI="namespace" namespacePrefix="prefix" />
For example
<dc:title>Document title</dc:title>
would create in the Original markups AS (assuming the xmlns:dc URI has been defined in the document root or parent element)
title(prefix=dc, namespace=http://purl.org/dc/elements/1.1/)
If a JAPE rule is written to create a new annotation, e.g.
description(prefix=foo, namespace=http://www.example.org/)
then these would be serialized to
<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>
<foo:description xmlns:foo="http://www.example.org/">...</foo:description>
when using the ‘Save preserving document format’ XML output option (see ‘The Preserve Format Option’ below).
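The mapping from a prefixed element name to an annotation type plus namespace features can be sketched as follows. This is an illustration only, not GATE's deserialization code: the class NamespaceFeatureSketch and both methods are hypothetical, and the feature names are passed in to mirror the namespacePrefix and namespaceURI configuration attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NamespaceFeatureSketch {
    /** Returns the name without its prefix, e.g. "title" for "dc:title". */
    static String localName(String qName) {
        int colon = qName.indexOf(':');
        return colon < 0 ? qName : qName.substring(colon + 1);
    }

    /**
     * Builds the namespace features for a qualified name. The map is empty
     * when the name carries no prefix.
     */
    static Map<String, String> namespaceFeatures(String qName, String uri,
            String prefixFeature, String uriFeature) {
        Map<String, String> features = new LinkedHashMap<>();
        int colon = qName.indexOf(':');
        if (colon > 0) {
            features.put(prefixFeature, qName.substring(0, colon));
            features.put(uriFeature, uri);
        }
        return features;
    }
}
```

With the configuration shown above, localName("dc:title") gives the annotation type title, and namespaceFeatures("dc:title", "http://purl.org/dc/elements/1.1/", "prefix", "namespace") gives the two features, so a JAPE rule no longer needs to mention the colon.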
Output
GATE is capable of ensuring persistence for its resources. The types of persistent storage used for Language Resources are:
-
Java serialization;
-
XML serialization.
We describe the latter case here.
XML persistence doesn’t necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be objects of any kind, with various layers of nesting (for example, lists containing lists containing maps). Serializing these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost (the types will be converted into Strings). GATE provides full serialization for certain types of features: strings, numbers, and collections containing only strings or numbers. All other features are serialized using their string representation and, when read back, will be strings rather than the original objects. The consequences of this might be observed when performing evaluations (see Chapter 10).
When GATE outputs an XML document it may do so in one of two ways:
-
When the original document that was imported into GATE was an XML document, GATE can dump that document back into XML (possibly with additional markup added);
-
For all document formats, GATE can dump its internal representation of the document into XML.
In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-specific one which can be read back by the system to recreate all the information that GATE held internally for the document.
In order to understand why there are two types of XML serialization, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, it is sometimes impossible to save a document as XML using tags that surround the text referred to by each annotation, because crossover tags could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.
The problem of crossover tags appears with the preserve-format option, which is implemented at the cost of losing certain annotations. GATE tries to restore the original markup and, where possible, to add the annotations it has produced in the same manner.
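To make the crossover condition concrete, here is a small sketch of a test for two annotation spans. It is an assumption-labelled illustration rather than GATE's own check: the class CrossoverSketch and the method crosses are hypothetical, and spans are given as plain start and end offsets.

```java
final class CrossoverSketch {
    /**
     * Two spans cross when they overlap but neither contains the other,
     * so they cannot both be written as properly nested XML elements.
     */
    static boolean crosses(long start1, long end1, long start2, long end2) {
        boolean firstStraddlesStartOfSecond =
                start1 < start2 && start2 < end1 && end1 < end2;
        boolean secondStraddlesStartOfFirst =
                start2 < start1 && start1 < end2 && end2 < end1;
        return firstStraddlesStartOfSecond || secondStraddlesStartOfFirst;
    }
}
```

For example, spans 0 to 10 and 5 to 15 cross, whereas 0 to 10 and 2 to 5 nest, and 0 to 5 and 5 to 10 are merely adjacent.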
How to Access and Use the Two Forms of XML Serialization
Save as XML Option
This option is available in GATE Developer in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling ‘Save as XML’ on each document of the corpus. This option saves all the annotations of a document together with their features (applying the restrictions previously discussed), using the GateDocument.dtd:
<!ELEMENT GateDocument (GateDocumentFeatures, TextWithNodes, (AnnotationSet+))>
<!ELEMENT GateDocumentFeatures (Feature+)>
<!ELEMENT Feature (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>
<!ELEMENT TextWithNodes (#PCDATA | Node)*>
<!ELEMENT AnnotationSet (Annotation*)>
<!ATTLIST AnnotationSet Name CDATA #IMPLIED>
<!ELEMENT Annotation (Feature*)>
<!ATTLIST Annotation Type CDATA #REQUIRED
                     StartNode CDATA #REQUIRED
                     EndNode CDATA #REQUIRED>
<!ELEMENT Node EMPTY>
<!ATTLIST Node id CDATA #REQUIRED>
The document is saved under a name chosen by the user and may have any extension; however, the recommended extension is ‘xml’.
Using GATE Embedded, this option is available by calling gate.Document’s toXml() method. This method returns a string which is the XML representation of the document on which the method was called.
Note: It is recommended that the string representation be saved to the file system using the UTF-8 encoding, as the first line of the string is: <?xml version="1.0" encoding="UTF-8"?>
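The note above can be followed with the standard Java library alone. This is a minimal sketch under the assumption that the XML string has already been obtained (for instance from toXml()); the class XmlSaveSketch and the method saveUtf8 are hypothetical helper names, not part of the GATE API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class XmlSaveSketch {
    /**
     * Writes the XML string with an explicit UTF-8 encoding, so the bytes on
     * disk agree with the encoding named in the XML declaration instead of
     * depending on the platform default.
     */
    static void saveUtf8(String xml, Path target) throws IOException {
        Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
    }
}
```

Naming the charset explicitly is the point of the sketch: writing through a default-encoding writer could produce a file whose bytes contradict its own declaration.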
Example of such a GATE format document:
<?xml version="1.0" encoding="UTF-8" ?>
<GateDocument>
<!-- The document’s features-->
<GateDocumentFeatures>
  <Feature>
    <Name className="java.lang.String">MimeType</Name>
    <Value className="java.lang.String">text/plain</Value>
  </Feature>
  <Feature>
    <Name className="java.lang.String">gate.SourceURL</Name>
    <Value className="java.lang.String">file:/G:/tmp/example.txt</Value>
  </Feature>
</GateDocumentFeatures>
<!-- The document content area with serialized nodes -->
<TextWithNodes>
<Node id="0"/>A TEENAGER <Node id="11"/>yesterday<Node id="20"/> accused his parents of cruelty by feeding him a daily diet of chips which sent his weight ballooning to 22st at the age of l2<Node id="146"/>.<Node id="147"/>
</TextWithNodes>
<!-- The default annotation set -->
<AnnotationSet>
  <Annotation Type="Date" StartNode="11" EndNode="20">
    <Feature>
      <Name className="java.lang.String">rule2</Name>
      <Value className="java.lang.String">DateOnlyFinal</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">rule1</Name>
      <Value className="java.lang.String">GazDateWords</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">kind</Name>
      <Value className="java.lang.String">date</Value>
    </Feature>
  </Annotation>
  <Annotation Type="Sentence" StartNode="0" EndNode="147">
  </Annotation>
  <Annotation Type="Split" StartNode="146" EndNode="147">
    <Feature>
      <Name className="java.lang.String">kind</Name>
      <Value className="java.lang.String">internal</Value>
    </Feature>
  </Annotation>
  <Annotation Type="Lookup" StartNode="11" EndNode="20">
    <Feature>
      <Name className="java.lang.String">majorType</Name>
      <Value className="java.lang.String">date_key</Value>
    </Feature>
  </Annotation>
</AnnotationSet>
<!-- Named annotation set -->
<AnnotationSet Name="Original markups" >
  <Annotation Type="paragraph" StartNode="0" EndNode="147">
  </Annotation>
</AnnotationSet>
</GateDocument>
Note: All features that are neither numbers or strings nor collections containing numbers or strings are discarded. With this option, GATE does not preserve features that it cannot restore.
The Preserve Format Option
This option is available in GATE Developer from the popup menu of the annotations table. If no annotation in this table is selected, then the option will restore the document’s original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossover restriction, that annotation is discarded and a message is issued.
This option makes it possible to generate an XML document with tags surrounding the text referred to by each annotation, with features saved as attributes. All features which are collections, strings or numbers are saved, and the others are discarded. However, when the document is read back, only the attributes under the GATE namespace (see below) are reconstructed as typed values. This is because GATE does not store in the XML document information about the class of each feature or, for collections, the class of the items. So, when read back, all features will become strings, except those under the GATE namespace.
One will notice that all generated tags have an attribute called ‘gateId’ under the namespace ‘http://www.gate.ac.uk’. The attribute is used when the document is read back into GATE, in order to restore the annotation’s old ID. This is needed because it works in close cooperation with another attribute under the same namespace, called ‘matches’, which indicates annotations/tags that refer to the same entity. They are under this namespace because GATE is sensitive to them and treats them differently from all other elements and attributes, which fall under the general reading algorithm described at the beginning of this section.
The ‘gateId’ attribute under the GATE namespace is used to create an annotation whose ID is the value indicated by this attribute. The ‘matches’ attribute is used to create an ArrayList whose items are Integers representing the IDs of the annotations that the current one matches.
Example:
If the text being processed is as follows:
<Person gate:gateId="23">John</Person> and <Person gate:gateId="25" gate:matches="23;25;30">John Major</Person> are the same person.
What GATE does when it parses this text is it creates two annotations:
a1.type = "Person"
a1.ID = Integer(23)
a1.start = <the start offset of John>
a1.end = <the end offset of John>
a1.featureMap = {}

a2.type = "Person"
a2.ID = Integer(25)
a2.start = <the start offset of John Major>
a2.end = <the end offset of John Major>
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}
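The conversion of a ‘matches’ attribute value into a list of Integer IDs, as in a2 above, might look like the following. This is a sketch only: MatchesAttributeSketch and parseMatches are hypothetical names, and the semicolon separator is taken from the example rather than from GATE's source.

```java
import java.util.ArrayList;
import java.util.List;

final class MatchesAttributeSketch {
    /** Splits a value such as "23;25;30" into the annotation IDs it lists. */
    static List<Integer> parseMatches(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Integer.valueOf(trimmed));
            }
        }
        return ids;
    }
}
```

A non-numeric item would raise a NumberFormatException here; how GATE itself treats malformed values is not covered by this sketch.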
Under GATE Embedded, this option is available by calling gate.Document’s toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction, and for those found to violate it, a warning will be issued and they will be discarded.
In the next subsections we will show how this option applies to the other formats supported by GATE.
5.5.3 HTML
Input
HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE the same way as the XML documents.
The extensions associated with the HTML reader are:
-
htm
-
html
The web server content type associated with HTML documents is: text/html.
The magic numbers test searches inside the document for the HTML (<html) signature. Some HTML documents do not contain the HTML tag, so the magic numbers test might fail for them.
There is a certain degree of customization for HTML documents, in that GATE introduces new lines into the document’s text content in order to obtain a readable form. The annotations will refer to the pieces of text as described in the original document, but there will be a few extra newline characters inserted.
After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new line (NL) char into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE will introduce one NL at the beginning of the paragraph and one at the end of the paragraph. All newly added NLs are not considered to be part of the text contained by the tag.
Output
The ‘Save as XML’ option works in exactly the same way for all GATE documents, so there is nothing particular to note for the HTML format.
When attempting to preserve the original markup formatting, GATE will generate the document in XHTML. After being processed by GATE, the HTML document will look the same in any browser, but it will be in a different syntax.
5.5.4 SGML
Input
The SGML support in GATE is fairly light, as there is no freely available Java SGML parser. GATE uses a lightweight converter that attempts to transform the input SGML file into well-formed XML. Because it does not make use of a DTD, the conversion might not always be good. It is advisable to perform an SGML-to-XML conversion outside the system (using a specialized tool) before using the SGML document inside GATE.
The extensions associated with the SGML reader are:
-
sgm
-
sgml
The web server content type associated with SGML documents is: text/sgml.
There is no magic numbers test for SGML.
Output
When attempting to preserve the original markup formatting, GATE will generate the document as XML, because the real input of an SGML document inside GATE is an XML one.
5.5.5 Plain text
Input
When reading a plain text document, GATE attempts to detect its paragraphs and add ‘paragraph’ annotations to the document’s ‘Original markups’ annotation set. It does that by detecting two consecutive NLs. The procedure works for both UNIX-like and DOS-like text files.
Example:
If the plain text read is as follows:
Paragraph 1. This text belongs to the first paragraph.

Paragraph 2. This text belongs to the second paragraph
then two ‘paragraph’ type annotations will be created in the ‘Original markups’ annotation set (referring to the first and second paragraphs), each with an empty feature map.
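Paragraph detection of this kind can be sketched with a regular expression over the text. The code below is an illustration under stated assumptions, not GATE's reader: ParagraphSketch and paragraphs are hypothetical names, a paragraph break is taken to be two or more consecutive line ends, and both "\n" (UNIX) and "\r\n" (DOS) line ends are accepted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphSketch {
    // Two or more consecutive line ends, each either "\n" or "\r\n".
    private static final Pattern BREAK = Pattern.compile("(?:\\r?\\n){2,}");

    /** Returns {start, end} character offsets for each detected paragraph. */
    static List<int[]> paragraphs(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                spans.add(new int[] {start, m.start()});
            }
            start = m.end();
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }
}
```

Each returned pair corresponds to the span that a ‘paragraph’ annotation would cover; a single trailing newline, if present, stays inside the last span in this simplified version.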
The extensions associated with the plain text reader are:
-
txt
-
text
The web server content type associated with plain text documents is: text/plain.
There is no magic numbers test for plain text.
Output
When attempting to preserve the original markup formatting, GATE will dump XML markup that surrounds the text referred to.
The procedure described above applies to both plain text and RTF documents.
5.5.6 RTF
Input
RTF documents are accessed using Java’s RTF editor kit. It extracts only the document’s text content from the RTF document.
The extension associated with the RTF reader is ‘rtf’.
The web server content type associated with RTF documents is: text/rtf.
The magic numbers test searches for {\rtf1.
Output
Same as the plain text output.
5.5.7 Email
Input
GATE is able to read email messages packed in one document (UNIX mailbox format). It detects multiple messages inside such documents, and for each message it creates annotations for all the fields composing an e-mail, such as date, from, to, subject, etc. The message’s body is analyzed and paragraph detection is performed (just as in the plain text case). All annotations created have as their type the name of the e-mail field, and they are placed in the Original markups annotation set.
Example:
From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000
Date: Wed, 6 Sep2000 10:35:49 +0100 (BST)
From: forename1 surname2 <someone1@yyy.yyy.xxx>
To: forename2 surname2 <someone2@ddd.dddd.dd.dd>
Subject: A subject
Message-ID: <Pine.SOL.3.91.1000906103251.26010A-100000@servername>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

This text belongs to the e-mail body....

This is a paragraph in the body of the e-mail

This is another paragraph.
GATE attempts to detect lines such as ‘From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000’ in the e-mail text. Those lines separate e-mail messages contained in one file. After that, for each field in the e-mail message annotations are created as follows:
The annotation type will be the name of the field, the feature map will be empty, and the annotation will span from the end of the field name (after the colon) to the end of the line containing the e-mail field.
Example:
a1.type = "date"
a1 spans between the two ^ ^.
Date:^ Wed, 6Sep2000 10:35:49 +0100 (BST)^

a2.type = "from"
a2 spans between the two ^ ^.
From:^ forename1 surname2 <someone1@yyy.yyy.xxx>^
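The span rule illustrated above can be sketched for the header block of a single message. This is not GATE's email reader: EmailFieldSketch, Field and headerFields are hypothetical names, the header-name pattern is an assumption, only "\n" line ends are handled, and lower-casing the field name is inferred from the example types "date" and "from".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailFieldSketch {
    /** One header field: annotation type plus character offsets of its value. */
    static final class Field {
        final String type;
        final int start;
        final int end;

        Field(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed shape of a header name: letters, digits and hyphens, then a colon.
    private static final Pattern HEADER = Pattern.compile("([A-Za-z][A-Za-z0-9-]*):");

    /** Scans lines up to the first blank line and records "Name: value" fields. */
    static List<Field> headerFields(String message) {
        List<Field> fields = new ArrayList<>();
        int lineStart = 0;
        while (lineStart < message.length()) {
            int nl = message.indexOf('\n', lineStart);
            int lineEnd = (nl < 0) ? message.length() : nl;
            String line = message.substring(lineStart, lineEnd);
            if (line.trim().isEmpty()) {
                break; // a blank line ends the headers
            }
            Matcher m = HEADER.matcher(line);
            if (m.lookingAt()) {
                String type = m.group(1).toLowerCase(Locale.ROOT);
                // The span runs from just after the colon to the end of the line.
                fields.add(new Field(type, lineStart + m.end(), lineEnd));
            }
            lineStart = lineEnd + 1;
        }
        return fields;
    }
}
```

The mailbox separator line ("From someone@... 2000") is skipped because "From" is followed by a space rather than a colon, so it is not mistaken for a field.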
The extensions associated with the email reader are:
-
eml
-
email
-
mail
The web server content type associated with email documents is: text/email.
The magic numbers test searches for keywords such as Subject:, etc.
Output
Same as plain text output.
5.5.8 PDF Files and Office Documents
GATE uses the Apache Tika library to provide support for PDF documents and a number of the document formats from both Microsoft Office and OpenOffice. In essence, Tika converts the document structure into HTML, which is then used to create a GATE document. This means that although a PDF or Word document may have been loaded, the “Original markups” set will contain HTML elements. One advantage of this approach is that processing resources and JAPE grammars designed for use with HTML files should also work well with PDF and Office documents.
5.5.9 UIMA CAS Documents
GATE can read UIMA CAS documents. CAS stands for Common Analysis Structure. It provides a common representation of the artifact being analyzed, here a text.
The subject of analysis (SOFA), here a string, is used as the document content. Multiple SOFAs are concatenated. Analysis results or metadata are added as annotations when they have begin and end offsets, and otherwise as document features. The views are added as GATE annotation sets. The type system (a hierarchical annotation schema) is not currently supported.
The web server content type associated with UIMA documents is: text/xmi+xml.
The extensions are: xcas, xmicas, xmi.
The magic numbers are:
<CAS version="2">
and
xmlns:cas=
5.5.10 CoNLL/IOB Documents
GATE can read files of text annotated in the traditional CoNLL or BIO/BILOU format, typically used to represent POS tags and chunks and best known from the Conference on Natural Language Learning tasks. The following example illustrates one sentence with POS and chunk tags (B- and I- indicate the beginning and continuation, respectively, of a chunk); the columns represent the tokens, the POS tags, and the chunk tags, and sentences are separated by blank lines.
My     PRP$  B-NP
dog    NN    I-NP
has    VBZ   B-VP
fleas  NNS   B-NP
.      .     O
GATE interprets this format quite flexibly: the columns can be separated by any whitespace sequence, and the number of columns can vary. The strings from the leftmost column become strings in the document content, with spaces interposed, and Token and SpaceToken annotations (with string and length features) are created appropriately in the Original markups set.
Each blank line (empty or containing only whitespace) in the original data becomes a newline in the document content.
The tags in subsequent columns are transformed into annotations. A chunk tag (beginning with B- and followed by zero or more matching I- tags) produces an annotation whose type is determined by the rest of the tag (NP or VP in the above example, but any string with no whitespace is acceptable), with a kind = chunk feature. A chunk tag beginning with L- (last) terminates the chunk, and a U- (unigram) tag produces a chunk annotation over one token. Other tags produce annotations with the tag name as the type and a kind = token feature.
Every annotation derived from a tag has a column feature whose int value indicates the source column in the data (numbered from 0 for the string column). An “O” tag closes all open chunk tags at the end of the previous token.
This document format is associated with MIME-type text/x-conll and filename extensions .conll and .iob.
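The chunk-tag rules described above (B-, I-, L-, U- and O) can be sketched as a small state machine over one tag column. This is an illustration, not GATE's CoNLL reader: ConllChunkSketch and chunks are hypothetical names, spans are reported as token indices rather than character offsets, and stray I- or L- tags with no matching open chunk are simply ignored, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ConllChunkSketch {
    /** Returns chunks as "TYPE[firstToken,endToken)" with endToken exclusive. */
    static List<String> chunks(List<String> tags) {
        List<String> out = new ArrayList<>();
        String openType = null;
        int openStart = -1;
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            // Tags shaped like "B-NP" carry a kind letter; anything else counts as "O".
            char kind = tag.length() > 1 && tag.charAt(1) == '-' ? tag.charAt(0) : 'O';
            String type = kind == 'O' ? null : tag.substring(2);
            boolean continues = (kind == 'I' || kind == 'L') && type.equals(openType);
            if (!continues && openType != null) {
                // Any non-continuing tag closes the open chunk at the previous token.
                out.add(span(openType, openStart, i));
                openType = null;
            }
            if (kind == 'B') {
                openType = type;
                openStart = i;
            } else if (kind == 'U') {
                out.add(span(type, i, i + 1));
            } else if (kind == 'L' && continues) {
                out.add(span(openType, openStart, i + 1));
                openType = null;
            }
        }
        if (openType != null) {
            out.add(span(openType, openStart, tags.size()));
        }
        return out;
    }

    private static String span(String type, int start, int end) {
        return type + "[" + start + "," + end + ")";
    }
}
```

Applied to the chunk column of the example sentence (B-NP, I-NP, B-VP, B-NP, O), the sketch reports an NP over tokens 0 and 1, a VP over token 2, and an NP over token 3.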
5.6 XML Input/Output
Support for input from and output to XML is described in Section 5.5.2. In short:
-
GATE will read any well-formed XML document (it does not attempt to validate XML documents). Markup will by default be converted into native GATE format.
-
GATE will write back into XML in one of two ways:
-
Preserving the original format and adding selected markup (for example to add the results of some language analysis process to the document).
-
In GATE’s own XML serialisation format, which encodes all the data in a GATE Document (as far as this is possible within a tree-structured paradigm – for 100% non-lossy data storage use GATE’s RDBMS or binary serialisation facilities – see Section 4.5).
-
When using GATE Embedded, object representations of XML documents such as DOM or JDOM, or query and transformation languages such as XPath or XSLT, may be used in parallel with GATE’s own Document representation (gate.Document) without conflicts.
Chapter 6
ANNIE: a Nearly-New Information Extraction System
GATE was originally developed in the context of Information Extraction (IE) R&D, and IE systems in many languages and shapes and sizes have been created using GATE with the IE components that have been distributed with it (see [Maynard et al. 00] for descriptions of some of these projects).
GATE is distributed with an IE system called ANNIE, A Nearly-New IE system (developed by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin Dimitrov and others). ANNIE relies on finite state algorithms and the JAPE language (see Chapter 8).
ANNIE components form a pipeline, which is shown in Figure 6.1.
ANNIE components are included with GATE (though the linguistic resources they rely on are generally simpler than the ones we use in-house). The rest of this chapter describes these components.
For the GATE Cloud version of ANNIE, see:
https://cloud.gate.ac.uk/shopfront/displayItem/annie-named-entity-recognizer
6.1 Document Reset
The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the one containing the document format analysis (Original markups). An optional parameter, keepOriginalMarkupsAS, allows users to decide whether to keep the Original markups AS when resetting the document. The parameter annotationTypes can be used to specify a list of annotation types to remove from all the sets, instead of removing whole sets.
Alternatively, if the parameter setsToRemove is not empty, the other parameters except annotationTypes are ignored and only the annotation sets specified in this list will be removed. If annotationTypes is also specified, only those annotation types in the specified sets are removed. In order to specify that you want to reset the default annotation set, just click the "Add" button without entering a name – this will add <null> which denotes the default annotation set. This resource is normally added to the beginning of an application, so that a document is reset before an application is rerun on that document.
6.2 Tokeniser
The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable.
6.2.1 Tokeniser Rules
A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the AnnotationSet. The LHS is separated from the RHS by ‘>’. The following operators can be used on the LHS:
| (or)
* (0 or more occurrences)
? (0 or 1 occurrences)
+ (1 or more occurrences)
The RHS uses ‘;’ as a separator, and has the following format:
{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).
The following tokeniser rule is for a word beginning with a single capital letter:
‘UPPERCASE_LETTER’ ‘LOWERCASE_LETTER’* > Token;orth=upperInitial;kind=word;
It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type ‘Token’. The attribute ‘orth’ (orthography) has the value ‘upperInitial’; the attribute ‘kind’ has the value ‘word’.
6.2.2 Token Types
In the default set of rules, the following kinds of Token and SpaceToken are possible:
Word
A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute ‘orth’, for which four values are defined:
-
upperInitial - initial letter is uppercase, rest are lowercase
-
allCaps - all uppercase letters
-
lowerCase - all lowercase letters
-
mixedCaps - any mixture of upper and lowercase letters not included in the above categories
Number
A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.
Symbol
Two types of symbol are defined: currency symbol (e.g. ‘$’, ‘£’) and symbol (e.g. ‘&’). These are represented by any number of consecutive currency or other symbols (respectively).
Punctuation
Three types of punctuation are defined: start_punctuation (e.g. ‘(’), end_punctuation (e.g. ‘)’), and other punctuation (e.g. ‘:’). Each punctuation symbol is a separate token.
SpaceToken
White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogeneous) set of space or control characters is defined as a SpaceToken.
The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.
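The four ‘orth’ values defined for words above can be reproduced with a few character tests. The following is a sketch only, not the tokeniser's rule engine: OrthSketch and orth are hypothetical names, and the input is assumed to be a non-empty word consisting solely of letters (hyphens are not treated specially here).

```java
final class OrthSketch {
    /** Classifies a word into upperInitial, allCaps, lowerCase or mixedCaps. */
    static String orth(String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        boolean firstUpper = Character.isUpperCase(word.charAt(0));
        boolean restLower = word.substring(1).chars().allMatch(Character::isLowerCase);
        // Checked first, so that a single capital letter counts as upperInitial,
        // in line with the rule "uppercase letter, then zero or more lowercase".
        if (firstUpper && restLower) {
            return "upperInitial";
        }
        if (word.chars().allMatch(Character::isUpperCase)) {
            return "allCaps";
        }
        if (word.chars().allMatch(Character::isLowerCase)) {
            return "lowerCase";
        }
        return "mixedCaps";
    }
}
```

For instance, "John" is classified as upperInitial, "IBM" as allCaps, "dog" as lowerCase and "McDonald" as mixedCaps.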
6.2.3 English Tokeniser
The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer (see Chapter 8). The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like “ ’30s”, “ ’Cause”, “ ’em”, “ ’N”, “ ’S”, “ ’s”, “ ’T”, “ ’d”, “ ’ll”, “ ’m”, “ ’re”, “ ’til”, “ ’ve”, etc. Another task of the JAPE transducer is to convert negative constructs like “don’t” from three tokens (“don”, “’” and “t”) into two tokens (“do” and “n’t”).
The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.
6.3 Gazetteer
The role of the gazetteer is to identify entity names in the text based on lists. The ANNIE gazetteer is described here, and also covered in Chapter 13 in Section 13.2.
The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organisations, days of the week, etc.
Below is a small section of the list for units of currency:
Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars
An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. It is also possible to include a language in the same way (fourth column), where lists for different languages are used, though ANNIE is only concerned with monolingual recognition. By default, the Gazetteer PR creates a Lookup annotation for every gazetteer entry it finds in the text. One can also specify an annotation type (fifth column) specific to an individual list. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type.
These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.
currency_prefix.lst:currency_unit:pre_amount
currency_unit.lst:currency_unit:post_amount
date.lst:date:specific
day.lst:date:day
So, for example, if a specific day needs to be identified, the minor type ‘day’ should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified, the major type ‘date’ should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.
In addition, the gazetteer allows arbitrary feature values to be associated with particular entries in a single list. ANNIE does not use this capability, but to enable it for your own gazetteers, set the optional gazetteerFeatureSeparator parameter to a single character (or an escape sequence such as \t or \uNNNN) when creating a gazetteer. In this mode, each line in a .lst file can have feature values specified, for example, with the following entry in the index file:
software_company.lst:company:software
the following software_company.lst:
Red Hat&stockSymbol=RHAT
Apple Computer&abbrev=Apple&stockSymbol=AAPL
Microsoft&abbrev=MS&stockSymbol=MSFT
and gazetteerFeatureSeparator set to &, the gazetteer will annotate Red Hat as a Lookup with features majorType=company, minorType=software and stockSymbol=RHAT. Note that you do not have to provide the same features for every line in the file, in particular it is possible to provide extra features for some lines in the list but not others.
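How an index-file line and a list entry combine into the features of a Lookup annotation can be sketched as follows. This is not the gazetteer's own loader: GazetteerEntrySketch and its methods are hypothetical, only the first three index columns (list name, major type, minor type) are read, and the separator is passed in to play the role of gazetteerFeatureSeparator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class GazetteerEntrySketch {
    /** The text to be matched: everything before the first separator. */
    static String entryText(String entryLine, String separator) {
        int i = entryLine.indexOf(separator);
        return i < 0 ? entryLine : entryLine.substring(0, i);
    }

    /** Merges majorType/minorType from the index line with per-entry features. */
    static Map<String, String> features(String indexLine, String entryLine, String separator) {
        Map<String, String> features = new LinkedHashMap<>();
        String[] columns = indexLine.split(":");
        if (columns.length > 1) {
            features.put("majorType", columns[1]);
        }
        if (columns.length > 2 && !columns[2].isEmpty()) {
            features.put("minorType", columns[2]);
        }
        // Pattern.quote keeps separators such as "|" from being read as regex syntax.
        String[] parts = entryLine.split(Pattern.quote(separator));
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                features.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        return features;
    }
}
```

Lines without extra features simply yield the major and minor types, which matches the remark above that features need not be provided for every line.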
Here is a full list of the parameters used by the Default Gazetteer:
Init-time parameters
-
listsURL
-
A URL pointing to the index file (usually lists.def) that contains the list of pattern lists.
-
encoding
-
The character encoding to be used while reading the pattern lists.
-
gazetteerFeatureSeparator
-
The character used to add arbitrary features to gazetteer entries. See above for an example.
-
caseSensitive
-
Whether the gazetteer should be case sensitive during matching.
Run-time parameters
-
document
-
The document to be processed.
-
annotationSetName
-
The name of the annotation set in which the resulting Lookup annotations will be created.
-
wholeWordsOnly
-
Should the gazetteer only match whole words? If set to true, a string segment in the input document will only be matched if it is bordered by characters that are not letters, non spacing marks, or combining spacing marks (as identified by the Unicode standard).
-
longestMatchOnly
-
Whether the gazetteer should match only the longest possible string starting from any position. This parameter is only relevant when the list of lookups contains proper prefixes of other entries (e.g. when both ‘Dell’ and ‘Dell Europe’ are in the lists). The default behaviour (when this parameter is set to true) is to match only the longest entry, ‘Dell Europe’ in this example. This has been the default GATE gazetteer behaviour since version 2.0. Setting this parameter to false will cause the gazetteer to match all possible prefixes.
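The effect of longestMatchOnly can be illustrated with a small stand-alone sketch. It uses plain string comparison instead of the finite state machines that the gazetteer compiles, and it ignores wholeWordsOnly and case sensitivity; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrates longestMatchOnly with plain string matching; not GATE's FSM implementation. */
final class LongestMatchSketch {

    /** Returns the entries that match the text starting at position pos. */
    static List<String> matchesAt(String text, int pos, List<String> entries,
                                  boolean longestMatchOnly) {
        List<String> found = new ArrayList<>();
        for (String entry : entries) {
            if (text.startsWith(entry, pos)) {
                found.add(entry);
            }
        }
        if (longestMatchOnly && found.size() > 1) {
            // Keep only the longest of the competing entries.
            String longest = found.get(0);
            for (String candidate : found) {
                if (candidate.length() > longest.length()) {
                    longest = candidate;
                }
            }
            found = new ArrayList<>(Arrays.asList(longest));
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("Dell", "Dell Europe");
        System.out.println(matchesAt("Dell Europe opened an office", 0, entries, true));
        System.out.println(matchesAt("Dell Europe opened an office", 0, entries, false));
    }
}
```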
6.4 Sentence Splitter [#]
The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
Each sentence is annotated with the type ‘Sentence’. Each sentence break (such as a full stop) is also given a ‘Split’ annotation. It has a feature ‘kind’ with two possible values: ‘internal’ for any combination of exclamation and question mark or one to four dots and ‘external’ for a newline.
The sentence splitter is domain and application-independent.
There is an alternative ruleset for the Sentence Splitter which considers newlines and carriage returns differently. In general this version should be used when a new line on the page indicates a new sentence. To use this alternative version, simply load main-single-nl.jape from the default location instead of main.jape (the default file) when asked to select the location of the grammar file to be used.
6.5 RegEx Sentence Splitter [#]
The RegEx sentence splitter is an alternative to the standard ANNIE Sentence Splitter. Its main aim is to address some performance issues identified in the JAPE-based splitter, mainly to do with improving the execution time and robustness, especially when faced with irregular input.
As its name suggests, the RegEx splitter is based on regular expressions, using the default Java implementation.
The new splitter is configured by three files containing (Java style, see http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html) regular expressions, one regex per line. The three different files encode patterns for:
-
internal splits
-
sentence splits that are part of the sentence, such as sentence ending punctuation;
-
external splits
-
sentence splits that are NOT part of the sentence, such as 2 consecutive new lines;
-
non splits
-
text fragments that might be seen as splits but should be ignored (such as full stops occurring inside abbreviations).
The new splitter comes with an initial set of patterns that try to emulate the behaviour of the original splitter (apart from the situations where the original one was obviously wrong, like not allowing sentences to start with a number).
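The interplay of the three kinds of pattern can be sketched with java.util.regex. The patterns below are hypothetical stand-ins chosen for illustration (the splitter ships with its own pattern files), and the algorithm is a simplified reading of the description above: internal splits stay inside the sentence, external splits are dropped, and non-split matches suppress any internal split they overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified illustration of internal, external and non-split patterns; not GATE's code. */
final class RegexSplitSketch {
    // Hypothetical patterns for illustration only.
    static final Pattern SPLITS =
            Pattern.compile("(?<internal>[.!?]+)|(?<external>\\n\\n+)");
    static final Pattern NON_SPLITS = Pattern.compile("\\b(?:Dr|Mr|Mrs)\\.");

    static List<String> split(String text) {
        // Mark characters covered by a non-split match so they never end a sentence.
        boolean[] shielded = new boolean[text.length()];
        Matcher non = NON_SPLITS.matcher(text);
        while (non.find()) {
            for (int i = non.start(); i < non.end(); i++) {
                shielded[i] = true;
            }
        }

        List<String> sentences = new ArrayList<>();
        int start = 0;
        Matcher m = SPLITS.matcher(text);
        while (m.find()) {
            boolean internal = m.group("internal") != null;
            if (internal && overlaps(shielded, m.start(), m.end())) {
                continue; // e.g. the full stop in "Dr."
            }
            // Internal splits are part of the sentence; external splits are not.
            int end = internal ? m.end() : m.start();
            addIfNotBlank(sentences, text.substring(start, end));
            start = m.end();
        }
        addIfNotBlank(sentences, text.substring(start));
        return sentences;
    }

    private static boolean overlaps(boolean[] shielded, int from, int to) {
        for (int i = from; i < to; i++) {
            if (shielded[i]) return true;
        }
        return false;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) out.add(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(split("Dr. Smith arrived. He left!\n\nNext para"));
    }
}
```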
Here is a full list of the parameters used by the RegEx Sentence Splitter:
Init-time parameters
-
encoding
-
The character encoding to be used while reading the pattern lists.
-
externalSplitListURL
-
URL for the file containing the list of external split patterns;
-
internalSplitListURL
-
URL for the file containing the list of internal split patterns;
-
nonSplitListURL
-
URL for the file containing the list of non split patterns;
Run-time parameters
-
document
-
The document to be processed.
-
outputASName
-
The name of the annotation set in which the resulting Split and Sentence annotations will be created.
6.6 Part of Speech Tagger [#]
The tagger [Hepple 00] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in Appendix G. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon_cap), and one for texts in all lowercase (lexicon_lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.
The ANNIE Part-of-Speech tagger requires the following parameters.
-
encoding - encoding to be used for reading rules and lexicons (init-time)
-
lexiconURL - The URL for the lexicon file (init-time)
-
rulesURL - The URL for the ruleset file (init-time)
-
document - The document to be processed (run-time)
-
inputASName - The name of the annotation set used for input (run-time)
-
outputASName - The name of the annotation set used for output (run-time). This is an optional parameter. If the user does not provide a value, new annotations are created in the default annotation set.
-
baseTokenAnnotationType - The name of the annotation type that refers to Tokens in a document (run-time, default = Token)
-
baseSentenceAnnotationType - The name of the annotation type that refers to Sentences in a document (run-time, default = Sentence).
-
outputAnnotationType - POS tags are added as category features on the annotations of type ‘outputAnnotationType’ (run-time, default = Token)
-
posTagAllTokens - If set to false, only Tokens within each baseSentenceAnnotationType will be POS tagged (run-time, default = true).
-
failOnMissingInputAnnotations - if set to false, the PR will not fail with an ExecutionException if no input Annotations are found and instead only log a single warning message per session and a debug message per document that has no input annotations (run-time, default = true).
If (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType), then new features are added to the existing annotations of type ‘baseTokenAnnotationType’.
Otherwise, the tagger searches the ‘outputASName’ annotation set for an annotation of type ‘outputAnnotationType’ that has the same offsets as the annotation of type ‘baseTokenAnnotationType’. If it finds one, it adds the new feature to that annotation; if not, it creates a new annotation of type ‘outputAnnotationType’ in the ‘outputASName’ annotation set.
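The rule above amounts to a three-way decision, which the following sketch spells out. It is a paraphrase of the documented behaviour rather than the tagger's source code; the class, the enum and the sameOffsetOutputExists flag (standing for the result of the offset lookup) are hypothetical names. A null annotation set name is taken to mean the default annotation set.

```java
import java.util.Objects;

/** Paraphrase of where the POS tagger writes its category feature; not GATE's own code. */
final class TaggerOutputSketch {
    enum Action { UPDATE_BASE_TOKEN, UPDATE_EXISTING_OUTPUT, CREATE_NEW_OUTPUT }

    static Action decide(String inputASName, String outputASName,
                         String baseTokenAnnotationType, String outputAnnotationType,
                         boolean sameOffsetOutputExists) {
        // Objects.equals treats two null names (both default sets) as equal.
        if (Objects.equals(inputASName, outputASName)
                && outputAnnotationType.equals(baseTokenAnnotationType)) {
            return Action.UPDATE_BASE_TOKEN;
        }
        return sameOffsetOutputExists
                ? Action.UPDATE_EXISTING_OUTPUT
                : Action.CREATE_NEW_OUTPUT;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, null, "Token", "Token", false));
        System.out.println(decide(null, "Tagged", "Token", "Token", false));
    }
}
```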
6.7 Semantic Tagger [#]
ANNIE’s semantic tagger is based on the JAPE language – see Chapter 8. It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.
The default annotation types, features and possible values produced by ANNIE are based on the original MUC entity types, and are as follows:
-
Person
-
gender: male, female
-
-
Location
-
locType: region, airport, city, country, county, province, other
-
-
Organization
-
orgType: company, department, government, newspaper, team, other
-
-
Money
-
Percent
-
Date
-
kind: date, time, dateTime
-
-
Address
-
kind: email, url, phone, postcode, complete, ip, other
-
-
Identifier
-
Unknown
Note that some of these feature values are generated automatically from the gazetteer lists, so if you alter the gazetteer list definition file, these could change. Note also that other annotations, features and values are also created by ANNIE which may be left for debugging purposes: for example, most annotations have a rule feature that gives information about which rule(s) fired to create the annotation. The Unknown annotation type is used by the Orthomatcher module (see 6.8) and consists of any proper noun not already identified.
6.8 Orthographic Coreference (OrthoMatcher) [#]
(Note: this component was previously known as a ‘NameMatcher’.)
The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform coreference. It does not find new named entities as such, but it may assign a type to an unclassified proper name (an Unknown annotation), using the type of a matching name.
The matching rules are only invoked if the names being compared are both of the same type, i.e. both already tagged as (say) organisations, or if one of them is classified as ‘unknown’. This prevents a previously classified name from being recategorised.
6.8.1 GATE Interface
Input – entity annotations, with an id attribute.
Output – matches attributes added to the existing entity annotations.
6.8.2 Resources
A lookup table of aliases is used to record non-matching strings which represent the same entity, e.g. ‘IBM’ and ‘Big Blue’, ‘Coca-Cola’ and ‘Coke’. There is also a table of spurious matches, i.e. matching strings which do not represent the same entity, e.g. ‘BT Wireless’ and ‘BT Cellnet’ (which are two different organizations). The list of tables to be used is a load time parameter of the orthomatcher: a default list is set but can be changed as necessary.
6.8.3 Processing
The wrapper builds an array of the strings, types and IDs of all name annotations, which is then passed to a string comparison function for pairwise comparisons of all entries.
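The way the type restriction and the two lookup tables interact can be sketched as follows. The tables reuse the examples given above; the class name and the sameFirstWord rule are hypothetical, the latter being only a stand-in for the real string comparison rules, which this section does not describe.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrates type checks, aliases and spurious matches; not the OrthoMatcher's own code. */
final class OrthoMatchSketch {
    static final Set<String> ALIASES = pairs("IBM|Big Blue", "Coca-Cola|Coke");
    static final Set<String> SPURIOUS = pairs("BT Wireless|BT Cellnet");

    static boolean corefer(String nameA, String typeA, String nameB, String typeB) {
        // Names are only compared if they share a type or one of them is Unknown.
        boolean comparable = typeA.equals(typeB)
                || typeA.equals("Unknown") || typeB.equals("Unknown");
        if (!comparable) return false;
        String key = key(nameA, nameB);
        if (SPURIOUS.contains(key)) return false; // matching strings, different entities
        if (ALIASES.contains(key)) return true;   // different strings, same entity
        return sameFirstWord(nameA, nameB);
    }

    // Hypothetical stand-in for the real orthographic matching rules.
    private static boolean sameFirstWord(String a, String b) {
        return a.split("\\s+")[0].equalsIgnoreCase(b.split("\\s+")[0]);
    }

    // Order-independent key for a pair of names.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    private static Set<String> pairs(String... entries) {
        Set<String> keys = new HashSet<>();
        for (String entry : entries) {
            String[] names = entry.split("\\|");
            keys.add(key(names[0], names[1]));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(corefer("IBM", "Organization", "Big Blue", "Organization"));
        System.out.println(corefer("BT Wireless", "Organization", "BT Cellnet", "Organization"));
    }
}
```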
6.9 Pronominal Coreference [#]
The pronominal coreference module performs anaphora resolution using the JAPE grammar formalism. Note that this module is not automatically loaded with the other ANNIE modules, but can be loaded separately as a Processing Resource. The main module consists of three submodules:
-
quoted text module
-
pleonastic it module
-
pronominal resolution module
The first two modules are helper submodules for the pronominal one, because they do not perform anything related to coreference resolution except the location of quoted fragments and pleonastic it occurrences in text. They generate temporary annotations which are used by the pronominal submodule (such temporary annotations are removed later).
The main coreference module can operate successfully only if all ANNIE modules were already executed. The module depends on the following annotations created from the respective ANNIE modules:
-
Token (English Tokenizer)
-
Sentence (Sentence Splitter)
-
Split (Sentence Splitter)
-
Location (NE Transducer, OrthoMatcher)
-
Person (NE Transducer, OrthoMatcher)
-
Organization (NE Transducer, OrthoMatcher)
For each pronoun (anaphor) the coreference module generates an annotation of type ‘Coreference’ containing two features:
-
antecedent offset - this is the offset of the starting node for the annotation (entity) which is proposed as the antecedent, or null if no antecedent can be proposed.
-
matches - this is a list of the annotation IDs that make up the coreference chain containing this anaphor/antecedent pair.
6.9.1 Quoted Speech Submodule
The quoted speech submodule identifies quoted fragments in the text being analysed. The identified fragments are used by the pronominal coreference submodule for the proper resolution of pronouns such as I, me, my, etc. which appear in quoted speech fragments. The module produces ‘Quoted Text’ annotations.
The submodule itself is a JAPE transducer which loads a JAPE grammar and builds an FSM over it. The FSM is intended to match the quoted fragments and generate appropriate annotations that will be used later by the pronominal module.
The JAPE grammar consists of only four rules, which create temporary annotations for all punctuation marks that may enclose quoted speech, such as ", ’, ‘, etc. These rules then try to identify fragments enclosed by such punctuation. Finally all temporary annotations generated during the processing, except the ones of type ‘Quoted Text’, are removed (because no other module will need them later).
6.9.2 Pleonastic It Submodule
The pleonastic it submodule matches pleonastic occurrences of ‘it’. Similar to the quoted speech submodule, it is a JAPE transducer operating with a grammar containing patterns that match the most commonly observed pleonastic it constructs.
6.9.3 Pronominal Resolution Submodule
The main functionality of the coreference resolution module is in the pronominal resolution submodule. This uses the result from the execution of the quoted speech and pleonastic it submodules. The module works according to the following algorithm:
-
Preprocess the current document. This step locates the annotations that the submodule needs (such as Sentence, Token, Person, etc.) and prepares the appropriate data structures for them.
-
For each pronoun do the following:
-
inspect the appropriate context for all candidate antecedents for this kind of pronoun;
-
choose the best antecedent (if any);
-
-
Create the coreference chains from the individual anaphor/antecedent pairs and the coreference information supplied by the OrthoMatcher (this step is performed from the main coreference module).
6.9.4 Detailed Description of the Algorithm
Full details of the pronominal coreference algorithm are as follows.
Preprocessing
The preprocessing task includes the following subtasks:
-
Identifying the sentences in the document being processed. The sentences are identified with the help of the Sentence annotations generated from the Sentence Splitter. For each sentence a data structure is prepared that contains three lists. The lists contain the annotations for the person/organization/location named entities appearing in the sentence. The named entities in the sentence are identified with the help of the Person, Location and Organization annotations that are already generated from the Named Entity Transducer and the OrthoMatcher.
-
The gender of each person in the sentence is identified and stored in a global data structure. It is possible that the gender information is missing for some entities - for example, if only the person's family name is observed then the Named Entity transducer will be unable to deduce the gender. In such cases the list of matching entities generated by the OrthoMatcher is inspected, and if any of the orthographic matches contains gender information it is assigned to the entity being processed.
-
The identified pleonastic it occurrences are stored in a separate list. The ‘Pleonastic It’ annotations generated from the pleonastic submodule are used for the task.
-
For each quoted text fragment, identified by the quoted text submodule, a special structure is created that contains the persons and the 3rd person singular pronouns such as ‘he’ and ‘she’ that appear in the sentence containing the quoted text, but not in the quoted text span (i.e. the ones preceding and succeeding the quote).
Pronoun Resolution
This task includes the following subtasks:
Retrieving all the pronouns in the document. Pronouns are represented as annotations of type ‘Token’ with feature ‘category’ having value ‘PRP$’ or ‘PRP’. The former classifies possessive adjectives such as my, your, etc. and the latter classifies personal, reflexive etc. pronouns. The two types of pronouns are combined in one list and sorted according to their offset in the text.
For each pronoun in the list the following actions are performed:
-
If the pronoun is ‘it’, then the module performs a check to determine if this is a pleonastic occurrence. If it is, then no further attempt for resolution is made.
-
The proper context is determined. The context size is expressed in the number of sentences it will contain. The context always includes the current sentence (the one containing the pronoun), the preceding sentence and zero or more preceding sentences.
-
Depending on the type of pronoun, a set of candidate antecedents is proposed. The candidate set includes the named entities that are compatible with this pronoun. For example if the current pronoun is she then only the Person annotations with ‘gender’ feature equal to ‘female’ or ‘unknown’ will be considered as candidates.
-
From all candidates, one is chosen according to evaluation criteria specific for the pronoun.
Coreference Chain Generation
This step is actually performed by the main module. After executing each of the submodules on the current document, the coreference module follows the steps:
-
Retrieves the anaphor/antecedent pairs generated from them.
-
For each pair, the orthographic matches (if any) of the antecedent entity are retrieved and then extended with the anaphor of the pair (i.e. the pronoun). The result is the coreference chain for the entity. The coreference chain contains the IDs of the annotations (entities) that co-refer.
-
A new Coreference annotation is created for each chain. The annotation contains a single feature ‘matches’ whose value is the coreference chain (the list with IDs). The annotations are exported in a pre-specified annotation set.
The resolution of she, her, her$, he, him, his, herself and himself is similar because an analysis of a corpus showed that these pronouns are related to their antecedents in a similar manner. The characteristics of the resolution process are:
-
Context inspected is not very big - cases where the antecedent is found more than 3 sentences back from the anaphor are rare.
-
Recency factor is heavily used - the candidate antecedents that appear closer to the anaphor in the text are scored better.
-
Anaphora have higher priority than cataphora. If there is an anaphoric candidate and a cataphoric one, then the anaphoric one is preferred, even if the recency factor scores the cataphoric candidate better.
The resolution process performs the following steps:
-
Inspect the context of the anaphor for candidate antecedents. Every Person annotation is considered to be a candidate. Cases where she/her refers to an inanimate entity (a ship, for example) are not handled.
-
For each candidate perform a gender compatibility check - only candidates having ‘gender’ feature equal to ‘unknown’ or compatible with the pronoun are considered for further evaluation.
-
Evaluate each candidate against the best candidate so far. If the two candidates are both anaphoric for the pronoun then choose the one that appears closer. The same holds for the case where the two candidates are both cataphoric relative to the pronoun. If one is anaphoric and the other is cataphoric then choose the former, even if the latter appears closer to the pronoun.
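The three steps above can be sketched as a single pass over the candidates. This is a simplified illustration, not the module's implementation: the Candidate class is hypothetical, positions are reduced to a single character offset, and a missing gender is assumed to be stored as the string ‘unknown’.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified antecedent choice for she/he-type pronouns; not GATE's own code. */
final class AntecedentChoiceSketch {

    /** Hypothetical stand-in for a Person annotation. */
    static final class Candidate {
        final int offset;
        final String gender; // "male", "female" or "unknown"

        Candidate(int offset, String gender) {
            this.offset = offset;
            this.gender = gender;
        }
    }

    /** Returns the preferred antecedent, or null if no candidate is compatible. */
    static Candidate choose(int pronounOffset, String pronounGender,
                            List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            boolean compatible =
                    c.gender.equals("unknown") || c.gender.equals(pronounGender);
            if (!compatible) continue;
            if (best == null || better(c, best, pronounOffset)) best = c;
        }
        return best;
    }

    private static boolean better(Candidate c, Candidate best, int pronounOffset) {
        boolean cAnaphoric = c.offset < pronounOffset;
        boolean bestAnaphoric = best.offset < pronounOffset;
        // An anaphoric candidate beats a cataphoric one regardless of distance.
        if (cAnaphoric != bestAnaphoric) return cAnaphoric;
        // Otherwise the candidate closer to the pronoun wins (recency).
        return Math.abs(c.offset - pronounOffset) < Math.abs(best.offset - pronounOffset);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate(10, "female"),
                new Candidate(40, "male"),
                new Candidate(60, "female"));
        Candidate chosen = choose(50, "female", candidates);
        System.out.println("antecedent offset: " + chosen.offset);
    }
}
```

In the usage example the candidate at offset 60 is closer to the pronoun, but the one at offset 10 is chosen because it precedes the pronoun.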
Resolution of ‘it’, ‘its’, ‘itself’
This set of pronouns also shares many common characteristics. The resolution process contains certain differences with the one for the previous set of pronouns. Successful resolution for it, its, itself is more difficult because of the following factors:
-
There is no gender compatibility restriction. In the case in which there are several candidates in the context, the gender compatibility restriction is very useful for rejecting some of the candidates. When no such restriction exists, and with the lack of any syntactic or ontological information about the entities in the context, the recency factor plays the major role in choosing the best antecedent.
-
The number of nominal antecedents (i.e. entities that are not referred by name) is much higher compared to the number of such antecedents for she, he, etc. In this case trying to find an antecedent only amongst named entities degrades the precision a lot.
Resolution of ‘I’, ‘me’, ‘my’, ‘myself’
Resolution of these pronouns is dependent on the work of the quoted speech submodule. One important difference from the resolution process of other pronouns is that the context is not measured in sentences but depends solely on the quote span. Another difference is that the context is not contiguous - the quoted fragment itself is excluded from the context, because it is unlikely that an antecedent for I, me, etc. appears there. The context itself consists of:
-
the part of the sentence where the quoted fragment originates, that is not contained in the quote - i.e. the text prior to the quote;
-
the part of the sentence where the quoted fragment ends, that is not contained in the quote - i.e. the text following the quote;
-
the part of the sentence preceding the sentence where the quote originates, which is not included in another quote.
It is worth noting that, contrary to other pronouns, the antecedent for I, me, my and myself is most often cataphoric or, if anaphoric, is not in the same sentence as the quoted fragment.
The resolution algorithm consists of the following steps:
-
Locate the quoted fragment description that contains the pronoun. If the pronoun is not contained in any fragment then return without proposing an antecedent.
-
Inspect the context for the quoted fragment (as defined above) for candidate antecedents. The candidates are annotations of type Person, or annotations of type Token with features category = ‘PRP’, string = ‘she’ or category = ‘PRP’, string = ‘he’.
-
Try to locate a candidate in the text succeeding the quoted fragment (first pattern). If more than one candidate is present, choose the closest to the end of the quote. If a candidate is found then propose it as antecedent and exit.
-
Try to locate a candidate in the text preceding the quoted fragment (third pattern). Choose the closest one to the beginning of the quote. If found then set as antecedent and exit.
-
Try to locate antecedents in the unquoted part of the sentence preceding the sentence where the quote starts (second pattern). Give preference to the one closest to the end of the quote (if any) in the preceding sentence or closest to the sentence beginning.
6.10 A Walk-Through Example [#]
Let us take an example of a 3-stage procedure using the tokeniser, gazetteer and named-entity grammar. Suppose we wish to recognise the phrase ‘800,000 US dollars’ as an entity of type ‘Number’, with the feature ‘money’.
First of all, we give an example of a grammar rule (and corresponding macros) for money, which would recognise this type of pattern.
Macro: MILLION_BILLION
({Token.string == "m"}|
 {Token.string == "million"}|
 {Token.string == "b"}|
 {Token.string == "billion"}
)

Macro: AMOUNT_NUMBER
({Token.kind == number}
 (({Token.string == ","}|
   {Token.string == "."})
  {Token.kind == number})*
 (({SpaceToken.kind == space})?
  (MILLION_BILLION)?)
)

Rule: Money1
// e.g. 30 pounds
(
 (AMOUNT_NUMBER)
 ({SpaceToken.kind == space})?
 ({Lookup.majorType == currency_unit})
)
:money -->
 :money.Number = {kind = "money", rule = "Money1"}
6.10.1 Step 1 - Tokenisation
The tokeniser separates this phrase into the following tokens. In general, a word is comprised of any number of letters of either case, including a hyphen, but nothing else; a number is composed of any sequence of digits; punctuation is recognised individually (each character is a separate token), and any number of consecutive spaces and/or control characters are recognised as a single spacetoken.
Token, string = ‘800’, kind = number, length = 3
Token, string = ‘,’, kind = punctuation, length = 1
Token, string = ‘000’, kind = number, length = 3
SpaceToken, string = ‘ ’, kind = space, length = 1
Token, string = ‘US’, kind = word, length = 2, orth = allCaps
SpaceToken, string = ‘ ’, kind = space, length = 1
Token, string = ‘dollars’, kind = word, length = 7, orth = lowercase
6.10.2 Step 2 - List Lookup
The gazetteer lists are then searched to find all occurrences of matching words in the text. It finds the following match for the string ‘US dollars’:
Lookup, minorType = post_amount, majorType = currency_unit
6.10.3 Step 3 - Grammar Rules
The grammar rule for money is then invoked. The macro MILLION_BILLION recognises any of the strings ‘m’, ‘million’, ‘b’, ‘billion’. Since none of these occurs in the text, processing passes on to the next macro. The AMOUNT_NUMBER macro recognises a number, optionally followed by any number of sequences of the form ‘dot or comma plus number’, followed by an optional space and an optional MILLION_BILLION. In this case, ‘800,000’ will be recognised. Finally, the rule Money1 is invoked. This recognises the string identified by the AMOUNT_NUMBER macro, followed by an optional space, followed by a unit of currency (as determined by the gazetteer). In this case, ‘US dollars’ has been identified as a currency unit, so the rule Money1 recognises the entire string ‘800,000 US dollars’. Following the rule, it will be annotated as a Number entity of type Money:
Number, kind = money, rule = Money1
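For readers more familiar with regular expressions than with JAPE, the walk-through can be approximated by a regular expression over the raw text. This is only a rough analogue: JAPE matches over Token, SpaceToken and Lookup annotations rather than characters, and the short currency alternation below is a hypothetical stand-in for the gazetteer lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rough regular-expression analogue of the Money1 rule; JAPE itself works on annotations. */
final class MoneyRegexSketch {
    // Counterpart of the MILLION_BILLION macro.
    private static final String MILLION_BILLION = "(?:million|m|billion|b)";
    // Counterpart of AMOUNT_NUMBER: digits, optional ",ddd" or ".ddd" groups, optional scale word.
    private static final String AMOUNT_NUMBER =
            "\\d+(?:[.,]\\d+)*(?: ?" + MILLION_BILLION + ")?";
    // Hypothetical stand-in for {Lookup.majorType == currency_unit}.
    private static final String CURRENCY_UNIT = "(?:US dollars|pounds|euros)";

    static final Pattern MONEY = Pattern.compile(AMOUNT_NUMBER + " ?" + CURRENCY_UNIT);

    /** Returns the first money expression in the text, or null if there is none. */
    static String firstMoney(String text) {
        Matcher m = MONEY.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMoney("It cost 800,000 US dollars in total."));
    }
}
```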
Part II
GATE for Advanced Users [#]
Chapter 7
GATE Embedded [#]
7.1 Quick Start with GATE Embedded [#]
Embedding GATE-based language processing in other applications using GATE Embedded (the GATE API) is straightforward:
-
add the GATE libraries to your application’s classpath.
-
if you use a build tool with dependency management, such as Maven or Gradle, add a dependency on the right version of uk.ac.gate:gate-core – this is the recommended way to build against the GATE APIs.
-
if you can’t use a dependency manager, you can instead add all the JAR files from the lib directory of a GATE installation to your compile classpath in your build tool.
-
-
initialise GATE with gate.Gate.init();
-
program to the framework API.
For example, this code will create the default ANNIE extraction system, the same as the “load ANNIE” button in GATE Developer:
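import gate.Gate;
import gate.creole.ANNIEConstants;
import gate.creole.Plugin;
import gate.creole.ResourceReference;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;

// initialise GATE before any other API call
Gate.init();

// load the ANNIE plugin
Plugin anniePlugin = new Plugin.Maven(
    "uk.ac.gate.plugins", "annie", gate.Main.version);
Gate.getCreoleRegister().registerPlugin(anniePlugin);

// load ANNIE application from inside the plugin
SerialAnalyserController controller = (SerialAnalyserController)
    PersistenceManager.loadObjectFromUrl(new ResourceReference(
        anniePlugin, "resources/" + ANNIEConstants.DEFAULT_FILE)
        .toURL());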
If you want to use resources from any plugins, you need to load the plugins before calling createResource:
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.Plugin;

// need Tools plugin for the Morphological analyser
// (GATE must already have been initialised with Gate.init())
Gate.getCreoleRegister().registerPlugin(new Plugin.Maven(
    "uk.ac.gate.plugins", "tools", gate.Main.version));

...

ProcessingResource morpher = (ProcessingResource)
    Factory.createResource("gate.creole.morph.Morph");
Instead of creating your processing resources individually using the Factory, you can create your application in GATE Developer, save it using the ‘save application state’ option (see Section 3.9.3), and then load the saved state from your code. This will automatically reload any plugins that were loaded when the state was saved; you do not need to load them manually.
import java.io.File;
import gate.CorpusController;
import gate.util.persistence.PersistenceManager;

CorpusController controller = (CorpusController)
    PersistenceManager.loadObjectFromFile(new File("savedState.xgapp"));
There are many examples of using GATE Embedded available at:
http://gate.ac.uk/wiki/code-repository/.
See Section 2.3 for details of the system properties GATE uses to find its configuration files.
7.2 Resource Management in GATE Embedded [#]
As outlined earlier, GATE defines three different types of resources:
-
Language Resources
-
(LRs) entities that hold linguistic data.
-
Processing Resources
-
(PRs) entities that process data.
-
Visual Resources
-
(VRs) components used for building graphical interfaces.
These resources are collectively named CREOLE1 resources.
All CREOLE resources have some associated meta-data in the form of annotations on the resource class and some of its methods. The most important role of that meta-data is to specify the set of parameters that a resource understands, which of them are required and which not, if they have default values and what those are. See Section 4.7 for full details of the configuration mechanism.
All resource types have creation-time parameters that are used during the initialisation phase. Processing Resources also have run-time parameters that get used during execution (see Section 7.5 for more details).
Controllers are used to define GATE applications and have the role of controlling the execution flow (see Section 7.6 for more details).
This section describes how to create and delete CREOLE resources as objects in a running Java virtual machine. This process involves using GATE’s Factory class2, and, in the case of LRs, may also involve using a DataStore.
CREOLE resources are Java Beans; creation of a resource object involves using a default constructor, then setting parameters on the bean, then calling an init() method. The Factory takes care of all this, makes sure that the GATE Developer GUI is told about what is happening (when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. A programmer using GATE Embedded should never call the constructor of a resource: always use the Factory!
Creating a resource involves providing the following information:
-
fully qualified class name for the resource. This is the only required value. For all the rest, defaults will be used if actual values are not provided.
-
values for the creation time parameters.†
-
initial values for resource features.† For an explanation on features see Section 7.4.2.
-
a name for the new resource;
† Parameters and features need to be provided in the form of a GATE Feature Map, which is essentially a Java Map (java.util.Map) implementation; see Section 7.4.2 for more details on Feature Maps.
Creating a resource via the Factory involves passing values for any create-time parameters that require setting to the Factory’s createResource method. If no parameters are passed, the defaults are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:
Gate.getCreoleRegister().registerPlugin(new Plugin.Maven(
    "uk.ac.gate.plugins", "annie", gate.Main.version));
// an empty map means that the default parameter values are used
FeatureMap params = Factory.newFeatureMap();
ProcessingResource tagger = (ProcessingResource)
    Factory.createResource("gate.creole.POSTagger", params);
Note that if the resource created here had any parameters that were both mandatory and had no default value, the createResource call would throw an exception. In the case of the POS tagger, all the required parameters have default values so no params need to be passed in.
When creating a Document, however, the URL of the source for the document must be provided3. For example:
// u is a java.net.URL pointing at the document source, defined beforehand
FeatureMap params = Factory.newFeatureMap();
params.put("sourceUrl", u);
Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", params);
Note that the document created here is transient: when you quit the JVM the document will no longer exist. If you want the document to be persistent, you need to store it in a DataStore (see Section 7.4.5).
Apart from createResource() methods with different signatures, Factory also provides some shortcuts for common operations, listed in table 7.1.
Method | Purpose |
newFeatureMap() | Creates a new Feature Map (as used in the example above). |
newDocument(String content) | Creates a new GATE Document starting from a String value that will be used to generate the document content. |
newDocument(URL sourceUrl) | Creates a new GATE Document using the text pointed to by a URL to generate the document content. |
newDocument(URL sourceUrl, String encoding) | Same as above, but allows the specification of an encoding to be used while downloading the document content. |
newCorpus(String name) | Creates a new GATE Corpus with a specified name. |
GATE maintains various data structures that allow the retrieval of loaded resources. When a resource is no longer required, it needs to be removed from those structures in order to remove all references to it, thus making it a candidate for garbage collection. This is achieved using the deleteResource(Resource res) method on Factory.
Simply removing all references to a resource from the user code will NOT be enough to make the resource eligible for garbage collection. Not calling Factory.deleteResource() will lead to memory leaks!
7.3 Using CREOLE Plugins
As shown in the examples above, in order to use a CREOLE resource the relevant CREOLE plugin must be loaded. Processing Resources, Visual Resources and Language Resources other than Document, Corpus and DataStore all require that the appropriate plugin is first loaded. When using Document, Corpus or DataStore, you do not need to first load a plugin. The API calls listed in table 7.2 are relevant to working with CREOLE plugins.
Class gate.Gate
Method | Purpose |
public static void addKnownPlugin(Plugin plugin) | Adds the plugin to the list of known plugins. |
public static void removeKnownPlugin(Plugin plugin) | Tells the system to ‘forget’ about a previously known plugin. If the specified plugin was loaded, it will be unloaded as well, i.e. all the metadata relating to resources defined by this plugin will be removed from memory. |
public static void addAutoloadPlugin(Plugin plugin) | Adds a new plugin to the list of plugins that are loaded automatically at start-up. |
public static void removeAutoloadPlugin(Plugin plugin) | Tells the system to remove a plugin from the list of plugins that are loaded automatically at system start-up. This will be reflected in the user’s configuration data file. |
Class gate.CreoleRegister
Method | Purpose |
public void registerPlugin(Plugin plugin) | Loads a new CREOLE plugin. The new plugin is added to the list of known plugins if not already there. |
public void unregisterPlugin(Plugin plugin) | Unloads a loaded CREOLE plugin. |
There are several different subclasses of Plugin that can be passed to these methods. The most common one is Plugin.Maven, as seen in the examples above, which is a plugin that is a single JAR file specified via its group:artifact:version “coordinates”, and which is downloaded from a Maven repository at runtime by GATE the first time the plugin is loaded. The vast majority of standard GATE plugins are of this type. To load version 8.5 of the ANNIE plugin, for example, you would use:
Gate.getCreoleRegister().registerPlugin(new Plugin.Maven(
  "uk.ac.gate.plugins", "annie", "8.5"));
By default GATE looks in the Central Repository and in the GATE repository (http://repo.gate.ac.uk/content/groups/public/, where we deploy snapshot builds of the standard plugins), plus any repositories declared in active profiles in the normal Maven settings.xml file. Mirror and proxy settings from this file are also respected.
In addition to Maven plugins, GATE still supports the style of plugins used in GATE version 8.4.1 and earlier where the plugin is a directory on disk which contains a creole.xml configuration file and optionally one or more JAR files containing the compiled classes of the plugin’s CREOLE resources. These plugins are represented by the class Plugin.Directory, with a URL pointing to the directory that contains the creole.xml file:
Gate.getCreoleRegister().registerPlugin(new Plugin.Directory(
  new URL("file:/home/example/my-plugins/FishCounter/")));
Finally, if you are writing a GATE Embedded application and have a single resource class that will only be used from your embedded code (and so does not need to be distributed as a complete plugin), and all the configuration for that resource is provided as Java annotations on the class, then it is possible to register the class as a special type of Plugin called a “component”, represented by the class Plugin.Component.
Note that components cannot be registered this way in the developer GUI, and cannot be included in saved application states (see section 7.9 below).
7.4 Language Resources
This section describes the implementation of documents and corpora in GATE.
7.4.1 GATE Documents
Documents are modelled as content plus annotations (see Section 7.4.4) plus features (see Section 7.4.2).
The content of a document can be any implementation of the gate.DocumentContent interface; the features are <attribute, value> pairs stored in a Feature Map. Attributes are String values while the values can be any Java object.
The annotations are grouped in sets (see section 7.4.3). A document has a default (anonymous) annotation set and any number of named annotation sets.
Documents are defined by the gate.Document interface and there is also a provided implementation:
- gate.corpora.DocumentImpl: transient document. Can be stored persistently through Java serialisation.
Main Document functions are presented in table 7.3.
Content Manipulation
Method | Purpose |
DocumentContent getContent() | Gets the Document content. |
void edit(Long start, Long end, DocumentContent replacement) | Modifies the Document content. |
void setContent(DocumentContent newContent) | Replaces the entire content. |
Annotations Manipulation
Method | Purpose |
public AnnotationSet getAnnotations() | Returns the default annotation set. |
public AnnotationSet getAnnotations(String name) | Returns a named annotation set. |
public Map getNamedAnnotationSets() | Returns all the named annotation sets. |
void removeAnnotationSet(String name) | Removes a named annotation set. |
Input Output
Method | Purpose |
String toXml() | Serialises the Document in XML format. |
String toXml(Set aSourceAnnotationSet, boolean includeFeatures) | Generates XML from a set of annotations only, trying to preserve the original format of the file used to create the document. |
7.4.2 Feature Maps
All CREOLE resources as well as the Controllers and the annotations can have attached meta-data in the form of Feature Maps.
A Feature Map is a Java Map (i.e. it implements the java.util.Map interface) and holds <attribute-name, attribute-value> pairs. The attribute names are Strings while the values can be any Java Objects.
The use of non-Serialisable objects as values is strongly discouraged.
Feature Maps are created using the gate.Factory.newFeatureMap() method.
The actual implementation for FeatureMaps is provided by the gate.util.SimpleFeatureMapImpl class.
Objects that have features in GATE implement the gate.util.FeatureBearer interface which has only the two accessor methods for the object features: FeatureMap getFeatures() and void setFeatures(FeatureMap features).
Getting a particular feature from an object
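A minimal sketch of looking up a single feature on an arbitrary object. So that it compiles without GATE on the classpath, it declares a stand-in FeatureBearer interface whose features are a plain java.util.Map; that stand-in, the FeatureLookupSketch class and the featureOf helper are illustrative assumptions, not GATE API. With GATE available, the same pattern should carry over to gate.util.FeatureBearer and the FeatureMap it returns.

```java
import java.util.HashMap;
import java.util.Map;

class FeatureLookupSketch {

  // Stand-in for gate.util.FeatureBearer (assumption: reduced to the getter).
  interface FeatureBearer {
    Map<Object, Object> getFeatures();
  }

  // Returns the named feature, or null if obj does not bear features
  // or has no feature with that name.
  static Object featureOf(Object obj, String featureName) {
    if (!(obj instanceof FeatureBearer)) {
      return null;
    }
    Map<Object, Object> features = ((FeatureBearer) obj).getFeatures();
    return (features == null) ? null : features.get(featureName);
  }

  public static void main(String[] args) {
    Map<Object, Object> features = new HashMap<>();
    features.put("length", 42);
    FeatureBearer bearer = () -> features;

    System.out.println(featureOf(bearer, "length"));
    System.out.println(featureOf("not a feature bearer", "length"));
  }
}
```

The null checks matter because not every object bears features and a feature map may be unset.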
7.4.3 Annotation Sets
A GATE document can have one or more annotation layers: an anonymous one (also called the default) and as many named ones as necessary.
An annotation layer is organised as a Directed Acyclic Graph (DAG) in which the nodes are particular locations (anchors) in the document content and the arcs are annotations reaching from the location indicated by the start node to the one indicated by the end node (see Figure 7.1 for an illustration). Because of the graph metaphor, the annotation layers are also called annotation graphs. In terms of Java objects, the annotation layers are represented using the Set paradigm as defined by the collections library and they are hence named annotation sets. The terms annotation layer, graph and set are interchangeable and refer to the same concept when used in this book.
An annotation set holds a number of annotations and maintains a series of indices in order to provide fast access to the contained annotations.
The GATE Annotation Sets are defined by the gate.AnnotationSet interface and there is a default implementation provided:
- gate.annotation.AnnotationSetImpl: annotation set implementation used by transient documents.
The annotation sets are created by the document as required. The first time a particular annotation set is requested from a document it will be transparently created if it doesn’t exist.
Tables 7.4 and 7.5 list the most used Annotation Set functions.
Annotations Manipulation
Method | Purpose |
Integer add(Long start, Long end, String type, FeatureMap features) | Creates a new annotation between two offsets, adds it to this set and returns its id. |
Integer add(Node start, Node end, String type, FeatureMap features) | Creates a new annotation between two nodes, adds it to this set and returns its id. |
boolean remove(Object o) | Removes an annotation from this set. |
Nodes
Method | Purpose |
Node firstNode() | Gets the node with the smallest offset. |
Node lastNode() | Gets the node with the largest offset. |
Node nextNode(Node node) | Gets the first node that is relevant for this annotation set and which has an offset larger than that of the node provided. |
Set implementation
Method | Purpose |
Iterator iterator() | Returns an iterator over the annotations in this set. |
int size() | Returns the number of annotations in this set. |
Searching
Method | Purpose |
AnnotationSet get(Long offset) | Selects annotations by offset. This returns the set of annotations that start at the first node whose offset is greater than or equal to offset. If a positional index doesn’t exist it is created. If there are no nodes at or beyond the offset parameter then it will return null. |
AnnotationSet get(Long startOffset, Long endOffset) | Selects annotations by offset. This returns the set of annotations that overlap totally or partially with the interval defined by the two provided offsets. |
AnnotationSet get(String type) | Returns all annotations of the specified type. |
AnnotationSet get(Set types) | Returns all annotations of the specified types. |
AnnotationSet get(String type, FeatureMap constraints) | Selects annotations by type and features. |
Set getAllTypes() | Gets a set of java.lang.String objects representing all the annotation types present in this annotation set. |
AnnotationSet getContained(Long startOffset, Long endOffset) | Selects annotations contained within an interval, i.e. annotations that start at or after startOffset and end at or before endOffset. |
AnnotationSet getCovering(String neededType, Long startOffset, Long endOffset) | Selects annotations of the given type that completely span the range. |
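The three interval-based searches differ only in how an annotation's offsets must relate to the query interval. The sketch below models that difference with plain JDK types so it runs without GATE: Span is a stand-in for an annotation, and the class and method names are illustrative assumptions. Boundary handling (for example zero-length annotations) in GATE's own implementation may differ from these predicates.

```java
import java.util.ArrayList;
import java.util.List;

class SpanQuerySketch {

  // Stand-in for an annotation: a type plus start and end offsets.
  static final class Span {
    final String type;
    final long start;
    final long end;

    Span(String type, long start, long end) {
      this.type = type;
      this.start = start;
      this.end = end;
    }
  }

  // Spans that overlap the interval totally or partially.
  static List<Span> overlapping(List<Span> spans, long start, long end) {
    List<Span> result = new ArrayList<>();
    for (Span s : spans) {
      if (s.start < end && s.end > start) {
        result.add(s);
      }
    }
    return result;
  }

  // Spans that lie entirely inside the interval.
  static List<Span> contained(List<Span> spans, long start, long end) {
    List<Span> result = new ArrayList<>();
    for (Span s : spans) {
      if (s.start >= start && s.end <= end) {
        result.add(s);
      }
    }
    return result;
  }

  // Spans of the given type that stretch over the whole interval.
  static List<Span> covering(List<Span> spans, String type,
                             long start, long end) {
    List<Span> result = new ArrayList<>();
    for (Span s : spans) {
      if (type.equals(s.type) && s.start <= start && s.end >= end) {
        result.add(s);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Span> spans = new ArrayList<>();
    spans.add(new Span("Token", 0, 5));
    spans.add(new Span("Token", 6, 11));
    spans.add(new Span("Sentence", 0, 11));
    spans.add(new Span("Person", 3, 8));

    System.out.println(overlapping(spans, 4, 9).size());
    System.out.println(contained(spans, 0, 8).size());
    System.out.println(covering(spans, "Sentence", 4, 9).size());
  }
}
```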
Iterating from left to right over all annotations of a given type
// annSet is the AnnotationSet to search
String type = "Person";
// Get all person annotations
AnnotationSet persSet = annSet.get(type);
// Sort the annotations by offset
List<Annotation> persList = new ArrayList<Annotation>(persSet);
Collections.sort(persList, new gate.util.OffsetComparator());
// Iterate
Iterator<Annotation> persIter = persList.iterator();
while(persIter.hasNext()){
  ...
}
7.4.4 Annotations
An annotation is a form of meta-data attached to a particular section of document content. The connection between the annotation and the content it refers to is made by means of two pointers that represent the start and end locations of the covered content. An annotation must also have a type (or a name) which is used to create classes of similar annotations, usually linked together by their semantics.
An Annotation is defined by:
- start node: a location in the document content defined by an offset.
- end node: a location in the document content defined by an offset.
- type: a String value.
- features: (see Section 7.4.2).
- ID: an Integer value. All annotation IDs are unique inside an annotation set.
In GATE Embedded, annotations are defined by the gate.Annotation interface and implemented by the gate.annotation.AnnotationImpl class. Annotations exist only as members of annotation sets (see Section 7.4.3) and they should not be directly created by means of a constructor. Their creation should always be delegated to the containing annotation set.
7.4.5 GATE Corpora
A corpus in GATE is a Java List (i.e. an implementation of java.util.List) of documents. GATE corpora are defined by the gate.Corpus interface and the following implementations are available:
- gate.corpora.CorpusImpl: used for transient corpora.
- gate.corpora.SerialCorpusImpl: used for persistent corpora that are stored in a serial datastore (i.e. as a directory in a file system).
Apart from implementing the standard List methods, a Corpus also provides the methods in table 7.6.
Method | Purpose |
String getDocumentName(int index) | Gets the name of a document in this corpus. |
List getDocumentNames() | Gets the names of all the documents in this corpus. |
void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories) | Fills this corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (gate.util.ExtensionFileFilter). |
void populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, DocType documentType) | Fills the provided corpus with documents extracted from the provided single concatenated file. Uses the content between the start and end of the element specified by documentRootElement for each document. The parameter documentType specifies whether the resulting files are html, xml or of any other type. The user can also restrict the number of documents to extract by providing the relevant value for the numberOfDocumentsToExtract parameter. |
Creating a corpus from all XML files in a directory
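The file selection that populate relies on is an ordinary java.io.FileFilter, so it can be sketched with the JDK alone. The code below is not GATE's ExtensionFileFilter: the ExtensionFilterSketch class, its method names and the choice to compare extensions case-insensitively are assumptions made for illustration. A filter built this way is the kind of object the first populate method takes as its filter argument.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class ExtensionFilterSketch {

  // Accepts regular files whose name ends with ".<extension>",
  // ignoring case (an assumption of this sketch).
  static FileFilter byExtension(String extension) {
    String suffix = "." + extension.toLowerCase(Locale.ROOT);
    return file -> file.isFile()
        && file.getName().toLowerCase(Locale.ROOT).endsWith(suffix);
  }

  // Names of the files in a directory that the filter accepts, sorted.
  static List<String> selectNames(File directory, FileFilter filter) {
    List<String> names = new ArrayList<>();
    File[] files = directory.listFiles(filter);
    if (files != null) {
      for (File f : files) {
        names.add(f.getName());
      }
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    // Directory to scan: first argument, or the current directory.
    File directory = new File(args.length > 0 ? args[0] : ".");
    for (String name : selectNames(directory, byExtension("xml"))) {
      System.out.println(name);
    }
  }
}
```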
Using a DataStore
Assuming that you have a DataStore already open called myDataStore, this code will ask the datastore to take over persistence of your document, and to synchronise the memory representation of the document with the disk storage:
Document persistentDoc = myDataStore.adopt(doc, mySecurity);
myDataStore.sync(persistentDoc);
When you want to restore a document (or other LR) from a datastore, you make the same createResource call to the Factory as for the creation of a transient resource, but this time you tell it the datastore the resource came from, and the ID of the resource in that datastore:
// u is the URL of the serial datastore directory
SerialDataStore sds = new SerialDataStore(u.toString());
sds.open();

// getLrIds returns a list of LR Ids, so we get the first one
Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0);

// we need to tell the factory about the LR’s ID in the data
// store, and about which datastore it is in - we do this
// via a feature map:
FeatureMap features = Factory.newFeatureMap();
features.put(DataStore.LR_ID_FEATURE_NAME, lrId);
features.put(DataStore.DATASTORE_FEATURE_NAME, sds);

// read the document back
Document doc = (Document)
  Factory.createResource("gate.corpora.DocumentImpl", features);
7.5 Processing Resources
Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers.
They are created using the GATE Factory in a manner similar to Language Resources. Besides the creation-time parameters they also have a set of run-time parameters that are set by the system just before executing them.
Analysers are a particular type of processing resources in the sense that they always have a document and a corpus among their run-time parameters.
The most used methods for Processing Resources are presented in table 7.7.
Method | Purpose |
void setParameterValue(String paramaterName, Object parameterValue) | Sets the value for a specified parameter. Method inherited from gate.Resource. |
void setParameterValues(FeatureMap parameters) | Sets the values for several parameters in one step. Method inherited from gate.Resource. |
Object getParameterValue(String paramaterName) | Gets the value of a named parameter of this resource. Method inherited from gate.Resource. |
Resource init() | Initialises this resource and returns it. Method inherited from gate.Resource. |
void reInit() | Reinitialises the processing resource. After calling this method the resource should be in the state it is after calling init. If the resource depends on external resources (such as rules files) then the resource will re-read those resources. If the data used to create the resource has changed since the resource has been created then the resource will change too after calling reInit(). |
void execute() | Starts the execution of this Processing Resource. |
void interrupt() | Notifies this PR that it should stop its execution as soon as possible. |
boolean isInterrupted() | Checks whether this PR has been interrupted since the last time its Executable.execute() method was called. |
7.6 Controllers
Controllers are used to create GATE applications. A Controller handles a set of Processing Resources and can execute them following a particular strategy. GATE provides a series of serial controllers (i.e. controllers that run their PRs in sequence):
- gate.creole.SerialController: a serial controller that takes any kind of PRs.
- gate.creole.SerialAnalyserController: a serial controller that only accepts Language Analysers as member PRs.
- gate.creole.ConditionalSerialController: a serial controller that accepts all types of PRs and that allows the inclusion or exclusion of member PRs from the execution chain according to certain run-time conditions (currently features on the document being processed are used).
- gate.creole.ConditionalSerialAnalyserController: a serial controller that only accepts Language Analysers and that allows the conditional run of member PRs.
- gate.creole.RealtimeCorpusController: a SerialAnalyserController that allows you to specify graceful and timeout parameters (times in milliseconds). If processing for a document takes longer than the amount of time specified for graceful, then the controller will attempt to gracefully end it by sending an interrupt request to it. If the graceful parameter is ‘-1’ then no attempt to gracefully end it is made. If processing takes longer than the amount of time specified for the timeout parameter, it will be forcibly terminated and the controller will move on to the next document. The parameter suppressExceptions controls whether time-outs and other exceptions will be suppressed or passed on to the caller: if this parameter is set to ‘true’, then any exception or timeout will simply cause the controller to move on to the next document rather than failing the entire corpus processing. If the parameter is set to ‘false’, both time-outs and exceptions will be passed on as exceptions to the caller.
Additionally there is a scriptable controller provided by the Groovy plugin. See section 7.16.3 for details.
Creating an ANNIE application and running it over a corpus
// load the ANNIE plugin
Plugin anniePlugin = new Plugin.Maven(
  "uk.ac.gate.plugins", "annie", gate.Main.version);
Gate.getCreoleRegister().registerPlugin(anniePlugin);

// create a serial analyser controller to run ANNIE with
SerialAnalyserController annieController =
  (SerialAnalyserController) Factory.createResource(
    "gate.creole.SerialAnalyserController",
    Factory.newFeatureMap(),
    Factory.newFeatureMap(), "ANNIE");

// load each PR as defined in ANNIEConstants
// Note this code is for demonstration purposes only,
// in practice if you want to load the ANNIE app you
// should use the PersistenceManager as shown at the
// start of this chapter
for(int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
  // use default parameters
  FeatureMap params = Factory.newFeatureMap();
  ProcessingResource pr = (ProcessingResource)
    Factory.createResource(ANNIEConstants.PR_NAMES[i],
                           params);
  // add the PR to the pipeline controller
  annieController.add(pr);
} // for each ANNIE PR

// Tell ANNIE’s controller about the corpus you want to run on
Corpus corpus = ...;
annieController.setCorpus(corpus);
// Run ANNIE
annieController.execute();
7.7 Modelling Relations between Annotations
Most text processing tasks in GATE model metadata associated with text snippets as annotations. In some cases, however, it is useful to have another layer of metadata, associated with the annotations themselves. One such case is the modelling of relations between annotations. One typical example of relations between annotations is that of co-reference. Two annotations of type Person may be referring to the same actual person; in this case the two annotations are said to be co-referring.
Starting with version 7.1, GATE Embedded supports the representation of relations between annotations. A relation set is associated with, and accessed via, an annotation set. All members of a relation must be either annotations from the associated annotation set or other relations within the same set. The classes supporting relations can be found in the gate.relations package.
A relation, as described by the gate.relations.Relation interface, is defined by the following values:
- id: a unique ID that identifies the relation. IDs for both relations and annotations are generated from the same source, guaranteeing that the ID is unique not only among the relations but also among all annotations from the same document.
- type: a String value describing the type of the relation (e.g. ‘coref’ for co-reference relations).
- members: an int[] array, containing the annotation IDs for the annotations referred to by the relation. Note that relations are not guaranteed to be symmetric, so the ordering in the members array is relevant.
- featureMap: a FeatureMap that, as with Annotations, allows the storing of an arbitrary set of features for the relation.
- userData: an optional Serializable value, which can be used to associate any arbitrary data with a relation.
Relation sets are modelled by the gate.relations.RelationSet class. The principal API calls published by this class include:
- public Relation addRelation(String type, int... members)
  Creates a new relation with the specified type and member annotations. Returns the newly created relation object.
- public void addRelation(Relation rel)
  Adds to this relation set an externally-created relation. This method is provided to support the use of custom implementations of the gate.relations.Relation interface.
- public boolean deleteRelation(Relation relation)
  Deletes the specified relation from this relation set. Any relations which include this relation as a member will also be deleted (recursively) to ensure the set remains internally consistent.
- public Collection<Relation> get()
  Returns all the relations within this set.
- public Relation get(Integer id)
  Returns the relation with the given ID.
- public Collection<Relation> getRelations(String type)
  Gets all relations with the specified type contained in this relation set.
- public Collection<Relation> getRelations(int... members)
  Gets relations by members. Gets all relations which have the specified members at the specified positions. The required members are represented as an int[], where each required annotation ID is placed at its required position. For unconstrained positions, the constant value gate.relations.RelationSet.ANY should be used.
- public Collection<Relation> getRelations(String type, int... members)
  Gets all relations with the specified type and members.
- public Collection<Relation> getReferencing(int id)
  Gets all the relations which reference an annotation or relation with the specified ID.
- public int getMaximumArity()
  Gets the maximum arity (number of members) for all relations in this relation set.
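The positional matching performed by getRelations(int... members) can be illustrated without GATE. In the sketch below a relation is reduced to its int[] of member IDs, and ANY is a hypothetical wildcard marker standing in for gate.relations.RelationSet.ANY (its value here is an assumption). Treating a pattern that is longer than a relation's member array as a non-match is also a choice of this sketch, not documented GATE behaviour.

```java
import java.util.ArrayList;
import java.util.List;

class RelationMatchSketch {

  // Hypothetical wildcard marker for an unconstrained position.
  static final int ANY = -1;

  // True if members holds the required ID at every constrained
  // position of the pattern.
  static boolean matches(int[] members, int... pattern) {
    if (pattern.length > members.length) {
      return false;
    }
    for (int i = 0; i < pattern.length; i++) {
      if (pattern[i] != ANY && pattern[i] != members[i]) {
        return false;
      }
    }
    return true;
  }

  // All member arrays that match the pattern, in their original order.
  static List<int[]> select(List<int[]> relations, int... pattern) {
    List<int[]> result = new ArrayList<>();
    for (int[] members : relations) {
      if (matches(members, pattern)) {
        result.add(members);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<int[]> relations = new ArrayList<>();
    relations.add(new int[] {1, 10});
    relations.add(new int[] {2, 10});
    relations.add(new int[] {1, 20});

    // Every relation whose second member is the annotation with ID 10.
    System.out.println(select(relations, ANY, 10).size());
  }
}
```

Because relations are ordered, select(relations, ANY, 10) and select(relations, 10, ANY) ask different questions.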
Included next is a simple code snippet that illustrates the RelationSet API. The function of the example code is to:
- find all the Sentence annotations inside a document;
- for each sentence, find all the contained Token annotations;
- for each sentence and contained token, add a new relation named contained between the token and the sentence.
Document doc = Factory.newDocument(
  new File("documents/file.xml").toURI().toURL());
// get the annotation set
AnnotationSet annSet = doc.getAnnotations();
// get the relations set
RelationSet relSet = annSet.getRelations();
// get all sentences
AnnotationSet sentences = annSet.get(
  ANNIEConstants.SENTENCE_ANNOTATION_TYPE);
for(Annotation sentence : sentences) {
  // get all the tokens
  AnnotationSet tokens = annSet.get(
    ANNIEConstants.TOKEN_ANNOTATION_TYPE,
    sentence.getStartNode().getOffset(),
    sentence.getEndNode().getOffset());
  for(Annotation token : tokens) {
    // for each sentence and token, add the contained relation
    relSet.addRelation("contained",
      new int[] {token.getId(), sentence.getId()});
  }
}
7.8 Duplicating a Resource
Sometimes, particularly in a multi-threaded application, it is useful to be able to create an independent copy of an existing PR, controller or LR. The obvious way to do this is to call createResource again, passing the same class name, parameters, features and name, and for many resources this will do the right thing. However there are some resources for which this may be insufficient (e.g. controllers, which also need to duplicate their PRs), unsafe (if a PR uses temporary files, for instance), or simply inefficient. For example for a large gazetteer this would involve loading a second copy of the lists into memory and compiling them into a second identical state machine representation, but a much more efficient way to achieve the same behaviour would be to use a SharedDefaultGazetteer (see section 13.10), which can re-use the existing state machine.
The GATE Factory provides a duplicate method which takes an existing resource instance and creates and returns an independent copy of the resource. By default it uses the algorithm described above, extracting the parameter values from the template resource and calling createResource to create a duplicate (the actual algorithm is slightly more complicated than this, see the following section). However, if a particular resource type knows of a better way to duplicate itself it can implement the CustomDuplication interface, and provide its own duplicate method which the factory will use instead of performing the default duplication algorithm. A caller who needs to duplicate an existing resource can simply call Factory.duplicate to obtain a copy, which will be constructed in the appropriate way depending on the resource type.
Note that the duplicate object returned by Factory.duplicate will not necessarily be of the same class as the original object. However the contract of Factory.duplicate specifies that where the original object implements any of a list of core GATE interfaces, the duplicate can be assumed to implement the same ones – if you duplicate a DefaultGazetteer the result may not be an instance of DefaultGazetteer but it is guaranteed to implement the Gazetteer interface.
Full details of how to implement a custom duplicate method in your own resource type can be found in the JavaDoc documentation for the CustomDuplication interface and the Factory.duplicate method.
7.8.1 Sharable properties
The @Sharable annotation (in the gate.creole.metadata package) provides a way for a resource to mark JavaBean properties whose values should be shared between a resource and its duplicates. Typical examples of objects that could be marked sharable include large or expensive-to-create data structures that are created by a resource at init time and subsequently used in a read-only fashion, a thread-safe cache of some sort, or state used to create globally unique identifiers (such as an AtomicInteger that is incremented each time a new ID is required). Clearly any objects that are shared between different resource instances must be accessed by all instances in a way that is thread-safe or appropriately synchronized.
The sharable property must have the standard public getter and setter methods, with the @Sharable annotation applied to the setter. The same setter may be marked both as a sharable property and as a @CreoleParameter but the two are not related – sharable properties that are not parameters and parameters that are not sharable are both allowed and both have uses in different circumstances. The use of sharable properties removes the need to implement custom duplication in many simple cases.
The default duplication algorithm in full is thus as follows:
- Extract the values of all init-time parameters from the original resource.
- Recursively duplicate any of these values that are themselves GATE Resources, except for parameters that are marked as @Sharable (i.e. parameters that are marked sharable are copied directly to the duplicate resource without being duplicated themselves).
- Add to this parameter map any other sharable properties of the original resource (including those that are not parameters).
- Extract the features of the original resource and recursively duplicate any values in this map that are themselves resources, as above.
- Call Factory.createResource passing the class name of the original resource, the duplicated/shared parameters and the duplicated features.
  - this will result in a call to the new resource’s init method, with all sharable properties (parameters and non-parameters) populated with their values from the old resource. The init method must recognise this and adapt its behaviour appropriately, i.e. not re-creating sharable data structures that have already been injected.
- If the original resource is a PR, extract its runtime parameter values (except those that are marked as sharable, which have already been dealt with above), and recursively duplicate any resource values in the map.
- Set the resulting runtime parameter values on the duplicate resource.
The duplication process keeps track of any recursively-duplicated resources, such that if the same original resource is used in several places (e.g. when duplicating a controller with several JAPE transducer PRs that all refer to the same ontology LR in their runtime parameters) then the same duplicate (ontology) will be used in the same places in the duplicated resource (i.e. all the duplicate transducers will refer to the same ontology LR, which will be a duplicate of the original one).
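The bookkeeping described above (duplicate nested resources recursively, copy sharable values by reference, and reuse one duplicate wherever the same original appears) can be sketched with an identity-based map. This is a simplified model and not GATE's implementation: Res is a stand-in resource, init-time parameters, runtime parameters and features are collapsed into a single params map, and all names are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class DuplicationSketch {

  // Stand-in for a resource: named parameters, some marked sharable.
  static final class Res {
    final String name;
    final Map<String, Object> params = new LinkedHashMap<>();
    final Set<String> sharable = new HashSet<>();

    Res(String name) {
      this.name = name;
    }
  }

  // Duplicates a resource, tracking originals that already have a copy.
  static Res duplicate(Res original, Map<Res, Res> done) {
    Res existing = done.get(original);
    if (existing != null) {
      return existing; // same original always maps to the same duplicate
    }
    Res copy = new Res(original.name);
    done.put(original, copy);
    copy.sharable.addAll(original.sharable);
    for (Map.Entry<String, Object> e : original.params.entrySet()) {
      Object value = e.getValue();
      boolean shared = original.sharable.contains(e.getKey());
      if (value instanceof Res && !shared) {
        value = duplicate((Res) value, done); // recurse into resources
      }
      copy.params.put(e.getKey(), value); // shared values copied as-is
    }
    return copy;
  }

  static Res duplicate(Res original) {
    return duplicate(original, new IdentityHashMap<>());
  }

  public static void main(String[] args) {
    Res ontology = new Res("ontology");
    Res first = new Res("transducer-1");
    Res second = new Res("transducer-2");
    first.params.put("ontology", ontology);
    second.params.put("ontology", ontology);
    Res controller = new Res("controller");
    controller.params.put("pr1", first);
    controller.params.put("pr2", second);

    Res copy = duplicate(controller);
    Res copy1 = (Res) copy.params.get("pr1");
    Res copy2 = (Res) copy.params.get("pr2");
    // Both duplicated transducers refer to one duplicated ontology.
    System.out.println(
        copy1.params.get("ontology") == copy2.params.get("ontology"));
  }
}
```

An IdentityHashMap is used so that two distinct resources are never treated as the same original, whatever their equals methods say.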
7.9 Persistent Applications
GATE Embedded allows the persistent storage of applications in a format based on XML serialisation. This is particularly useful for application management and distribution. A developer can save the state of an application when he/she stops working on its design and continue developing it in a later session. When the application reaches maturity it can be deployed to the client site using the same method.
When an application (i.e. a Controller) is saved, GATE will actually only save the values for the parameters used to create the Processing Resources that are contained in the application. When the application is reloaded, all the PRs will be re-created using the saved parameters.
Many PRs use external resources (files) to define their behaviour and, in most cases, these files are identified using URLs. During the saving process, all the URLs are converted to relative URLs based on the location of the application file. This way, if the resources are packaged together with the application file, the entire application can be reliably moved to a different location.
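The idea behind storing relative URLs can be shown with java.net.URI alone. This is only an illustration of the principle, not the conversion GATE performs: the RelativeUrlSketch class, its method names and the example paths are assumptions, and URI.relativize only handles resources located below the application directory.

```java
import java.net.URI;

class RelativeUrlSketch {

  // Expresses a resource location relative to the application directory.
  static String toRelative(URI applicationDir, URI resource) {
    return applicationDir.relativize(resource).toString();
  }

  // Resolves a stored relative location against a (possibly new)
  // application directory.
  static URI restore(URI applicationDir, String relative) {
    return applicationDir.resolve(relative);
  }

  public static void main(String[] args) {
    // Hypothetical locations, used only for illustration.
    URI savedFrom = URI.create("file:/home/example/app/");
    URI grammar =
        URI.create("file:/home/example/app/resources/rules/main.jape");
    String stored = toRelative(savedFrom, grammar);
    System.out.println(stored);

    // After the whole directory has been moved elsewhere.
    URI loadedFrom = URI.create("file:/opt/deploy/app/");
    System.out.println(restore(loadedFrom, stored));
  }
}
```

Because only the relative part is stored, the restored location follows the application directory wherever it is moved.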
API access to application saving and loading is provided by means of two static methods on the gate.util.persistence.PersistenceManager class, listed in table 7.8.
Method | Purpose |
public static void saveObjectToFile(Object obj, File file) | Saves the data needed to re-create the provided GATE object to the specified file. The Object provided can be any type of Language or Processing Resource or a Controller. The procedures may work for other types of objects as well (e.g. it supports most Collection types). |
public static Object loadObjectFromFile(File file) | Parses the file specified (which needs to be a file created by the above method) and creates the necessary object(s) as specified by the data in the file. Returns the root of the object tree. |
Saving and loading a GATE application
// Where to save the application?
File file = ...;
// What to save?
Controller theApplication = ...;

// save
gate.util.persistence.PersistenceManager.
  saveObjectToFile(theApplication, file);
// delete the application
Factory.deleteResource(theApplication);
theApplication = null;

[...]
// load the application back; the loader returns Object, so cast it
theApplication = (Controller) gate.util.persistence.PersistenceManager.
  loadObjectFromFile(file);
7.10 Ontologies
Starting from GATE version 3.1, support for ontologies has been added. Ontologies are nominally Language Resources but are quite different from documents and corpora and are detailed in chapter 14.
Classes related to ontologies are to be found in the gate.creole.ontology package and its sub-packages. The top level package defines an abstract API for working with ontologies while the sub-packages contain concrete implementations. A client program should only use the classes and methods defined in the API and never any of the classes or methods from the implementation packages.
The entry point to the ontology API is the gate.creole.ontology.Ontology interface which is the base interface for all concrete implementations. It provides methods for accessing the class hierarchy, listing the instances and the properties.
Ontology implementations are available through plugins. Before an ontology language resource can be created using the gate.Factory and before any of the classes and methods in the API can be used, one of the implementing ontology plugins must be loaded. For details see chapter 14.
7.11 Loading Annotation Schemas
In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one must use the gate.Factory class:
FeatureMap params = Factory.newFeatureMap();
params.put("xmlFileUrl", annotSchemaFile.toURL());
AnnotationSchema annotSchema = (AnnotationSchema)
    Factory.createResource("gate.creole.AnnotationSchema", params);
Note: all element names and their values must be written in lower case, because XML is case sensitive and so is the parser that GATE uses for XML Schema.
When writing XML Schema definitions, you can use the schemas supplied with GATE (resources/creole/schema) as a model, or consult http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements used.
Some examples of annotation schemas are given in Section 5.4.1.
7.12 Creating a New CREOLE Resource
To create a new resource you need to:
-
write a Java class that implements GATE’s beans model;
-
annotate the class with the necessary CREOLE metadata;
-
compile the class, and any others that it uses, into a Java Archive (JAR) file, including a creole.xml file to identify the JAR as a plugin;
-
tell GATE how to find the JAR.
The recommended way to build GATE plugins from version 8.5 onwards is to use the Apache Maven build tool. A JAR file requires certain specific contents in order to be a valid GATE plugin, and GATE provides tools to automate the creation of these as part of a Maven build. For best results you should use Maven 3.5.2 or later.
GATE provides a Maven archetype to create the skeleton of a new plugin including an example AbstractLanguageAnalyser processing resource you can use as a starting point for your own code. To create a new plugin project from the archetype, run the following Maven command (which has been split over several lines for clarity, but should be run as a single command):
mvn archetype:generate -DarchetypeGroupId=uk.ac.gate \
    -DarchetypeArtifactId=gate-pr-archetype \
    -DarchetypeVersion=8.6
Replace “8.6” with the version of gate-core that you wish to depend on. You will be prompted for several values by Maven:
-
groupId
-
the group ID to use in the generated project POM. In Maven terms a “group” is a set of related JARs maintained and released by the same developer or group – conventionally this follows the same scheme as Java package names, using a reversed form of a DNS domain you own. You can use any value you like here, except that you should not use a group ID starting uk.ac.gate, as that is reserved for core plugins from the GATE team.
-
artifactId
-
the artifact ID for the generated project POM – this will be used as the directory name for the new project on disk and as the first part of the name of the final JAR file.
-
version
-
the initial version number for your new plugin – this should always end with -SNAPSHOT in capital letters, which is a Maven convention denoting work-in-progress code where the same version number can refer to different JAR files over time. The Maven dependency mechanism assumes that only -SNAPSHOT versions can ever change, and JAR files for non-SNAPSHOT versions are immutable and can be cached forever.
-
package
-
the Java package name. Often this is the same as the group ID but this is not strictly required.
-
prClass
-
the class name of the PR class to generate – this must be a valid Java identifier.
-
prName
-
the name of the PR as it will appear to users in the GATE Developer GUI (e.g. in the “new processing resource” popup menu).
Alternatively you can specify any of these values as extra -D options to archetype:generate, e.g. -DprClass=GoldfishTagger.
The archetype will create a new directory named after the artifactId, containing a few files:
-
pom.xml
-
the Maven project descriptor controlling the build process
-
src/main/java/package/prClass.java
-
the PR Java class.
-
src/main/resources/creole.xml
-
the plugin descriptor that identifies this project as a GATE plugin.
-
src/main/resources/resources
-
a directory into which you should put any resource files that your PR requires (e.g. configuration files, JAPE grammars, etc.). The doubled “resources” is deliberate – src/main/resources is the Maven conventional location for non-Java files that should be packaged in the JAR, and GATE requires a folder called resources inside that.
-
src/test
-
some simple tests.
The generated Java class in src/main/java contains some basic CREOLE metadata and an example of how you can configure parameters, and some boilerplate initialization and execution code that you can modify to your requirements.
There is an alternative archetype available called gate-plugin-archetype, which creates the Maven project structure, POM file and creole.xml but not the example Java class. This is useful if you already have an existing CREOLE plugin from an earlier version of GATE that you want to convert to the Maven style. The process is exactly the same as described above: use the same mvn archetype:generate call as before but with -DarchetypeArtifactId=gate-plugin-archetype.
7.12.1 Dependencies
If you need to use other Java libraries in your PR code you should declare them in the <dependencies> block of the pom.xml. You can use https://search.maven.org to find the appropriate XML snippet for each dependency.
If your plugin requires another GATE plugin to operate (for example if it needs to internally create a JAPE transducer PR) then you should declare a dependency on the relevant plugin in src/main/resources/creole.xml (see section 4.7, in particular the REQUIRES element) and GATE will ensure that the other plugin is always loaded before this one, and that this plugin is unloaded whenever the other one is unloaded.
If your plugin has a compile-time dependency on another plugin then you will need to declare this in pom.xml as well as in creole.xml – the pom dependency should use “provided” scope:
<dependency>
  <groupId>uk.ac.gate.plugins</groupId>
  <artifactId>annie</artifactId>
  <version>8.5</version>
  <scope>provided</scope>
</dependency>
Note that such dependencies are very rarely required, typically only if you need to write a PR class in one plugin that extends (in the Java sense) a PR defined in another plugin. If you simply need to run another plugin’s PR as part of yours then the creole.xml dependency is sufficient as you would create and use the PR via the Factory in the normal way.
// grammarLocation is assumed to be a parameter
// of this PR and is of type ResourceReference
FeatureMap params = Utils.featureMap("grammarURL", grammarLocation);
LanguageAnalyser jape = (LanguageAnalyser)Factory.createResource(
    "gate.creole.Transducer", params);
One of the tests created by the archetypes, the GappLoadingTest, will look for any saved application files in src/main/resources and test that they load successfully into GATE. As a side effect, this test will also create two files in the target folder detailing all the other plugins on which this plugin depends. It captures both direct dependencies (REQUIRES entries in creole.xml) and indirect dependencies where other plugins are loaded by one of this plugin’s saved applications, even if there is no hard dependency between them. For example, many plugins have sample applications that require the ANNIE plugin in order to load document reset, tokeniser or JAPE transducer PRs. The information is presented in two ways:
-
a flat file creole-dependencies.txt listing the plugins with the plugin under test on the first row and then other required plugins in the order they were loaded during the GappLoadingTest.
-
a representation of the dependency graph in the GraphViz DOT format (creole-dependencies.gv) with a node for each plugin and an edge for each dependency, coloured red for REQUIRES links and coloured green for dependencies only expressed by the sample saved applications.
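To make the shape of the DOT output concrete, the following sketch builds a small graph with one node per plugin and one coloured edge per dependency, as described above. It is not the code used by GappLoadingTest: the class DependencyGraphSketch, the plugin names and the exact layout of the generated text are assumptions made for illustration, and the real creole-dependencies.gv file may differ in detail.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: writes a plugin dependency graph in GraphViz DOT syntax. */
class DependencyGraphSketch {

    /** One dependency between two plugins (names used here are purely illustrative). */
    static class Edge {
        final String from;
        final String to;
        final boolean requires; // true: REQUIRES entry; false: only implied by a saved application

        Edge(String from, String to, boolean requires) {
            this.from = from;
            this.to = to;
            this.requires = requires;
        }
    }

    /** Emits one node per plugin and one edge per dependency, red for REQUIRES, green otherwise. */
    static String toDot(List<String> plugins, List<Edge> edges) {
        StringBuilder sb = new StringBuilder("digraph dependencies {\n");
        for (String plugin : plugins) {
            sb.append("  \"").append(plugin).append("\";\n");
        }
        for (Edge e : edges) {
            sb.append("  \"").append(e.from).append("\" -> \"").append(e.to)
              .append("\" [color=").append(e.requires ? "red" : "green").append("];\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        List<String> plugins = Arrays.asList("my-plugin", "annie", "tools");
        List<Edge> edges = Arrays.asList(
            new Edge("my-plugin", "annie", true),    // declared in creole.xml
            new Edge("my-plugin", "tools", false));  // only loaded by a sample application
        System.out.print(toDot(plugins, edges));
    }
}
```

A file in this syntax can be rendered with the GraphViz dot tool to inspect the dependencies visually.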
7.13 Adding Support for a New Document Format
To add a new document format, extend the gate.DocumentFormat class and implement its abstract unpackMarkup(Document doc) method.
This method implements the functionality of the format reader and creates annotations on the document. When it finishes, the document’s old content is replaced with new content containing only the text between the markup elements.
To add a new textual reader, extend gate.corpora.TextualDocumentFormat and override the unpackMarkup(doc) method.
The class must follow the Java beans conventions because GATE will instantiate it using the Factory.createResource() method.
The init() method that you add is very important, because this is where the reader registers the information that allows GATE to select it. To do this, add the relevant entries to the static maps defined in the DocumentFormat class, which are consulted at reader detection time.
After that, place a definition of the reader in your creole.xml file and the reader will be available to GATE.
The rest of this section presents a complete step-by-step example of adding such a reader, in this case an XML reader.
Step 1
Create a new class called XmlDocumentFormat that extends gate.corpora.TextualDocumentFormat and add appropriate CREOLE metadata. For example:
@CreoleResource(
    autoinstances = {@AutoInstance(hidden = true)})
public class XmlDocumentFormat extends TextualDocumentFormat {

}
Step 2
Implement the unpackMarkup(Document doc) method, which performs the required functionality for the reader. Add the XML detection logic to the init() method:
public Resource init() throws ResourceInstantiationException {
  // Register XML mime type
  MimeType mime = new MimeType("text","xml");
  // Register the class handler for this mime type
  mimeString2ClassHandlerMap.put(mime.getType()+ "/" + mime.getSubtype(),
      this);
  // Register the mime type with its mime string
  mimeString2mimeTypeMap.put(mime.getType() + "/" + mime.getSubtype(),
      mime);
  // Register file suffixes for this mime type
  suffixes2mimeTypeMap.put("xml",mime);
  suffixes2mimeTypeMap.put("xhtm",mime);
  suffixes2mimeTypeMap.put("xhtml",mime);
  // Register magic numbers for this mime type
  magic2mimeTypeMap.put("<?xml",mime);
  // Set the mimeType for this language resource
  setMimeType(mime);
  return this;
}// init()
More details about the information held in those maps can be found in Section 5.5.1.
More information on the operation of GATE’s document format analysers may be found in Section 5.5.
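To show how registrations like those in Step 2 could be consulted when a document is loaded, here is a minimal, self-contained sketch. It does not use GATE's DocumentFormat class: FormatDetectionSketch and its two maps are hypothetical stand-ins, MIME types are represented as plain strings, and the lookup order (file suffix first, then magic number) is an assumption made for illustration. Section 5.5.1 describes the behaviour GATE actually implements.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: suffix and magic-number lookup tables for format detection. */
class FormatDetectionSketch {
    private final Map<String, String> suffixToMime = new LinkedHashMap<>();
    private final Map<String, String> magicToMime = new LinkedHashMap<>();

    void registerSuffix(String suffix, String mime) {
        suffixToMime.put(suffix.toLowerCase(Locale.ROOT), mime);
    }

    void registerMagic(String magic, String mime) {
        magicToMime.put(magic, mime);
    }

    /** Returns the detected MIME type string, or null if nothing matches. */
    String detect(String fileName, String contentStart) {
        // Assumed order: try the file suffix first...
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0) {
            String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String mime = suffixToMime.get(suffix);
            if (mime != null) {
                return mime;
            }
        }
        // ...then fall back to a magic string at the start of the content.
        for (Map.Entry<String, String> entry : magicToMime.entrySet()) {
            if (contentStart.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FormatDetectionSketch detector = new FormatDetectionSketch();
        detector.registerSuffix("xml", "text/xml");
        detector.registerSuffix("xhtml", "text/xml");
        detector.registerMagic("<?xml", "text/xml");

        System.out.println(detector.detect("report.xml", ""));                       // text/xml
        System.out.println(detector.detect("download", "<?xml version=\"1.0\"?>"));  // text/xml
        System.out.println(detector.detect("notes.txt", "plain text"));              // null
    }
}
```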
7.14 Using GATE Embedded in a Multithreaded Environment
GATE Embedded can be used in multithreaded applications, so long as you observe a few restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your application, typically in the application startup phase before any concurrent processing threads are started.
Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading plugins) in more than one thread at a time. Again, you would typically load all the plugins your application requires at initialisation time. It is safe to create instances of resources in multiple threads concurrently.
Thirdly, it is important to note that individual GATE processing resources, language resources and controllers are by design not thread safe – it is not possible to use a single instance of a controller/PR/LR in multiple threads at the same time – but for a well written resource it should be possible to use several different instances of the same resource at once, each in a different thread. When writing your own resource classes you should bear the following in mind, to ensure that your resource will be useable in this way.
-
Avoid static data. Where possible, you should avoid using static fields in your class, and you should try and take all configuration data via the CREOLE parameters you declare in your creole.xml file. System properties may be appropriate for truly static configuration, such as the location of an external executable, but even then it is generally better to stick to CREOLE parameters – a user may wish to use two different instances of your PR, each talking to a different executable.
-
Read parameters at the correct time. Init-time parameters should be read in the init() (and reInit()) method, and for processing resources runtime parameters should be read at each execute().
-
Use temporary files correctly. If your resource makes use of external temporary files you should create them using File.createTempFile() at init or execute time, as appropriate. Do not use hardcoded file names for temporary files.
-
If there are objects that can be shared between different instances of your resource, make sure these objects are accessed either read-only, or in a thread-safe way. In particular you must be very careful if your resource can take other resource instances as init or runtime parameters (e.g. the Flexible Gazetteer, Section 13.6).
Of course, if you are writing a PR that is simply a wrapper around an external library that imposes these kinds of limitations there is only so much you can do. If your resource cannot be made safe you should document this fact clearly.
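The advice on temporary files above can be followed with standard JDK calls. The sketch below is independent of GATE; the class TempFileSketch and the "my-pr-" prefix are hypothetical. Each call obtains a fresh, uniquely named file from File.createTempFile(), so two instances of a resource running in different threads never write to the same path.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Illustrative only: per-call temporary files instead of a hardcoded file name. */
class TempFileSketch {

    /** Writes the text to a new temporary file that is unique to this call. */
    static File writeToTempFile(String text) {
        try {
            File tmp = File.createTempFile("my-pr-", ".txt"); // unique name chosen by the JDK
            tmp.deleteOnExit();                               // clean up when the JVM exits
            Files.write(tmp.toPath(), text.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back as a UTF-8 string. */
    static String read(File file) {
        try {
            return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File first = writeToTempFile("input for instance one");
        File second = writeToTempFile("input for instance two");
        // The two files have different names, so the instances cannot clash.
        System.out.println(!first.equals(second));
        System.out.println(read(first));
    }
}
```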
All the standard ANNIE PRs are safe when independent instances are used in different threads concurrently, as are the standard transient document, transient corpus and controller classes. A typical pattern of development for a multithreaded GATE-based application is:
-
Develop your GATE processing pipeline in GATE Developer.
-
Save your pipeline as a .gapp file.
-
In your application’s initialisation phase, load n copies of the pipeline using PersistenceManager.loadObjectFromFile() (see the Javadoc documentation for details), or load the pipeline once and then make copies of it using Factory.duplicate as described in section 7.8, and either give one copy to each thread or store them in a pool (e.g. a LinkedList).
-
When you need to process a text, get one copy of the pipeline from the pool, and return it to the pool when you have finished processing.
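The pooling pattern in the steps above can be sketched with a blocking queue from the JDK. To keep the example self-contained it does not load a real pipeline: FakePipeline is a hypothetical stand-in for one loaded copy of your application (in a real program each copy would come from PersistenceManager.loadObjectFromFile() or Factory.duplicate), and it deliberately fails if two threads ever use the same instance at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative only: a fixed pool of pipeline copies shared by many threads. */
class PipelinePoolSketch {

    /** Hypothetical stand-in for one loaded pipeline copy; not thread safe by design. */
    static class FakePipeline {
        private final AtomicInteger activeCalls = new AtomicInteger();

        String process(String text) {
            // Fail loudly if two threads ever use this instance at the same time.
            if (activeCalls.incrementAndGet() != 1) {
                throw new IllegalStateException("pipeline used by two threads at once");
            }
            try {
                return text.toUpperCase();
            } finally {
                activeCalls.decrementAndGet();
            }
        }
    }

    /** Fixed-size pool: callers block until a copy is free. */
    static class Pool {
        private final BlockingQueue<FakePipeline> free;

        Pool(int copies, Supplier<FakePipeline> loader) {
            free = new ArrayBlockingQueue<>(copies);
            for (int i = 0; i < copies; i++) {
                free.add(loader.get());
            }
        }

        String process(String text) throws InterruptedException {
            FakePipeline pipeline = free.take(); // get one copy from the pool
            try {
                return pipeline.process(text);
            } finally {
                free.add(pipeline);              // always return it, even on failure
            }
        }
    }

    /** Runs the tasks on 8 threads over a pool of 2 copies; returns how many gave the right result. */
    static int runDemo(int tasks) {
        Pool pool = new Pool(2, FakePipeline::new);
        ExecutorService executor = Executors.newFixedThreadPool(8);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                String text = "doc" + i;
                results.add(executor.submit(() -> pool.process(text)));
            }
            int ok = 0;
            for (int i = 0; i < tasks; i++) {
                if (results.get(i).get().equals("DOC" + i)) {
                    ok++;
                }
            }
            return ok;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo(100) + " documents processed");
    }
}
```

Returning the copy in a finally block is the important design choice: a pipeline that is not returned after an exception would shrink the pool permanently.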
Alternatively you can use the Spring Framework as described in the next section to handle the pooling for you.
7.15 Using GATE Embedded within a Spring Application
GATE Embedded provides helper classes to allow GATE resources to be created and managed by the Spring framework. These helpers are provided by the gate-spring module, which must be added as a dependency of your project (and which in turn depends on gate-core). To use the helpers in an XML bean definition file, add the following declarations to the top:
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:gate="http://gate.ac.uk/ns/spring"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans
         http://www.springframework.org/schema/beans/spring-beans.xsd
         http://gate.ac.uk/ns/spring
         http://gate.ac.uk/ns/spring.xsd">
You can have Spring initialise GATE:
<gate:init />
For backwards compatibility the <gate:init> element accepts a number of attributes which were used in earlier versions of GATE to specify paths to GATE’s “home” folder and configuration files, but as of GATE 8.5 these options do nothing by default. If you do want to load a user configuration file (for example to configure things like the “add space on markup unpack” feature) then you must explicitly turn off the sandbox mode:
<gate:init run-in-sandbox="false" user-config-file="WEB-INF/user.xml" />
The user-config-file location is interpreted as a Spring “resource” path. If the value is not an absolute URL then Spring will resolve the path in an appropriate way for the type of application context: in a web application it is taken as being relative to the web app root, and you would typically use a location within WEB-INF as shown in the example above. To use an absolute path it is not sufficient to use a leading slash (e.g. /opt/myapp/user.xml), because for backwards-compatibility reasons Spring will still resolve this relative to your web application. Instead you must specify it as a full URL, i.e. file:/opt/myapp/user.xml.
You can specify CREOLE plugins that should be loaded after GATE has initialised using <gate:extra-plugin> elements, for example:
<gate:init />
<!-- load the standard ANNIE plugin from Maven Central -->
<gate:extra-plugin group-id="uk.ac.gate.plugins" artifact-id="annie"
                   version="8.5" />
<!-- load a custom directory-based plugin from inside the webapp -->
<gate:extra-plugin>WEB-INF/plugins/FishCounter</gate:extra-plugin>
The usual rules apply for the resolution of Maven plugins – GATE will look in .m2/repository under the home directory of the current user, as well as in the Central repository and the GATE team repository online, plus any repositories configured in the current user’s .m2/settings.xml. As well as this you can specify a local “cache” directory which is a Maven repository that will be searched first before trying any remote repositories, as part of the <gate:init> element:
<gate:init>
  <gate:maven-caches>
    <value>WEB-INF/maven-cache</value>
  </gate:maven-caches>
</gate:init>
Note that due to restrictions within the Maven resolver this must be a real directory on disk, so in the web application case if you put a cache inside your WAR file it will only be used if the WAR is unpacked by the container, not if it attempts to run the application directly from the compressed WAR.
To create a GATE resource, use the <gate:resource> element.
<gate:resource id="referenceDocument" scope="singleton"
               resource-class="gate.corpora.DocumentImpl">
  <gate:parameters>
    <entry key="sourceUrl">
      <gate:url>WEB-INF/reference.xml</gate:url>
    </entry>
  </gate:parameters>
  <gate:features>
    <entry key="documentVersion" value="0.1.3" />
    <entry key="mainRef">
      <value type="java.lang.Boolean">true</value>
    </entry>
  </gate:features>
</gate:resource>
The children of <gate:parameters> are Spring <entry/> elements, just as you would write when configuring a bean property of type Map<String,Object>. <gate:url> provides a way to construct a java.net.URL from a resource path as discussed above. If it is possible to resolve the resource path as a file: URL then this form will be preferred, as there are a number of areas within GATE which work better with file: URLs than with other types of URL (for example plugins that run external processes, or that use a URL parameter to point to a directory in which they will create new files).
A note about types: The <gate:parameters> and <gate:features> elements define GATE FeatureMaps. When using the simple <entry key="..." value="..." /> form, the entry values will be treated as strings; Spring can convert strings into many other types of object using the standard Java Beans property editor mechanism, but since a FeatureMap can hold any kind of values you must use an explicit <value type="...">...</value> to tell Spring what type the value should be.
There is an additional twist for <gate:parameters> – GATE has its own internal logic to convert strings to other types required for resource parameters (see the discussion of default parameter values in section 4.7.1). So for parameter values you have a choice: you can either use an explicit <value type="..."> to make Spring do the conversion, or you can pass the parameter value as a string and let GATE do the conversion. For resource parameters whose type is gate.creole.ResourceReference, if you pass a string value that is not an absolute URL (starting file:, http:, etc.) then GATE will treat the string as a path relative to the plugin that defines the resource type whose parameter you are setting. If this is not what you intended then you should use <gate:url> to cause Spring to resolve the path to a URL (which GATE will then convert to a ResourceReference) before passing it to GATE. For example, for a JAPE transducer, <entry key="grammarURL" value="grammars/main.jape" /> would resolve to the resource reference creole://uk.ac.gate.plugins;annie;8.5/grammars/main.jape, whereas
<entry key="grammarURL">
  <gate:url>grammars/main.jape</gate:url>
</entry>
would resolve to file:/path/to/webapp/grammars/main.jape.
You can load a GATE saved application with
<gate:saved-application location="WEB-INF/application.gapp" scope="prototype">
  <gate:customisers>
    <gate:set-parameter pr-name="custom transducer" name="ontology"
                        ref="sharedOntology" />
  </gate:customisers>
</gate:saved-application>
‘Customisers’ are used to customise the application after it is loaded. In the example above, we assume we have loaded a singleton copy of an ontology which is then shared between all the separate instances of the (prototype) application. The <gate:set-parameter> customiser accepts all the same ways to provide a value as the standard Spring <property> element (a "value" or "ref" attribute, or a sub-element - <value>, <list>, <bean>, <gate:resource> …).
The <gate:add-pr> customiser provides support for the case where most of the application is in a saved state, but we want to create one or two extra PRs with Spring (maybe to inject other Spring beans as init parameters) and add them to the pipeline.
<gate:saved-application ...>
  <gate:customisers>
    <gate:add-pr add-before="OrthoMatcher" ref="myPr" />
  </gate:customisers>
</gate:saved-application>
By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline, but an add-before or add-after attribute can be used to specify the name of a PR before (or after) which this PR should be placed. Alternatively, an index attribute places the PR at a specific (0-based) index into the pipeline. The PR to add can be specified either as a ‘ref’ attribute, or with a nested <bean> or <gate:resource> element.
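The four placement options can be pictured as operations on an ordered list of PR names. The sketch below is not the gate-spring implementation: AddPrPlacementSketch and the PR names are hypothetical, pipelines are modelled as plain lists of strings, and throwing an exception when the named PR is missing is an assumption, since the text does not say how that case is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: where a new PR ends up for each placement option. */
class AddPrPlacementSketch {

    /** Default behaviour: append at the end of the pipeline. */
    static List<String> addLast(List<String> pipeline, String pr) {
        List<String> result = new ArrayList<>(pipeline);
        result.add(pr);
        return result;
    }

    /** add-before: insert immediately before the named PR. */
    static List<String> addBefore(List<String> pipeline, String anchor, String pr) {
        List<String> result = new ArrayList<>(pipeline);
        result.add(indexOfOrFail(result, anchor), pr);
        return result;
    }

    /** add-after: insert immediately after the named PR. */
    static List<String> addAfter(List<String> pipeline, String anchor, String pr) {
        List<String> result = new ArrayList<>(pipeline);
        result.add(indexOfOrFail(result, anchor) + 1, pr);
        return result;
    }

    /** index: insert at a specific 0-based position. */
    static List<String> addAtIndex(List<String> pipeline, int index, String pr) {
        List<String> result = new ArrayList<>(pipeline);
        result.add(index, pr);
        return result;
    }

    // Assumption: an unknown PR name is treated as a configuration error.
    private static int indexOfOrFail(List<String> pipeline, String anchor) {
        int index = pipeline.indexOf(anchor);
        if (index < 0) {
            throw new IllegalArgumentException("No PR named " + anchor);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> pipeline = Arrays.asList("Tokeniser", "Gazetteer", "OrthoMatcher");
        System.out.println(addLast(pipeline, "myPr"));                   // [Tokeniser, Gazetteer, OrthoMatcher, myPr]
        System.out.println(addBefore(pipeline, "OrthoMatcher", "myPr")); // [Tokeniser, Gazetteer, myPr, OrthoMatcher]
        System.out.println(addAfter(pipeline, "Tokeniser", "myPr"));     // [Tokeniser, myPr, Gazetteer, OrthoMatcher]
        System.out.println(addAtIndex(pipeline, 0, "myPr"));             // [myPr, Tokeniser, Gazetteer, OrthoMatcher]
    }
}
```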
7.15.1 Duplication in Spring
The above example defines the <gate:saved-application> as a prototype-scoped bean, which means the saved application state will be loaded afresh each time the bean is fetched from the bean factory (either explicitly using getBean or implicitly when it is injected as a dependency of another bean). However in many cases it is better to load the application once and then duplicate it as required (as described in section 7.8), as this allows resources to optimise their memory usage, for example by sharing a single in-memory representation of a large gazetteer list between several instances of the gazetteer PR. This approach is supported by the <gate:duplicate> tag.
<gate:duplicate id="theApp">
  <gate:saved-application location="/WEB-INF/application.xgapp" />
</gate:duplicate>
The <gate:duplicate> tag acts like a prototype bean definition, in that each time it is fetched or injected it will call Factory.duplicate to create a new duplicate of its template resource (declared as a nested element or referenced by the template-ref attribute). However the tag also keeps track of all the duplicate instances it has returned over its lifetime, and will ensure they are released (using Factory.deleteResource) when the Spring context is shut down.
The <gate:duplicate> tag also supports customisers, which will be applied to the newly-created duplicate resource before it is returned. This is subtly different from applying the customisers to the template resource itself, which would cause them to be applied once to the original resource before it is first duplicated.
Finally, <gate:duplicate> takes an optional boolean attribute return-template. If set to false (or omitted, as this is the default behaviour), the tag always returns a duplicate — the original template resource is used only as a template and is not made available for use. If set to true, the first time the bean defined by the tag is injected or fetched, the original template resource is returned. Subsequent uses of the tag will return duplicates. Generally speaking, it is only safe to set return-template="true" when there are no customisers, and when the duplicates will all be created up-front before any of them are used. If the duplicates will be created asynchronously (e.g. with a dynamically expanding pool, see below) then it is possible that, for example, a template application may be duplicated in one thread whilst it is being executed by another thread, which may lead to unpredictable behaviour.
7.15.2 Spring pooling
In a multithreaded application it is vital that individual GATE resources are not used in more than one thread at the same time. Because of this, multithreaded applications that use GATE Embedded often need to use some form of pooling to provide thread-safe access to GATE components. This can be managed by hand, but the Spring framework has built-in tools to support transparent pooling of Spring-managed beans. Spring can create a pool of identical objects, then expose a single “proxy” object (offering the same interface) for use by clients. Each method call on the proxy object will be routed to an available member of the pool in such a way as to guarantee that each member of the pool is accessed by no more than one thread at a time.
Since the pooling is handled at the level of method calls, this approach is not used to create a pool of GATE resources directly — making use of a GATE PR typically involves a sequence of method calls (at least setDocument(doc), execute() and setDocument(null)), and creating a pooling proxy for the resource may result in these calls going to different members of the pool. Instead the typical use of this technique is to define a helper object with a single method that internally calls the GATE API methods in the correct sequence, and then create a pool of these helpers. The interface gate.util.DocumentProcessor and its associated implementation gate.util.LanguageAnalyserDocumentProcessor are useful for this. The DocumentProcessor interface defines a processDocument method that takes a GATE document and performs some processing on it. LanguageAnalyserDocumentProcessor implements this interface using a GATE LanguageAnalyser (such as a saved “corpus pipeline” application) to do the processing. A pool of LanguageAnalyserDocumentProcessor instances can be exposed through a proxy which can then be called from several threads.
The machinery to implement this is all built into Spring, but the configuration typically required to enable it is quite fiddly, involving at least three co-operating bean definitions. Since the technique is so useful with GATE Embedded, GATE provides a special syntax to configure pooling in a simple way.
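The idea of a pooling proxy can be demonstrated without Spring, using the JDK's java.lang.reflect.Proxy. The sketch below is not how Spring or gate-spring implement pooling. TextProcessor is a hypothetical single-method interface playing the role that the text gives to DocumentProcessor, and StatefulProcessor is a hypothetical helper whose one method performs the whole set/execute/clear sequence internally, which is what makes it safe to route each call to a different pool member.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;
import java.util.stream.IntStream;

/** Illustrative only: a proxy that routes every method call to a free pool member. */
class PooledProxySketch {

    /** Hypothetical helper interface with a single, self-contained method. */
    interface TextProcessor {
        String process(String text);
    }

    /** Not thread safe: it keeps per-call state in a field, like a PR holding its document. */
    static class StatefulProcessor implements TextProcessor {
        private String current;

        @Override
        public String process(String text) {
            current = text;                 // analogous to setDocument(doc)
            String result = current.trim(); // analogous to execute()
            current = null;                 // analogous to setDocument(null)
            return result;
        }
    }

    /** Builds a pool of the given size and returns one proxy that fronts it. */
    static TextProcessor pooledProxy(int size, Supplier<TextProcessor> factory) {
        BlockingQueue<TextProcessor> pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(factory.get());
        }
        return (TextProcessor) Proxy.newProxyInstance(
            TextProcessor.class.getClassLoader(),
            new Class<?>[] {TextProcessor.class},
            (proxy, method, args) -> {
                TextProcessor target = pool.take(); // block until a member is free
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause();             // rethrow the target's own exception
                } finally {
                    pool.add(target);               // hand the member back
                }
            });
    }

    /** Calls the proxy from several threads; returns how many calls gave the right result. */
    static long runDemo(int calls) {
        TextProcessor processor = pooledProxy(2, StatefulProcessor::new);
        return IntStream.range(0, calls).parallel()
            .filter(i -> processor.process("  text" + i + "  ").equals("text" + i))
            .count();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(200) + " calls succeeded");
    }
}
```

Because each proxied call borrows and returns a pool member independently, a sequence of separate calls on the proxy could hit different members, which is why the whole sequence is wrapped in a single method.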
To use Spring pooling, you need to add a dependency to your project on an appropriate version of org.apache.commons:commons-pool2 or commons-pool:commons-pool. Now, given the <gate:duplicate id="theApp"> definition from the previous section we can create a DocumentProcessor proxy that can handle up to five concurrent requests as follows:
<bean id="processor" class="gate.util.LanguageAnalyserDocumentProcessor">
  <property name="analyser" ref="theApp" />
  <gate:pooled-proxy max-size="5" />
</bean>
The <gate:pooled-proxy> element decorates a singleton bean definition. It converts the original definition to prototype scope and replaces it with a singleton proxy delegating to a pool of instances of the prototype bean. The pool parameters are controlled by attributes of the <gate:pooled-proxy> element, the most important ones being:
-
max-size
-
The maximum size of the pool. If more than this number of threads try to call methods on the proxy at the same time, the others will (by default) block until an object is returned to the pool.
-
initial-size
-
The default behaviour of Spring’s pooling tools is to create instances in the pool on demand (up to the max-size). This attribute instead causes initial-size instances to be created up-front and added to the pool when it is first created.
-
when-exhausted-action-name
-
What to do when the pool is exhausted (i.e. there are already max-size concurrent calls in progress and another one arrives). Should be set to one of WHEN_EXHAUSTED_BLOCK (the default, meaning block the excess requests until an object becomes free), WHEN_EXHAUSTED_GROW (create a new object anyway, even though this pushes the pool beyond max-size) or WHEN_EXHAUSTED_FAIL (cause the excess calls to fail with an exception).
Any of these attributes can make use of the usual ${...} property placeholder mechanism. Many more options are available, corresponding to the properties of the underlying Spring TargetSource in use (by default, a slightly customised subclass of CommonsPool2TargetSource or CommonsPoolTargetSource, depending on which version of commons-pool you depend on). These allow you, for example, to configure a pool that dynamically grows and shrinks as necessary, releasing objects that have been idle for a set amount of time. See the JavaDoc documentation of CommonsPoolTargetSource (and the documentation for Apache commons-pool) for full details. If you wish to use a different TargetSource implementation from the default you can provide a target-source-class attribute with the fully-qualified class name of the class you wish to use (which must, of course, implement the TargetSource interface).
Note that the <gate:pooled-proxy> technique is not tied to GATE in any way, it is simply an easy way to configure standard Spring beans and can be used with any bean that needs to be pooled, not just objects that make use of GATE.
7.15.3 Further reading
These custom elements all define various factory beans. For full details, see the JavaDocs for the gate-spring module. The main Spring framework API documentation is the best place to look for more detail on the pooling facilities provided by Spring AOP.
7.16 Groovy for GATE
Groovy is a dynamic programming language based on Java. Groovy is not used in the core GATE distribution, so to enable the Groovy features in GATE you must first load the Groovy plugin. Loading this plugin:
-
provides access to the Groovy scripting console (configured with some extensions for GATE) from the GATE Developer “Tools” menu.
-
provides a PR to run a Groovy script over documents.
-
provides a controller which uses a Groovy DSL to define its execution strategy.
-
enhances a number of core GATE classes with additional convenience methods that can be used from any Groovy code including the console, the script PR, and any Groovy class that uses the GATE Embedded API.
This section describes these features in detail, but assumes that the reader already has some knowledge of the Groovy language. If you are not already familiar with Groovy you should read this section in conjunction with Groovy’s own documentation at http://groovy.codehaus.org/.
7.16.1 Groovy Scripting Console for GATE
Loading the Groovy plugin in GATE Developer will provide a “Groovy Console” item in the Tools/Groovy Tools menu. This menu item opens the standard Groovy console window (http://groovy.codehaus.org/Groovy+Console).
To help scripting GATE in Groovy, the console is pre-configured to import all classes from the gate and gate.util packages of the core GATE API. This means you can refer to classes and interfaces such as Factory, AnnotationSet, Gate, etc. without needing to prefix them with a package name. In addition, the following (read-only) variable bindings are pre-defined in the Groovy Console.
-
corpora: a list of loaded corpora LRs (Corpus)
-
docs: a list of all loaded document LRs (DocumentImpl)
-
prs: a list of all loaded PRs
-
apps: a list of all loaded Applications (AbstractController)
These variables are automatically updated as resources are created and deleted in GATE.
Here’s an example script. It finds all documents with a feature “annotator” set to “fred”, and puts them in a new corpus called “fredsDocs”.
You can find other examples (and add your own) in the Groovy script repository on the GATE Wiki: http://gate.ac.uk/wiki/groovy-recipes/.
Why won’t the ‘Groovy executing’ dialog go away? Sometimes, when you execute a Groovy script through the console, a dialog will appear, saying “Groovy is executing. Please wait”. The dialog fails to go away even when the script has ended, and cannot be closed by clicking the “Interrupt” button. You can, however, continue to use the Groovy Console, and the dialog will usually go away next time you run a script. This is not a GATE problem: it is a Groovy problem.
7.16.2 Groovy scripting PR
The Groovy scripting PR enables you to load and execute Groovy scripts as part of a GATE application pipeline. The Groovy scripting PR is made available when you load the Groovy plugin via the plugin manager.
Parameters
The Groovy scripting PR has a single initialisation parameter:
-
scriptURL: the path to a valid Groovy script
It has three runtime parameters:
-
inputASName: an optional annotation set intended to be used as input by the PR (but note that the PR has access to all annotation sets)
-
outputASName: an optional annotation set intended to be used as output by the PR (but note that the PR has access to all annotation sets)
-
scriptParams: optional parameters for the script. In a creole.xml file, these should be specified as key=value pairs, each pair separated by a comma, for example ’name=fred,type=person’. In the GATE GUI, these are specified via a dialog.
Script bindings
As with the Groovy console described above, Groovy scripts run by the scripting PR implicitly import all classes from the gate and gate.util packages of the core GATE API. The Groovy scripting PR also makes available the following bindings, which you can use in your scripts:
-
doc: the current document (Document)
-
corpus: the corpus containing the current document
-
controller: the controller running the script
-
content: the string content of the current document
-
inputAS: the annotation set specified by inputASName in the PR’s runtime parameters
-
outputAS: the annotation set specified by outputASName in the PR’s runtime parameters
Note that inputAS and outputAS are intended to be used as input and output AnnotationSets. This is, however, a convention: there is nothing to stop a script writing to or reading from any AnnotationSet. Also, although the script has access to the corpus containing the document it is running over, it is not generally necessary for the script to iterate over the documents in the corpus itself – the reference is provided to allow the script to access data stored in the FeatureMap of the corpus. Any other variables assigned to within the script code will be added to the binding, and values set while processing one document can be used while processing a later one.
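As a rough illustration of these bindings, the sketch below counts the annotations of one type in each document, records the count as a document feature, and keeps a running total in the binding from one document to the next. The annotation type "Token" and the names tokenCount, charCount and totalTokens are assumptions chosen for the example, not something the PR defines.

```groovy
// Sketch for the Groovy scripting PR; doc, content and inputAS are the
// bindings supplied by the PR. "Token" is an assumed annotation type.
def tokens = inputAS.get("Token")

// Store per-document results in the document's FeatureMap.
doc.features.tokenCount = tokens.size()
doc.features.charCount = content.length()

// An unqualified assignment goes into the script binding, so the value
// survives until the next document. On the first document the variable
// does not exist yet, so start from zero.
totalTokens = (binding.hasVariable("totalTokens") ? totalTokens : 0) + tokens.size()
```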
Passing parameters to the script
In addition to the above bindings, one further binding is available to the script:
-
scriptParams: a FeatureMap with keys and values as specified by the scriptParams runtime parameter
For example, if you were to create a scriptParams runtime parameter for your PR, with the keys and values: ’name=fred,type=person’, then the values could be retrieved in your script via scriptParams.name and scriptParams.type. If you populate the scriptParams FeatureMap programmatically, the values will of course have the same types inside the Groovy script, but if you create the FeatureMap with GATE Developer’s parameter editor, the keys and values will all have String type. (If you want to set n=3 in the GUI editor, for example, you can use scriptParams.n as Integer in the Groovy script to obtain the Integer type.)
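A short sketch of reading these parameters inside a script, assuming scriptParams was entered in the GUI as name=fred,n=3, so that both values arrive as Strings. The annotationType key and its "Token" fallback are hypothetical, included only to show how a missing key might be given a default.

```groovy
// Assumes the scriptParams runtime parameter was set to: name=fred,n=3
def name = scriptParams.name            // the String "fred"
def n = scriptParams.n as Integer       // convert the String "3" to an Integer

// A key that was not supplied yields null, so the Elvis operator
// can provide a fallback value (hypothetical key, for illustration).
def type = scriptParams.annotationType ?: "Token"

println "name=${name}, n+1=${n + 1}, type=${type}"
```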
Controller callbacks
A Groovy script may wish to do some pre- or post-processing before or after processing the documents in a corpus, for example if it is collecting statistics about the corpus. To support this, the script can declare methods beforeCorpus and afterCorpus, each taking a single parameter. If the beforeCorpus method is defined and the script PR is running in a corpus pipeline application, the method will be called before the pipeline processes the first document. Similarly, if the afterCorpus method is defined, it will be called after the pipeline has completed processing of all the documents in the corpus. In both cases the corpus will be passed to the method as a parameter. If the pipeline aborts with an exception, the afterCorpus method will not be called, but if the script declares a method aborted(c) then this will be called instead.
Note that because the script is not processing a particular document when these methods are called, the usual doc, corpus, inputAS, etc. are not available within the body of the methods (though the corpus is passed to the method as a parameter). The scriptParams and controller variables are available.
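A minimal skeleton of the three callbacks might look like this. It assumes the script PR runs inside a corpus pipeline (otherwise beforeCorpus is never called and docCount would be undefined). The variable name docCount is hypothetical.

```groovy
// Called once before the first document is processed.
void beforeCorpus(c) {
  docCount = 0   // unqualified assignment stores the value in the script binding
}

// Script body: runs once per document.
docCount++

// Called once after the last document, if the pipeline completed normally.
void afterCorpus(c) {
  println "Processed ${docCount} documents in corpus ${c.name}"
}

// Called instead of afterCorpus if the pipeline aborted with an exception.
void aborted(c) {
  println "Aborted; ${docCount} documents had been started in corpus ${c.name}"
}
```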
The following example shows how this technique could be used to build a simple tf/idf index for a GATE corpus. The example is available in the GATE distribution as plugins/Groovy/resources/scripts/tfidf.groovy. The script makes use of some of the utility methods described in section 7.16.4.
void beforeCorpus(c) {
  // list of maps (one for each doc) from term to frequency
  frequencies = []
  // sorted map from term to docs that contain it
  docMap = new TreeMap()
  // index of the current doc in the corpus
  docNum = 0
}

// start frequency list for this document
frequencies << [:]

// iterate over the requested annotations
inputAS[scriptParams.annotationType].each {
  def str = doc.stringFor(it)
  // increment term frequency for this term
  frequencies[docNum][str] =
    (frequencies[docNum][str] ?: 0) + 1

  // keep track of which documents this term appears in
  if(!docMap[str]) {
    docMap[str] = new LinkedHashSet()