Developing Language Processing Components with GATE
Version 8 (a User Guide)
For GATE version 8.4.1
(built June 9, 2017)
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj
Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic,
Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim
Peters, Leon Derczynski, et al.
© The University of Sheffield, Department of Computer Science 2001-2017
https://gate.ac.uk/
Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Ontotext, Matrixware, the Information Retrieval Facility and several EU-funded projects: TrendMiner, uComp, Arcomem, SEKT, TAO, NeOn, MediaCampaign, Musing, KnowledgeWeb, PrestoSpace, h-TechSight, and enIRaF.
I GATE Basics
1 Introduction
1.1 How to Use this Text
1.2 Context
1.3 Overview
1.4 Some Evaluations
1.5 Recent Changes
1.6 Further Reading
2 Installing and Running GATE
2.1 Downloading GATE
2.2 Installing and Running GATE
2.3 Using System Properties with GATE
2.4 Changing GATE’s launch configuration
2.5 Configuring GATE
2.6 Building GATE
2.7 Uninstalling GATE
2.8 Troubleshooting
3 Using GATE Developer
3.1 The GATE Developer Main Window
3.2 Loading and Viewing Documents
3.3 Creating and Viewing Corpora
3.4 Working with Annotations
3.5 Using CREOLE Plugins
3.6 Installing and updating CREOLE Plugins
3.7 Loading and Using Processing Resources
3.8 Creating and Running an Application
3.9 Saving Applications and Language Resources
3.10 Keyboard Shortcuts
3.11 Miscellaneous
4 CREOLE: the GATE Component Model
4.1 The Web and CREOLE
4.2 The GATE Framework
4.3 The Lifecycle of a CREOLE Resource
4.4 Processing Resources and Applications
4.5 Language Resources and Datastores
4.6 Built-in CREOLE Resources
4.7 CREOLE Resource Configuration
4.8 Tools: How to Add Utilities to GATE Developer
5 Language Resources: Corpora, Documents and Annotations
5.1 Features: Simple Attribute/Value Data
5.2 Corpora: Sets of Documents plus Features
5.3 Documents: Content plus Annotations plus Features
5.4 Annotations: Directed Acyclic Graphs
5.5 Document Formats
5.6 XML Input/Output
6 ANNIE: a Nearly-New Information Extraction System
6.1 Document Reset
6.2 Tokeniser
6.3 Gazetteer
6.4 Sentence Splitter
6.5 RegEx Sentence Splitter
6.6 Part of Speech Tagger
6.7 Semantic Tagger
6.8 Orthographic Coreference (OrthoMatcher)
6.9 Pronominal Coreference
6.10 A Walk-Through Example
II GATE for Advanced Users
7 GATE Embedded
7.1 Quick Start with GATE Embedded
7.2 Resource Management in GATE Embedded
7.3 Using CREOLE Plugins
7.4 Language Resources
7.5 Processing Resources
7.6 Controllers
7.7 Modelling Relations between Annotations
7.8 Duplicating a Resource
7.9 Persistent Applications
7.10 Ontologies
7.11 Creating a New Annotation Schema
7.12 Creating a New CREOLE Resource
7.13 Adding Support for a New Document Format
7.14 Using GATE Embedded in a Multithreaded Environment
7.15 Using GATE Embedded within a Spring Application
7.16 Using GATE Embedded within a Tomcat Web Application
7.17 Groovy for GATE
7.18 Saving Config Data to gate.xml
7.19 Annotation merging through the API
7.20 Using Resource Helpers to Extend the API
8 JAPE: Regular Expressions over Annotations
8.1 The Left-Hand Side
8.2 LHS Operators in Detail
8.3 The Right-Hand Side
8.4 Use of Priority
8.5 Using Phases Sequentially
8.6 Using Java Code on the RHS
8.7 Optimising for Speed
8.8 Ontology Aware Grammar Transduction
8.9 Serializing JAPE Transducer
8.10 Notes for Montreal Transducer Users
8.11 JAPE Plus
9 ANNIC: ANNotations-In-Context
9.1 Instantiating SSD
9.2 Search GUI
9.3 Using SSD from GATE Embedded
10 Performance Evaluation of Language Analysers
10.1 Metrics for Evaluation in Information Extraction
10.2 The Annotation Diff Tool
10.3 Corpus Quality Assurance
10.4 Corpus Benchmark Tool
10.5 A Plugin Computing Inter-Annotator Agreement (IAA)
10.6 A Plugin Computing the BDM Scores for an Ontology
10.7 Quality Assurance Summariser for Teamware
11 Profiling Processing Resources
11.1 Overview
11.2 Graphical User Interface
11.3 Command Line Interface
11.4 Application Programming Interface
12 Developing GATE
12.1 Reporting Bugs and Requesting Features
12.2 Contributing Patches
12.3 Creating New Plugins
12.4 Updating this User Guide
III CREOLE Plugins
13 Gazetteers
13.1 Introduction to Gazetteers
13.2 ANNIE Gazetteer
13.3 OntoGazetteer
13.4 Gaze Ontology Gazetteer Editor
13.5 Hash Gazetteer
13.6 Flexible Gazetteer
13.7 Gazetteer List Collector
13.8 OntoRoot Gazetteer
13.9 Large KB Gazetteer
13.10 The Shared Gazetteer for multithreaded processing
14 Working with Ontologies
14.1 Data Model for Ontologies
14.2 Ontology Event Model
14.3 The Ontology Plugin: Current Implementation
14.4 The Ontology_OWLIM2 plugin: backwards-compatible implementation
14.5 GATE Ontology Editor
14.6 Ontology Annotation Tool
14.7 Relation Annotation Tool
14.8 Using the ontology API
14.9 Using the ontology API (old version)
14.10 Ontology-Aware JAPE Transducer
14.11 Annotating Text with Ontological Information
14.12 Populating Ontologies
14.13 Ontology API and Implementation Changes
15 Non-English Language Support
15.1 Language Identification
15.2 French Plugin
15.3 German Plugin
15.4 Romanian Plugin
15.5 Arabic Plugin
15.6 Chinese Plugin
15.7 Hindi Plugin
15.8 Russian Plugin
15.9 Bulgarian Plugin
15.10 Danish Plugin
15.11 Welsh Plugin
16 Domain Specific Resources
16.1 Biomedical Support
17 Tools for Social Media Data
17.1 Tools for Twitter
17.2 Twitter JSON format
17.3 Exporting GATE documents as JSON
17.4 Low-level PRs for Tweets
17.5 Handling multi-word hashtags
17.6 The TwitIE Pipeline
18 Parsers
18.1 RASP Parser
18.2 SUPPLE Parser
18.3 Stanford Parser
19 Machine Learning
19.1 ML Generalities
19.2 Batch Learning PR
19.3 Machine Learning PR
20 Tools for Alignment Tasks
20.1 Introduction
20.2 The Tools
21 Crowdsourcing Data with GATE
21.1 The Basics
21.2 Entity classification
21.3 Entity annotation
22 Combining GATE and UIMA
22.1 Embedding a UIMA AE in GATE
22.2 Embedding a GATE CorpusController in UIMA
23 More (CREOLE) Plugins
23.1 Verb Group Chunker
23.2 Noun Phrase Chunker
23.3 TaggerFramework
23.4 Chemistry Tagger
23.5 Lupedia Semantic Annotation Service
23.6 TextRazor Annotation Service
23.7 Annotating Numbers
23.8 Annotating Measurements
23.9 Annotating and Normalizing Dates
23.10 Snowball Based Stemmers
23.11 GATE Morphological Analyzer
23.12 Flexible Exporter
23.13 Configurable Exporter
23.14 Annotation Set Transfer
23.15 Schema Enforcer
23.16 Information Retrieval in GATE
23.17 Websphinx Web Crawler
23.18 WordNet in GATE
23.19 Kea - Automatic Keyphrase Detection
23.20 Annotation Merging Plugin
23.21 Copying Annotations between Documents
23.22 LingPipe Plugin
23.23 OpenNLP Plugin
23.24 Stanford CoreNLP
23.25 Content Detection Using Boilerpipe
23.26 Inter Annotator Agreement
23.27 Schema Annotation Editor
23.28 Coref Tools Plugin
23.29 Pubmed Format
23.30 MediaWiki Format
23.31 Fast Infoset Document Format
23.32 DataSift Document Format
23.33 CSV Document Support
23.34 TermRaider term extraction tools
23.35 Document Normalizer
23.36 Developer Tools
23.37 Linguistic Simplifier
23.38 GATE-Time
IV The GATE Family: Cloud, MIMIR, Teamware
24 GATE Cloud
24.1 GATE Cloud services: an overview
24.2 Using GATE Cloud services
24.3 Annotation Jobs on GATE Cloud
25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
25.1 Introduction
25.2 Requirements for Multi-Role Collaborative Annotation Environments
25.3 Teamware: Architecture, Implementation, and Examples
25.4 Practical Applications
26 GATE Mímir
Appendices
A Change Log
A.1 Version 8.4.1 (June 2017)
A.2 Version 8.4 (February 2017)
A.3 Version 8.3 (January 2017)
A.4 Version 8.2 (May 2016)
A.5 Version 8.1 (June 2015)
A.6 Version 8.0 (May 2014)
A.7 Version 7.1 (November 2012)
A.8 Version 7.0 (February 2012)
A.9 Version 6.1 (April 2011)
A.10 Version 6.0 (November 2010)
A.11 Version 5.2.1 (May 2010)
A.12 Version 5.2 (April 2010)
A.13 Version 5.1 (December 2009)
A.14 Version 5.0 (May 2009)
A.15 Version 4.0 (July 2007)
A.16 Version 3.1 (April 2006)
A.17 January 2005
A.18 December 2004
A.19 September 2004
A.20 Version 3 Beta 1 (August 2004)
A.21 July 2004
A.22 June 2004
A.23 April 2004
A.24 March 2004
A.25 Version 2.2 – August 2003
A.26 Version 2.1 – February 2003
A.27 June 2002
B Version 5.1 Plugins Name Map
C Obsolete CREOLE Plugins
C.1 Ontotext JapeC Compiler
C.2 Google Plugin
C.3 Yahoo Plugin
C.4 Gazetteer Visual Resource - GAZE
C.5 Google Translator PR
D Design Notes
D.1 Patterns
D.2 Exception Handling
E Ant Tasks for GATE
E.1 Declaring the Tasks
E.2 The packagegapp task - bundling an application with its dependencies
E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml
F Named-Entity State Machine Patterns
F.1 Main.jape
F.2 first.jape
F.3 firstname.jape
F.4 name.jape
F.5 name_post.jape
F.6 date_pre.jape
F.7 date.jape
F.8 reldate.jape
F.9 number.jape
F.10 address.jape
F.11 url.jape
F.12 identifier.jape
F.13 jobtitle.jape
F.14 final.jape
F.15 unknown.jape
F.16 name_context.jape
F.17 org_context.jape
F.18 loc_context.jape
F.19 clean.jape
G Part-of-Speech Tags used in the Hepple Tagger
H Copyright and Licence
References
Colophon