Developing Language Processing
Components with GATE
Version 5 (a User Guide)
For GATE version 5.0-beta1
(built October 31, 2008)
Hamish Cunningham
Diana Maynard
Kalina Bontcheva
Valentin Tablan
Cristian Ursu
Marin Dimitrov
Mike Dowman
Niraj Aswani
Ian Roberts
Yaoyong Li
Andrey Shafirin
Adam Funk
©The University of Sheffield 2001-2008
http://gate.ac.uk/
Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, MUSING, KnowledgeWeb, PrestoSpace, h-TechSight, enIRaF).
1 Introduction
1.1 How to Use This Text
1.2 Context
1.3 Overview
1.4 Structure of the Book
1.5 Further Reading
2 Change Log
2.1 Version 5.0-beta1 (October 2008)
2.2 Version 4.0 (July 2007)
2.3 Version 3.1 (April 2006)
2.4 January 2005
2.5 December 2004
2.6 September 2004
2.7 Version 3 Beta 1 (August 2004)
2.8 July 2004
2.9 June 2004
2.10 April 2004
2.11 March 2004
2.12 Version 2.2 – August 2003
2.13 Version 2.1 – February 2003
2.14 June 2002
3 How To…
3.1 Download GATE
3.2 Install and Run GATE
3.3 [D,F] Use System Properties with GATE
3.4 [D,F] Use (CREOLE) Plug-ins
3.5 Troubleshooting
3.6 [D] Get Started with the GUI
3.7 [D,F] Configure GATE
3.8 Build GATE
3.9 [D] Use GATE with Maven or JPF
3.10 [D,F] Create a New CREOLE Resource
3.11 [F] Instantiate CREOLE Resources
3.12 [D] Load CREOLE Resources
3.13 [D,F] Configure CREOLE Resources
3.14 [D] Create and Run an Application
3.15 [D] Run PRs Conditionally on Document Features
3.16 [D] View Annotations
3.17 [D] Do Information Extraction with ANNIE
3.18 [D] Modify ANNIE
3.19 [D] Create and Edit Test Data
3.20 [D,F] Create a New Annotation Schema
3.21 [D] Save and Restore LRs in Data Stores
3.22 [D] Save Resource Parameter State to File
3.23 [D,F] Perform Evaluation with the AnnotationDiff tool
3.24 [D] Use the Corpus Benchmark Evaluation tool
3.25 [D] Write JAPE Grammars
3.26 [F] Embed NLE in other Applications
3.27 [F] Use GATE within a Spring application
3.28 [F] Use GATE within a Tomcat Web Application
3.29 [F] Use GATE in a Multithreaded Environment
3.30 [D,F] Add support for a new document format
3.31 [D] Dump Results to File
3.32 [D] Stop GUI ‘Freezing’ on Linux
3.33 [D] Stop GUI Crashing on Linux
3.34 [D] Stop GATE Restoring GUI Sessions/Options
3.35 Work with Unicode
3.36 Work with Oracle and PostgreSQL
4 CREOLE: the GATE Component Model
4.1 The Web and CREOLE
4.2 Java Beans: a Simple Component Architecture
4.3 The GATE Framework
4.4 Language Resources and Processing Resources
4.5 The Lifecycle of a CREOLE Resource
4.6 Processing Resources and Applications
4.7 Language Resources and Datastores
4.8 Built-in CREOLE Resources
4.9 CREOLE Resource Configuration
5 Visual CREOLE
5.1 Gazetteer Visual Resource - GAZE
5.2 Ontogazetteer
5.3 The Co-reference Editor
6 Language Resources: Corpora, Documents and Annotations
6.1 Features: Simple Attribute/Value Data
6.2 Corpora: Sets of Documents plus Features
6.3 Documents: Content plus Annotations plus Features
6.4 Annotations: Directed Acyclic Graphs
6.5 Document Formats
6.6 XML Input/Output
7 JAPE: Regular Expressions Over Annotations
7.1 Matching operators in detail
7.2 Use of Context
7.3 Use of Priority
7.4 Use of negation
7.5 Useful tricks
7.6 Ontology aware grammar transduction
7.7 Using Java code in JAPE rules
7.8 Optimising for speed
7.9 Serializing JAPE Transducer
7.10 The JAPE Debugger
7.11 Notes for Montreal Transducer users
8 ANNIE: a Nearly-New Information Extraction System
8.1 Tokeniser
8.2 Gazetteer
8.3 Sentence Splitter
8.4 RegEx Sentence Splitter
8.5 Part of Speech Tagger
8.6 Semantic Tagger
8.7 Orthographic Coreference (OrthoMatcher)
8.8 Pronominal Coreference
8.9 A Walk-Through Example
9 (More CREOLE) Plugins
9.1 Document Reset
9.2 Verb Group Chunker
9.3 Noun Phrase Chunker
9.4 OntoText Gazetteer
9.5 Flexible Gazetteer
9.6 Gazetteer List Collector
9.7 Tree Tagger
9.8 Stemmer
9.9 GATE Morphological Analyzer
9.10 MiniPar Parser
9.11 RASP Parser
9.12 SUPPLE Parser (formerly BuChart)
9.13 Stanford Parser
9.14 Montreal Transducer
9.15 Language Plugins
9.16 Chemistry Tagger
9.17 Flexible Exporter
9.18 Annotation Set Transfer
9.19 Information Retrieval in GATE
9.20 Crawler
9.21 Google Plugin
9.22 Yahoo Plugin
9.23 WordNet in GATE
9.24 Machine Learning in GATE
9.25 MinorThird
9.26 MIAKT NLG Lexicon
9.27 Kea - Automatic Keyphrase Detection
9.28 Ontotext JapeC Compiler
9.29 ANNIC
9.30 Annotation Merging
9.31 OntoRoot Gazetteer
10 Working with Ontologies
10.1 Data Model for Ontologies
10.2 Ontology Event Model (new in GATE 4)
10.3 OWLIM Ontology LR
10.4 GATE’s Ontology Editor
10.5 Instantiating OWLIM Ontology using GATE API
10.6 Ontology-Aware JAPE Transducer
10.7 Annotating text with Ontological Information
10.8 Populating Ontologies
10.9 Ontology Annotation Tool
11 Machine Learning API
11.1 ML Generalities
11.2 The Batch Learning PR in GATE
11.3 Examples of configuration file for the three learning types
11.4 How to use the ML API
11.5 The outputs of the ML API
12 Tools for Alignment Tasks
12.1 Introduction
12.2 Tools for Alignment Tasks
13 Performance Evaluation of Language Analysers
13.1 The AnnotationDiff Tool
13.2 The six annotation relations explained
13.3 Benchmarking tool
13.4 Metrics for Evaluation in Information Extraction
13.5 Metrics for Evaluation of Inter-Annotator Agreement
13.6 A Plugin for Computing Inter-Annotator Agreement
14 Users, Groups, and LR Access Rights
14.1 Java serialisation and LR access rights
14.2 Oracle Datastore and LR access rights
15 Developing GATE
15.1 Creating new plugins
15.2 Updating this User Guide
16 Combining GATE and UIMA
16.1 Embedding a UIMA TAE in GATE
16.2 Embedding a GATE CorpusController in UIMA
Appendices
A Design Notes
A.1 Patterns
A.2 Exception Handling
B JAPE: Implementation
B.1 Formal Description of the JAPE Grammar
B.2 Relation to CPSL
B.3 Algorithms for JAPE Rule Application
B.4 Label Binding Scheme
B.5 Classes
B.6 Implementation
B.7 Compilation
B.8 Using a Different Java Compiler
C Named-Entity State Machine Patterns
C.1 Main.jape
C.2 first.jape
C.3 firstname.jape
C.4 name.jape
C.5 name_post.jape
C.6 date_pre.jape
C.7 date.jape
C.8 reldate.jape
C.9 number.jape
C.10 address.jape
C.11 url.jape
C.12 identifier.jape
C.13 jobtitle.jape
C.14 final.jape
C.15 unknown.jape
C.16 name_context.jape
C.17 org_context.jape
C.18 loc_context.jape
C.19 clean.jape
D Part-of-Speech Tags used in the Hepple Tagger
E Sample ML Configuration File
F IAA Measures for Classification Tasks
References
Colophon