Contents
I GATE Basics
1 Introduction
1.1 How to Use this Text
1.2 Context
1.3 Overview
1.3.1 Developing and Deploying Language Processing Facilities
1.3.2 Built-In Components
1.3.3 Additional Facilities in GATE Developer/Embedded
1.3.4 An Example
1.4 Some Evaluations
1.5 Recent Changes
1.5.1 Version 8.0 (May 2014)
1.6 Further Reading
2 Installing and Running GATE
2.1 Downloading GATE
2.2 Installing and Running GATE
2.2.1 The Easy Way
2.2.2 The Hard Way (1)
2.2.3 The Hard Way (2): Subversion
2.2.4 Running GATE Developer on Unix/Linux
2.3 Using System Properties with GATE
2.4 Changing GATE’s launch configuration
2.5 Configuring GATE
2.6 Building GATE
2.6.1 Using GATE with Maven/Ivy
2.7 Uninstalling GATE
2.8 Troubleshooting
3 Using GATE Developer
3.1 The GATE Developer Main Window
3.2 Loading and Viewing Documents
3.3 Creating and Viewing Corpora
3.4 Working with Annotations
3.4.1 The Annotation Sets View
3.4.2 The Annotations List View
3.4.3 The Annotations Stack View
3.4.4 The Co-reference Editor
3.4.5 Creating and Editing Annotations
3.4.6 Schema-Driven Editing
3.4.7 Printing Text with Annotations
3.5 Using CREOLE Plugins
3.6 Installing and updating CREOLE Plugins
3.7 Loading and Using Processing Resources
3.8 Creating and Running an Application
3.8.1 Running an Application on a Datastore
3.8.2 Running PRs Conditionally on Document Features
3.8.3 Doing Information Extraction with ANNIE
3.8.4 Modifying ANNIE
3.9 Saving Applications and Language Resources
3.9.1 Saving Documents to File
3.9.2 Saving and Restoring LRs in Datastores
3.9.3 Saving Application States to a File
3.9.4 Saving an Application with its Resources (e.g. GATECloud.net)
3.10 Keyboard Shortcuts
3.11 Miscellaneous
3.11.1 Stopping GATE from Restoring Developer Sessions/Options
3.11.2 Working with Unicode
4 CREOLE: the GATE Component Model
4.1 The Web and CREOLE
4.2 The GATE Framework
4.3 The Lifecycle of a CREOLE Resource
4.4 Processing Resources and Applications
4.5 Language Resources and Datastores
4.6 Built-in CREOLE Resources
4.7 CREOLE Resource Configuration
4.7.1 Configuration with XML
4.7.2 Configuring Resources using Annotations
4.7.3 Mixing the Configuration Styles
4.7.4 Loading Third-Party Libraries using Apache Ivy
4.8 Tools: How to Add Utilities to GATE Developer
4.8.1 Putting Your Tools in a Sub-Menu
4.8.2 Adding Tools To Existing Resource Types
5 Language Resources: Corpora, Documents and Annotations
5.1 Features: Simple Attribute/Value Data
5.2 Corpora: Sets of Documents plus Features
5.3 Documents: Content plus Annotations plus Features
5.4 Annotations: Directed Acyclic Graphs
5.4.1 Annotation Schemas
5.4.2 Examples of Annotated Documents
5.4.3 Creating, Viewing and Editing Diverse Annotation Types
5.5 Document Formats
5.5.1 Detecting the Right Reader
5.5.2 XML
5.5.3 HTML
5.5.4 SGML
5.5.5 Plain text
5.5.6 RTF
5.5.7 Email
5.5.8 PDF Files and Office Documents
5.5.9 UIMA CAS Documents
5.5.10 CoNLL/IOB Documents
5.6 XML Input/Output
6 ANNIE: a Nearly-New Information Extraction System
6.1 Document Reset
6.2 Tokeniser
6.2.1 Tokeniser Rules
6.2.2 Token Types
6.2.3 English Tokeniser
6.3 Gazetteer
6.4 Sentence Splitter
6.5 RegEx Sentence Splitter
6.6 Part of Speech Tagger
6.7 Semantic Tagger
6.8 Orthographic Coreference (OrthoMatcher)
6.8.1 GATE Interface
6.8.2 Resources
6.8.3 Processing
6.9 Pronominal Coreference
6.9.1 Quoted Speech Submodule
6.9.2 Pleonastic It Submodule
6.9.3 Pronominal Resolution Submodule
6.9.4 Detailed Description of the Algorithm
6.10 A Walk-Through Example
6.10.1 Step 1 - Tokenisation
6.10.2 Step 2 - List Lookup
6.10.3 Step 3 - Grammar Rules
II GATE for Advanced Users
7 GATE Embedded
7.1 Quick Start with GATE Embedded
7.2 Resource Management in GATE Embedded
7.3 Using CREOLE Plugins
7.4 Language Resources
7.4.1 GATE Documents
7.4.2 Feature Maps
7.4.3 Annotation Sets
7.4.4 Annotations
7.4.5 GATE Corpora
7.5 Processing Resources
7.6 Controllers
7.7 Modelling Relations between Annotations
7.8 Duplicating a Resource
7.8.1 Sharable properties
7.9 Persistent Applications
7.10 Ontologies
7.11 Creating a New Annotation Schema
7.12 Creating a New CREOLE Resource
7.13 Adding Support for a New Document Format
7.14 Using GATE Embedded in a Multithreaded Environment
7.15 Using GATE Embedded within a Spring Application
7.15.1 Duplication in Spring
7.15.2 Spring pooling
7.15.3 Further reading
7.16 Using GATE Embedded within a Tomcat Web Application
7.16.1 Recommended Directory Structure
7.16.2 Configuration Files
7.16.3 Initialization Code
7.17 Groovy for GATE
7.17.1 Groovy Scripting Console for GATE
7.17.2 Groovy scripting PR
7.17.3 The Scriptable Controller
7.17.4 Utility methods
7.18 Saving Config Data to gate.xml
7.19 Annotation merging through the API
7.20 Using Resource Helpers to Extend the API
8 JAPE: Regular Expressions over Annotations
8.1 The Left-Hand Side
8.1.1 Matching Entire Annotation Types
8.1.2 Using Features and Values
8.1.3 Using Meta-Properties
8.1.4 Building complex patterns from simple patterns
8.1.5 Matching a Simple Text String
8.1.6 Using Templates
8.1.7 Multiple Pattern/Action Pairs
8.1.8 LHS Macros
8.1.9 Multi-Constraint Statements
8.1.10 Using Context
8.1.11 Negation
8.1.12 Escaping Special Characters
8.2 LHS Operators in Detail
8.2.1 Equality Operators
8.2.2 Comparison Operators
8.2.3 Regular Expression Operators
8.2.4 Contextual Operators
8.2.5 Custom Operators
8.3 The Right-Hand Side
8.3.1 A Simple Example
8.3.2 Copying Feature Values from the LHS to the RHS
8.3.3 Optional or Empty Labels
8.3.4 RHS Macros
8.4 Use of Priority
8.5 Using Phases Sequentially
8.6 Using Java Code on the RHS
8.6.1 A More Complex Example
8.6.2 Adding a Feature to the Document
8.6.3 Finding the Tokens of a Matched Annotation
8.6.4 Using Named Blocks
8.6.5 Java RHS Overview
8.7 Optimising for Speed
8.8 Ontology Aware Grammar Transduction
8.9 Serializing JAPE Transducer
8.9.1 How to Serialize?
8.9.2 How to Use the Serialized Grammar File?
8.10 Notes for Montreal Transducer Users
8.11 JAPE Plus
9 ANNIC: ANNotations-In-Context
9.1 Instantiating SSD
9.2 Search GUI
9.2.1 Overview
9.2.2 Syntax of Queries
9.2.3 Top Section
9.2.4 Central Section
9.2.5 Bottom Section
9.3 Using SSD from GATE Embedded
9.3.1 How to instantiate a searchabledatastore
9.3.2 How to search in this datastore
10 Performance Evaluation of Language Analysers
10.1 Metrics for Evaluation in Information Extraction
10.1.1 Annotation Relations
10.1.2 Cohen’s Kappa
10.1.3 Precision, Recall, F-Measure
10.1.4 Macro and Micro Averaging
10.2 The Annotation Diff Tool
10.2.1 Performing Evaluation with the Annotation Diff Tool
10.2.2 Creating a Gold Standard with the Annotation Diff Tool
10.3 Corpus Quality Assurance
10.3.1 Description of the interface
10.3.2 Step by step usage
10.3.3 Details of the Corpus statistics table
10.3.4 Details of the Document statistics table
10.3.5 GATE Embedded API for the measures
10.3.6 Quality Assurance PR
10.4 Corpus Benchmark Tool
10.4.1 Preparing the Corpora for Use
10.4.2 Defining Properties
10.4.3 Running the Tool
10.4.4 The Results
10.5 A Plugin Computing Inter-Annotator Agreement (IAA)
10.5.1 IAA for Classification
10.5.2 IAA For Named Entity Annotation
10.5.3 The BDM-Based IAA Scores
10.6 A Plugin Computing the BDM Scores for an Ontology
10.7 Quality Assurance Summariser for Teamware
11 Profiling Processing Resources
11.1 Overview
11.1.1 Features
11.1.2 Limitations
11.2 Graphical User Interface
11.3 Command Line Interface
11.4 Application Programming Interface
11.4.1 Log4j.properties
11.4.2 Benchmark log format
11.4.3 Enabling profiling
11.4.4 Reporting tool
12 Developing GATE
12.1 Reporting Bugs and Requesting Features
12.2 Contributing Patches
12.3 Creating New Plugins
12.3.1 What to Call your Plugin
12.3.2 Writing a New PR
12.3.3 Writing a New VR
12.3.4 Writing a ‘Ready Made’ Application
12.3.5 Distributing Your New Plugins
12.4 Updating this User Guide
12.4.1 Building the User Guide
12.4.2 Making Changes to the User Guide
III CREOLE Plugins
13 Gazetteers
13.1 Introduction to Gazetteers
13.2 ANNIE Gazetteer
13.2.1 Creating and Modifying Gazetteer Lists
13.2.2 ANNIE Gazetteer Editor
13.3 OntoGazetteer
13.4 Gaze Ontology Gazetteer Editor
13.4.1 The Gaze Gazetteer List and Mapping Editor
13.4.2 The Gaze Ontology Editor
13.5 Hash Gazetteer
13.5.1 Prerequisites
13.5.2 Parameters
13.6 Flexible Gazetteer
13.7 Gazetteer List Collector
13.8 OntoRoot Gazetteer
13.8.1 How Does it Work?
13.8.2 Initialisation of OntoRoot Gazetteer
13.8.3 Simple steps to run OntoRoot Gazetteer
13.9 Large KB Gazetteer
13.9.1 Quick usage overview
13.9.2 Dictionary setup
13.9.3 Additional dictionary configuration
13.9.4 Dictionary for Gazetteer List Files
13.9.5 Processing Resource Configuration
13.9.6 Runtime configuration
13.9.7 Semantic Enrichment PR
13.10 The Shared Gazetteer for multithreaded processing
14 Working with Ontologies
14.1 Data Model for Ontologies
14.1.1 Hierarchies of Classes and Restrictions
14.1.2 Instances
14.1.3 Hierarchies of Properties
14.1.4 URIs
14.2 Ontology Event Model
14.2.1 What Happens when a Resource is Deleted?
14.3 The Ontology Plugin: Current Implementation
14.3.1 The OWLIMOntology Language Resource
14.3.2 The ConnectSesameOntology Language Resource
14.3.3 The CreateSesameOntology Language Resource
14.3.4 The OWLIM2 Backwards-Compatible Language Resource
14.3.5 Using Ontology Import Mappings
14.3.6 Using BigOWLIM
14.3.7 The sesameCLI command line interface
14.4 The Ontology_OWLIM2 plugin: backwards-compatible implementation
14.4.1 The OWLIMOntologyLR Language Resource
14.5 GATE Ontology Editor
14.6 Ontology Annotation Tool
14.6.1 Viewing Annotated Text
14.6.2 Editing Existing Annotations
14.6.3 Adding New Annotations
14.6.4 Options
14.7 Relation Annotation Tool
14.7.1 Description of the two views
14.7.2 Create new annotation and instance from text selection
14.7.3 Create new annotation and add label to existing instance from text selection
14.7.4 Create and set properties for annotation relation
14.7.5 Delete instance, label or property
14.7.6 Differences with OAT and Ontology Editor
14.8 Using the ontology API
14.9 Using the ontology API (old version)
14.10 Ontology-Aware JAPE Transducer
14.11 Annotating Text with Ontological Information
14.12 Populating Ontologies
14.13 Ontology API and Implementation Changes
14.13.1 Differences between the implementation plugins
14.13.2 Changes in the Ontology API
15 Non-English Language Support
15.1 Language Identification
15.1.1 Fingerprint Generation
15.2 French Plugin
15.3 German Plugin
15.4 Romanian Plugin
15.5 Arabic Plugin
15.6 Chinese Plugin
15.6.1 Chinese Word Segmentation
15.7 Hindi Plugin
15.8 Russian Plugin
15.9 Bulgarian Plugin
16 Domain Specific Resources
16.1 Biomedical Support
16.1.1 ABNER
16.1.2 MetaMap
16.1.3 GSpell biomedical spelling suggestion and correction
16.1.4 BADREX
16.1.5 MiniChem/Drug Tagger
16.1.6 AbGene
16.1.7 GENIA
16.1.8 Penn BioTagger
16.1.9 MutationFinder
16.1.10 NormaGene
17 Tools for Social Media Data
17.1 Tools for Twitter
17.1.1 Twitter JSON format
17.2 Low-level PRs for Tweets
17.3 Handling multi-word hashtags
17.4 The TwitIE Pipeline
18 Parsers
18.1 MiniPar Parser
18.1.1 Platform Supported
18.1.2 Resources
18.1.3 Parameters
18.1.4 Prerequisites
18.1.5 Grammatical Relationships
18.2 RASP Parser
18.3 SUPPLE Parser
18.3.1 Requirements
18.3.2 Building SUPPLE
18.3.3 Running the Parser in GATE
18.3.4 Viewing the Parse Tree
18.3.5 System Properties
18.3.6 Configuration Files
18.3.7 Parser and Grammar
18.3.8 Mapping Named Entities
18.3.9 Upgrading from BuChart to SUPPLE
18.4 Stanford Parser
18.4.1 Input Requirements
18.4.2 Initialization Parameters
18.4.3 Runtime Parameters
19 Machine Learning
19.1 ML Generalities
19.1.1 Some Definitions
19.1.2 GATE-Specific Interpretation of the Above Definitions
19.2 Batch Learning PR
19.2.1 Batch Learning PR Configuration File Settings
19.2.2 Case Studies for the Three Learning Types
19.2.3 How to Use the Batch Learning PR in GATE Developer
19.2.4 Output of the Batch Learning PR
19.2.5 Using the Batch Learning PR from the API
19.3 Machine Learning PR
19.3.1 The DATASET Element
19.3.2 The ENGINE Element
19.3.3 The WEKA Wrapper
19.3.4 The MAXENT Wrapper
19.3.5 The SVM Light Wrapper
19.3.6 Example Configuration File
20 Tools for Alignment Tasks
20.1 Introduction
20.2 The Tools
20.2.1 Compound Document
20.2.2 CompoundDocumentFromXml
20.2.3 Compound Document Editor
20.2.4 Composite Document
20.2.5 DeleteMembersPR
20.2.6 SwitchMembersPR
20.2.7 Saving as XML
20.2.8 Alignment Editor
20.2.9 Saving Files and Alignments
20.2.10 Section-by-Section Processing
21 Crowdsourcing Data with GATE
21.1 The Basics
21.2 Entity classification
21.2.1 Creating a classification job
21.2.2 Loading data into a job
21.2.3 Importing the results
21.3 Entity annotation
21.3.1 Creating an annotation job
21.3.2 Loading data into a job
21.3.3 Importing the results
22 Combining GATE and UIMA
22.1 Embedding a UIMA AE in GATE
22.1.1 Mapping File Format
22.1.2 The UIMA Component Descriptor
22.1.3 Using the AnalysisEnginePR
22.2 Embedding a GATE CorpusController in UIMA
22.2.1 Mapping File Format
22.2.2 The GATE Application Definition
22.2.3 Configuring the GATEApplicationAnnotator
23 More (CREOLE) Plugins
23.1 Verb Group Chunker
23.2 Noun Phrase Chunker
23.2.1 Differences from the Original
23.2.2 Using the Chunker
23.3 TaggerFramework
23.3.1 TreeTagger—Multilingual POS Tagger
23.3.2 GENIA and Double Quotes
23.4 Chemistry Tagger
23.4.1 Using the Tagger
23.5 Zemanta Semantic Annotation Service
23.6 Lupedia Semantic Annotation Service
23.7 TextRazor Annotation Service
23.8 Annotating Numbers
23.8.1 Numbers in Words and Numbers
23.8.2 Roman Numerals
23.9 Annotating Measurements
23.10 Annotating and Normalizing Dates
23.11 Snowball Based Stemmers
23.11.1 Algorithms
23.12 GATE Morphological Analyzer
23.12.1 Rule File
23.13 Flexible Exporter
23.14 Configurable Exporter
23.15 Annotation Set Transfer
23.16 Schema Enforcer
23.17 Information Retrieval in GATE
23.17.1 Using the IR Functionality in GATE
23.17.2 Using the IR API
23.18 Websphinx Web Crawler
23.18.1 Using the Crawler PR
23.18.2 Proxy configuration
23.19 WordNet in GATE
23.19.1 The WordNet API
23.20 Kea - Automatic Keyphrase Detection
23.20.1 Using the ‘KEA Keyphrase Extractor’ PR
23.20.2 Using Kea Corpora
23.21 Annotation Merging Plugin
23.22 Copying Annotations between Documents
23.23 OpenCalais Plugin
23.24 LingPipe Plugin
23.24.1 LingPipe Tokenizer PR
23.24.2 LingPipe Sentence Splitter PR
23.24.3 LingPipe POS Tagger PR
23.24.4 LingPipe NER PR
23.24.5 LingPipe Language Identifier PR
23.25 OpenNLP Plugin
23.25.1 Init parameters and models
23.25.2 OpenNLP PRs
23.25.3 Obtaining and generating models
23.26 Stanford Part-of-Speech Tagger
23.27 Content Detection Using Boilerpipe
23.28 Inter Annotator Agreement
23.29 Schema Annotation Editor
23.30 Coref Tools Plugin
23.31 Pubmed Format
23.32 MediaWiki Format
23.33 Fast Infoset Document Format
23.34 CSV Document Support
23.35 TermRaider term extraction tools
23.35.1 Termbank language resources
23.35.2 Termbank Score Copier
23.35.3 The PMI bank language resource
23.36 Document Normalizer
23.37 Developer Tools
IV The GATE Family: Cloud, MIMIR, Teamware
24 GATE Cloud
24.1 GATE Cloud services: an overview
24.2 Comparison with other systems
24.3 How to buy services
24.4 Pricing and discounts
24.5 Annotation Jobs on GATECloud.net
24.5.1 The Annotation Service Charges Explained
24.5.2 Annotation Job Execution in Detail
24.6 Running Custom Annotation Jobs on GATECloud.net
24.6.1 Preparing Your Application: The Basics
24.6.2 The GATECloud.net environment
25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
25.1 Introduction
25.2 Requirements for Multi-Role Collaborative Annotation Environments
25.2.1 Typical Division of Labour
25.2.2 Remote, Scalable Data Storage
25.2.3 Automatic annotation services
25.2.4 Workflow Support
25.3 Teamware: Architecture, Implementation, and Examples
25.3.1 Data Storage Service
25.3.2 Annotation Services
25.3.3 The Executive Layer
25.3.4 The User Interfaces
25.4 Practical Applications
26 GATE Mímir
Appendices
A Change Log
A.1 Version 8.0 (May 2014)
A.1.1 Major changes
A.1.2 Other new and improved plugins
A.1.3 Bug fixes and other improvements
A.1.4 For developers
A.2 Version 7.1 (November 2012)
A.2.1 New plugins
A.2.2 Library updates
A.2.3 GATE Embedded API changes
A.3 Version 7.0 (February 2012)
A.3.1 Major new features
A.3.2 Removal of deprecated functionality
A.3.3 Other enhancements and bug fixes
A.4 Version 6.1 (April 2011)
A.4.1 New CREOLE Plugins
A.4.2 Other new features and improvements
A.5 Version 6.0 (November 2010)
A.5.1 Major new features
A.5.2 Breaking changes
A.5.3 Other new features and bugfixes
A.6 Version 5.2.1 (May 2010)
A.7 Version 5.2 (April 2010)
A.7.1 JAPE and JAPE-related
A.7.2 Other Changes
A.8 Version 5.1 (December 2009)
A.8.1 New Features
A.8.2 JAPE improvements
A.8.3 Other improvements and bug fixes
A.9 Version 5.0 (May 2009)
A.9.1 Major New Features
A.9.2 Other New Features and Improvements
A.9.3 Specific Bug Fixes
A.10 Version 4.0 (July 2007)
A.10.1 Major New Features
A.10.2 Other New Features and Improvements
A.10.3 Bug Fixes and Optimizations
A.11 Version 3.1 (April 2006)
A.11.1 Major New Features
A.11.2 Other New Features and Improvements
A.11.3 Bug Fixes
A.12 January 2005
A.13 December 2004
A.14 September 2004
A.15 Version 3 Beta 1 (August 2004)
A.16 July 2004
A.17 June 2004
A.18 April 2004
A.19 March 2004
A.20 Version 2.2 – August 2003
A.21 Version 2.1 – February 2003
A.22 June 2002
B Version 5.1 Plugins Name Map
C Obsolete CREOLE Plugins
C.1 Ontotext JapeC Compiler
C.2 Google Plugin
C.3 Yahoo Plugin
C.3.1 Using the YahooPR
C.4 Gazetteer Visual Resource - GAZE
C.4.1 Display Modes
C.4.2 Linear Definition Pane
C.4.3 Linear Definition Toolbar
C.4.4 Operations on Linear Definition Nodes
C.4.5 Gazetteer List Pane
C.4.6 Mapping Definition Pane
C.5 Google Translator PR
D Design Notes
D.1 Patterns
D.1.1 Components
D.1.2 Model, view, controller
D.1.3 Interfaces
D.2 Exception Handling
E Ant Tasks for GATE
E.1 Declaring the Tasks
E.2 The packagegapp task - bundling an application with its dependencies
E.2.1 Introduction
E.2.2 Basic Usage
E.2.3 Handling Non-Plugin Resources
E.2.4 Streamlining your Plugins
E.2.5 Bundling Extra Resources
E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml
F Named-Entity State Machine Patterns
F.1 Main.jape
F.2 first.jape
F.3 firstname.jape
F.4 name.jape
F.4.1 Person
F.4.2 Location
F.4.3 Organization
F.4.4 Ambiguities
F.4.5 Contextual information
F.5 name_post.jape
F.6 date_pre.jape
F.7 date.jape
F.8 reldate.jape
F.9 number.jape
F.10 address.jape
F.11 url.jape
F.12 identifier.jape
F.13 jobtitle.jape
F.14 final.jape
F.15 unknown.jape
F.16 name_context.jape
F.17 org_context.jape
F.18 loc_context.jape
F.19 clean.jape
G Part-of-Speech Tags used in the Hepple Tagger
H Copyright and Licence
IV The GATE Family: Cloud, MIMIR, Teamware
24 GATE Cloud
24.1 GATE Cloud services: an overview
24.2 Comparison with other systems
24.3 How to buy services
24.4 Pricing and discounts
24.5 Annotation Jobs on GATECloud.net
24.5.1 The Annotation Service Charges Explained
24.5.2 Annotation Job Execution in Detail
24.6 Running Custom Annotation Jobs on GATECloud.net
24.6.1 Preparing Your Application: The Basics
24.6.2 The GATECloud.net environment
25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
25.1 Introduction
25.2 Requirements for Multi-Role Collaborative Annotation Environments
25.2.1 Typical Division of Labour
25.2.2 Remote, Scalable Data Storage
25.2.3 Automatic annotation services
25.2.4 Workflow Support
25.3 Teamware: Architecture, Implementation, and Examples
25.3.1 Data Storage Service
25.3.2 Annotation Services
25.3.3 The Executive Layer
25.3.4 The User Interfaces
25.4 Practical Applications
26 GATE Mímir
Appendices
A Change Log
A.1 Version 8.0 (May 2014)
A.1.1 Major changes
A.1.2 Other new and improved plugins
A.1.3 Bug fixes and other improvements
A.1.4 For developers
A.2 Version 7.1 (November 2012)
A.2.1 New plugins
A.2.2 Library updates
A.2.3 GATE Embedded API changes
A.3 Version 7.0 (February 2012)
A.3.1 Major new features
A.3.2 Removal of deprecated functionality
A.3.3 Other enhancements and bug fixes
A.4 Version 6.1 (April 2011)
A.4.1 New CREOLE Plugins
A.4.2 Other new features and improvements
A.5 Version 6.0 (November 2010)
A.5.1 Major new features
A.5.2 Breaking changes
A.5.3 Other new features and bug fixes
A.6 Version 5.2.1 (May 2010)
A.7 Version 5.2 (April 2010)
A.7.1 JAPE and JAPE-related
A.7.2 Other Changes
A.8 Version 5.1 (December 2009)
A.8.1 New Features
A.8.2 JAPE improvements
A.8.3 Other improvements and bug fixes
A.9 Version 5.0 (May 2009)
A.9.1 Major New Features
A.9.2 Other New Features and Improvements
A.9.3 Specific Bug Fixes
A.10 Version 4.0 (July 2007)
A.10.1 Major New Features
A.10.2 Other New Features and Improvements
A.10.3 Bug Fixes and Optimizations
A.11 Version 3.1 (April 2006)
A.11.1 Major New Features
A.11.2 Other New Features and Improvements
A.11.3 Bug Fixes
A.12 January 2005
A.13 December 2004
A.14 September 2004
A.15 Version 3 Beta 1 (August 2004)
A.16 July 2004
A.17 June 2004
A.18 April 2004
A.19 March 2004
A.20 Version 2.2 – August 2003
A.21 Version 2.1 – February 2003
A.22 June 2002
B Version 5.1 Plugins Name Map
C Obsolete CREOLE Plugins
C.1 Ontotext JapeC Compiler
C.2 Google Plugin
C.3 Yahoo Plugin
C.3.1 Using the YahooPR
C.4 Gazetteer Visual Resource - GAZE
C.4.1 Display Modes
C.4.2 Linear Definition Pane
C.4.3 Linear Definition Toolbar
C.4.4 Operations on Linear Definition Nodes
C.4.5 Gazetteer List Pane
C.4.6 Mapping Definition Pane
C.5 Google Translator PR
D Design Notes
D.1 Patterns
D.1.1 Components
D.1.2 Model, view, controller
D.1.3 Interfaces
D.2 Exception Handling
E Ant Tasks for GATE
E.1 Declaring the Tasks
E.2 The packagegapp task - bundling an application with its dependencies
E.2.1 Introduction
E.2.2 Basic Usage
E.2.3 Handling Non-Plugin Resources
E.2.4 Streamlining your Plugins
E.2.5 Bundling Extra Resources
E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml
F Named-Entity State Machine Patterns
F.1 Main.jape
F.2 first.jape
F.3 firstname.jape
F.4 name.jape
F.4.1 Person
F.4.2 Location
F.4.3 Organization
F.4.4 Ambiguities
F.4.5 Contextual information
F.5 name_post.jape
F.6 date_pre.jape
F.7 date.jape
F.8 reldate.jape
F.9 number.jape
F.10 address.jape
F.11 url.jape
F.12 identifier.jape
F.13 jobtitle.jape
F.14 final.jape
F.15 unknown.jape
F.16 name_context.jape
F.17 org_context.jape
F.18 loc_context.jape
F.19 clean.jape
G Part-of-Speech Tags used in the Hepple Tagger
H Copyright and Licence