GATE: a full-lifecycle open source solution for text processing
(Impatient? See the 2-minute guide.)
GATE is over 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from multi-million-euro research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents [1].
GATE is free, open source software; users can obtain free support from the user and developer community via GATE.ac.uk, or commercial support from our industrial partners. We are the biggest open source language processing project, with a development team more than double the size of those of the largest comparable projects (many of which are integrated with GATE [2]). More than €5 million has been invested in GATE development [3]; our objective is to make sure that this continues to be money well spent for all GATE's users.
This note summarises the GATE software and process and gives examples of some of their uses. We believe that GATE is the leading system of its type, but as scientists we have to advise you not to take our word for it; that's why we've measured our software in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE, DUC, ...). We invite you to give it a try, to get involved with the GATE community, and to contribute to human language science, engineering and development.
2. The GATE Family
GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE is:
- an IDE, GATE Developer [4]: an integrated development environment for language processing components, bundled with a very widely used Information Extraction system and a comprehensive set of other plugins
- a web app, GATE Teamware: a collaborative annotation environment for factory-style semantic annotation projects, built around a workflow engine and a heavily-optimised backend service infrastructure
- a framework, GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more
- an architecture: a high-level organisational picture of how language processing software is composed
- a process for the creation of robust and maintainable services
We also develop:
- a cloud computing solution for hosted large-scale text processing (GATE Cloud.net)
- GATE Mímir (Multi-paradigm Information Management Index and Repository): a massively scalable multi-paradigm index built on Ontotext's semantic repository family, GATE's annotation structures database, and full-text indexing from MG4J
- a wiki/CMS (GATE Wiki.sf.net), mainly to host our own websites and as a testbed for some of our experiments
For more information see the family pages.
One of our original motivations was to remove the necessity for solving common engineering problems before doing useful research, or re-engineering before deploying research results into applications. Core functions of GATE take care of the lion's share of the engineering:
- modelling and persistence of specialised data structures
- measurement, evaluation, benchmarking (never believe a computing researcher who hasn't measured their results in a repeatable and open setting!)
- visualisation and editing of annotations, ontologies, parse trees, etc.
- a finite state transduction language for rapid prototyping and efficient implementation of shallow analysis methods (JAPE)
- extraction of training instances for machine learning
- pluggable machine learning implementations (Weka, YALE, SVM Light, ...)
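Two of the core functions listed above, modelling annotation data and measuring results, can be illustrated with a minimal sketch. It is plain Java and does not use GATE's own classes: the names `AnnotationEval` and `Span` are hypothetical, and the strict matching rule (same type, same offsets) is an assumption made for the example rather than a description of how GATE's evaluation tools score.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Illustrative only: hypothetical types, not part of GATE. */
final class AnnotationEval {

    /** A stand-off annotation: a typed span of character offsets over a text. */
    static final class Span {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Span)) return false;
            Span s = (Span) o;
            return type.equals(s.type) && start == s.start && end == s.end;
        }

        @Override public int hashCode() {
            return Objects.hash(type, start, end);
        }
    }

    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    /** Number of distinct system spans that exactly match a gold span. */
    static int matches(List<Span> gold, List<Span> system) {
        Set<Span> shared = new HashSet<>(system);
        shared.retainAll(new HashSet<>(gold));
        return shared.size();
    }

    /** Fraction of distinct system spans that are correct. */
    static double precision(List<Span> gold, List<Span> system) {
        return ratio(matches(gold, system), new HashSet<>(system).size());
    }

    /** Fraction of distinct gold spans that the system found. */
    static double recall(List<Span> gold, List<Span> system) {
        return ratio(matches(gold, system), new HashSet<>(gold).size());
    }

    /** Harmonic mean of precision and recall (0 when both are 0). */
    static double f1(List<Span> gold, List<Span> system) {
        double p = precision(gold, system);
        double r = recall(gold, system);
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Offsets refer to the text "Sheffield hosts GATE."
        List<Span> gold = Arrays.asList(
                new Span("Location", 0, 9),
                new Span("Organization", 16, 20));
        List<Span> system = Arrays.asList(
                new Span("Location", 0, 9),   // correct
                new Span("Person", 16, 20),   // right span, wrong type
                new Span("Location", 10, 15)); // spurious
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n",
                precision(gold, system), recall(gold, system), f1(gold, system));
    }
}
```

Because the annotations are stored separately from the text, the same scoring code works for any annotation type; a lenient variant could count overlapping spans as partial matches instead of requiring identical offsets.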
On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).
GATE version 1 was written in the mid-1990s; at the turn of the new millennium we completely rewrote the system in Java; version 5 was released in June 2009.
2.1. Component model
One of the reasons GATE has lasted well and been successful is that the entire core is broken down into reusable chunks (using the original Java component model).
[Figure: summary of some of the APIs available in GATE Embedded]
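The idea of a core built from reusable, composable chunks can be sketched in plain Java without GATE on the classpath. This is not the GATE Embedded API: every name below (`PipelineSketch`, `Doc`, `Ann`, `Component`, `Pipeline` and the two example components) is hypothetical and chosen only to show how independent components might share a document and build on each other's annotations.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical types, not GATE Embedded classes. */
final class PipelineSketch {

    /** A typed span of character offsets [start, end). */
    static final class Ann {
        final String type;
        final int start;
        final int end;

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** A document: text plus the annotations added by components so far. */
    static final class Doc {
        final String text;
        final List<Ann> annotations = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** A reusable processing component. */
    interface Component {
        void process(Doc doc);
    }

    /** Adds one "Token" annotation per whitespace-separated word. */
    static final class WhitespaceTokeniser implements Component {
        @Override public void process(Doc doc) {
            int i = 0;
            int n = doc.text.length();
            while (i < n) {
                while (i < n && Character.isWhitespace(doc.text.charAt(i))) i++;
                int start = i;
                while (i < n && !Character.isWhitespace(doc.text.charAt(i))) i++;
                if (i > start) doc.annotations.add(new Ann("Token", start, i));
            }
        }
    }

    /** Builds on earlier Token annotations: tags tokens starting with an upper-case letter. */
    static final class CapitalisedTagger implements Component {
        @Override public void process(Doc doc) {
            // Iterate over a snapshot so that new annotations can be added safely.
            for (Ann a : new ArrayList<>(doc.annotations)) {
                if (a.type.equals("Token") && Character.isUpperCase(doc.text.charAt(a.start))) {
                    doc.annotations.add(new Ann("Capitalised", a.start, a.end));
                }
            }
        }
    }

    /** A pipeline is itself a component, so pipelines can be nested and reused. */
    static final class Pipeline implements Component {
        private final List<Component> stages = new ArrayList<>();

        Pipeline add(Component c) {
            stages.add(c);
            return this;
        }

        @Override public void process(Doc doc) {
            for (Component c : stages) c.process(doc);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("GATE runs in Sheffield");
        new Pipeline()
                .add(new WhitespaceTokeniser())
                .add(new CapitalisedTagger())
                .process(doc);
        for (Ann a : doc.annotations) System.out.println(a);
    }
}
```

Each component depends only on the shared document model, so stages can be swapped or reordered without touching the others; the tagger, for instance, works with any tokeniser that produces `Token` annotations.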
3. First Cousins - the Ontotext Family
Complementing GATE's development and collaborative distributed annotation tools, KIM provides a straightforward deployment option (front-end, back-end).
- Ontotext KIM: multiparadigm search UIs for information management, navigation and search, including KIM conceptual query, KIM CORE, and the ANNIC (Annotations in Context) tool
Many systems developed with GATE are embedded into existing applications of one sort or another. The Ontotext family provides a good alternative to this approach: GATE-based annotation combined with a KIM/Mímir index and search engine is a robust and mature solution for text analysis in enterprise search and similar applications.
4. Where next?
Hungry for more? A summary of the main sources of documentation and where to get help:
- key documentation from Sheffield
- documentation elsewhere
- the mailing list
- it's hell being a nerd... just when you've figured out wikis...
- send us 3 years' pocket money and your stamp collection
[1] Rumours that we're planning to send several of the development team to Antarctica on one-way tickets are, of course, false, libellous and wishful thinking.
[2] Our philosophy is reuse not reinvention, so we integrate and interoperate with other systems, e.g. LingPipe, OpenNLP, UIMA, and many more specific tools.
[3] This is the figure for direct Sheffield-based investment only and is therefore an underestimate.
[4] GATE Developer and GATE Embedded are bundled, and in older distributions were referred to just as "GATE".