GATE: releasing the missing links
Friday May 6th 2011
1. Introduction: Open Source Enterprise Search
The GATE team (gate.ac.uk) today release three new products:
- GATECloud.net, a scaleable pay-as-you-go service that makes GATE available in SaaS (Software as a Service) mode
- Mímir, an index server that unifies full text boolean search, annotation graph search and SPARQL semantic search
- Teamware, a collaborative annotation & curation environment for harnessing a distributed workforce and monitoring progress & results remotely in real time
All three are released under the open source AGPL licence (commercial options also available).
With this release GATE becomes the only open source solution that covers the entire text analysis and search lifecycle.
Now you can do enterprise search, business intelligence, voice of the customer, web mining, unstructured data management, scientific literature analysis etc. etc. etc. with a mature, 100% open source solution.
Also in this release:
- Version 6.1 of GATE Developer and GATE Embedded -- see the change log for details.
GATE Developer and GATE Embedded are available for download here; other members of the family are available on http://gatecloud.net/ and from our SourceForge pages. More details: the GATE family.
2. GATECloud.net
On GATECloud.net you can use GATE on the Amazon Elastic Compute Cloud via a simple point-and-click web tool to:
- run analysis processes using annotation tools provided by the GATE team
- upload your own analysis processes (created using GATE Developer)
- push the results into a Mímir index server
- manage annotation projects using GATE Teamware
The benefits of a cloud solution include:
- zero fixed costs: you don't buy software licences or server hardware, just pay for the compute time that you use
- near zero startup time: in a matter of minutes you can specify, provision and deploy the type of computation that used to take months of planning
- easy in, easy out: if you try it and don't like it, go elsewhere! You can even take the software with you: it's all open source
- someone else takes the admin load:
  - the GATE team from the University of Sheffield make sure you're running the best of breed technology for text, search and semantics
  - cloud providers' data center managers (e.g. at Amazon Inc.) make sure the hardware and operating platform for your work is scaleable, reliable and cheap
There are several other cloud-based systems out there that do some of what we do. Here are some differences:
- We're the only open source solution.
- We're the only customisable solution (we support a bring-your-own-annotator option as well as pre-packaged entity annotation services like other systems).
- We're the only end-to-end full lifecycle solution. (We don't just do entity extraction -- we do data preparation, inter-annotator agreement, quality assurance and control, data visualisation, indexing and search of full text/annotation graph/ontology/instance store, etc. etc. etc.)
- Bulk upload of documents to process, no need to use programming APIs.
- No recurring monthly costs, pay-per-use, billed per hour.
- No daily limit on number of documents to process.
- No limit on document size.
- Processing costs depend on overall data size, not on the number of documents.
- Web-based collaborative annotation tool to correct mistakes and create training and evaluation data.
- Speed: other systems price per document, whereas we price on processing time, so you cannot compare like with like (do you really want to compare the processing of individual tweets against 200-page technical reports?!). GATECloud is also heavily optimised for high volumes -- if you want to do low volumes, you can do them on your netbook (did we mention that it's all open source? :-)).
- Community: we've been here for more than 15 years, and our community of developers, users, third party suppliers and so on is second to none.
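To see why per-document and per-hour pricing are hard to compare, here is a minimal Python sketch. Every number in it (the hourly rate, the per-document rate, the throughput) is a made-up assumption for illustration only and is not a GATECloud.net price. It also assumes, simplistically, that processing time depends only on the amount of data.

```python
import math

# All figures are hypothetical, chosen only to contrast the two pricing
# models. They are NOT real GATECloud.net (or anyone else's) prices.
HOURLY_RATE = 1.00              # currency units per machine hour (assumed)
PER_DOCUMENT_RATE = 0.001       # currency units per document (assumed)
THROUGHPUT_MB_PER_HOUR = 500.0  # data processed per machine hour (assumed)


def per_hour_cost(total_mb: float) -> float:
    """Cost when billing by processing time, rounded up to whole hours."""
    hours = math.ceil(total_mb / THROUGHPUT_MB_PER_HOUR)
    return hours * HOURLY_RATE


def per_document_cost(num_documents: int) -> float:
    """Cost when billing by document count, whatever the document size."""
    return num_documents * PER_DOCUMENT_RATE


# Two corpora with the same overall size but very different shapes.
tweets = {"documents": 5_000_000, "total_mb": 1000.0}
reports = {"documents": 2_000, "total_mb": 1000.0}

for name, corpus in (("tweets", tweets), ("reports", reports)):
    print(name,
          "per-hour:", per_hour_cost(corpus["total_mb"]),
          "per-document:", per_document_cost(corpus["documents"]))
```

With these invented numbers the hourly bill is the same for both corpora, while the per-document bill differs by a factor of 2,500 purely because of document size.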
3. GATE Mímir
GATE Mímir multiparadigm indexing: concept search, full-text search and annotation structure search in one scaleable index.
Mímir is a multi-paradigm information management index and repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic meta-data (instance data). It supports queries that arbitrarily mix full-text, structural, linguistic and semantic constraints, and it can scale to hundreds of gigabytes of text. A typical rich search or semantic annotation project deals with large quantities of data of different kinds. Mímir provides a framework for implementing indexing and search functionality across all these data types.
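For a feel of what mixing full-text and annotation search means, here is a toy in-memory sketch in Python. It is not Mímir's API or query language: the class and method names (ToyIndex, annotation_then_word) are hypothetical, the ontology/SPARQL side is left out entirely, and nothing here is built to scale. It only shows a single query that combines an annotation constraint ("a Person") with a text constraint ("followed by the word said").

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str    # e.g. "Person", "Location"
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    doc_id: str
    tokens: list
    annotations: list


class ToyIndex:
    """Hypothetical toy index: one posting map for words, one for annotations."""

    def __init__(self):
        self.word_postings = {}  # word -> {doc_id: [token positions]}
        self.ann_postings = {}   # annotation type -> {doc_id: [Annotation]}

    def add(self, doc):
        for pos, tok in enumerate(doc.tokens):
            by_doc = self.word_postings.setdefault(tok.lower(), {})
            by_doc.setdefault(doc.doc_id, []).append(pos)
        for ann in doc.annotations:
            by_doc = self.ann_postings.setdefault(ann.type, {})
            by_doc.setdefault(doc.doc_id, []).append(ann)

    def annotation_then_word(self, ann_type, word, max_gap=0, **features):
        """Find an annotation of ann_type (with matching features) followed by
        word within max_gap tokens. Returns (doc_id, start, end) token spans."""
        hits = []
        words = self.word_postings.get(word.lower(), {})
        for doc_id, anns in self.ann_postings.get(ann_type, {}).items():
            positions = words.get(doc_id, [])
            for ann in anns:
                if any(ann.features.get(k) != v for k, v in features.items()):
                    continue
                for pos in positions:
                    if 0 <= pos - ann.end <= max_gap:
                        hits.append((doc_id, ann.start, pos + 1))
        return hits


index = ToyIndex()
index.add(Document(
    "d1",
    "Dr Smith said the trial in Sheffield succeeded".split(),
    [Annotation("Person", 0, 2, {"title": "Dr"}), Annotation("Location", 6, 7)],
))
index.add(Document("d2", "The committee said nothing".split(), []))

print(index.annotation_then_word("Person", "said"))  # prints [('d1', 0, 3)]
```

The second document contains the word "said" but no Person annotation, so only the first one matches. A plain full-text index could not tell the two apart.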
4. GATE Teamware
GATE Teamware is a workflow-based collaboration suite for manual and semi-automatic annotation and curation projects with distributed teams, QA, and process reporting.
It is a cost-effective environment for specifying and creating test and training data, enabling you to harness a widely distributed workforce and monitor progress & results remotely in real time. It’s also very easy to use: a new project can be up and running in less than five minutes.
5. The 2 Minute Guide to Helping People Find Stuff with GATE
1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth) -- call this your corpus.
2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology.
3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard.
5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded).
6. Use GATE Mímir to store the annotations relative to the ontology in a multiparadigm index server. (For techies: this sits in the backroom as a RESTful web service.)
7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity graphing, time series graphing, annotation structure search and (last but not least) boolean full text search. (More techy stuff: mash up these types of search with your existing UIs.)
Hey presto, you have state-of-the-art information management applying your ontology to your corpus (and a sustainable process)... But your users don't care. They're just happy because now they can find stuff.
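For techies wondering what "measure performance against the gold standard" boils down to, here is a minimal Python sketch. It is not GATE's own evaluation tooling: the evaluate function, the tuple format and the sample annotations are assumptions made up for illustration, and only strict matching (same offsets, same type) counts as correct.

```python
def evaluate(gold, predicted):
    """Strict-match precision, recall and F1 over two sets of annotations.

    Each annotation is a (start_offset, end_offset, type) tuple.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example data: what the human annotators marked up ...
gold = [(0, 8, "Person"), (27, 36, "Location"), (40, 44, "Date")]
# ... and what an automatic pipeline produced for the same document.
predicted = [
    (0, 8, "Person"),
    (27, 36, "Organization"),  # right span, wrong type: counts as an error
    (40, 44, "Date"),
    (50, 55, "Person"),        # spurious annotation
]

p, r, f = evaluate(gold, predicted)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # prints 0.50 0.67 0.57
```

Two of the four predicted annotations are right (precision 0.50) and two of the three gold annotations were found (recall 0.67). Rerunning a check like this after every pipeline change is what turns a one-off experiment into a sustainable process.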