Chapter 24
GATE Cloud

The growth of unstructured content on the internet has increased the need for researchers in diverse fields to run language processing and text mining on large-scale datasets, many of which are impossible to process in reasonable time on standard desktops. However, to take advantage of the on-demand compute power and data storage of the cloud, NLP researchers currently have to rewrite or adapt their algorithms.

Therefore, we have now adapted the GATE infrastructure (and its JAPE rule-based and machine learning engines) to the cloud and thus enabled researchers to run their GATE applications without a significant overhead. In addition to lowering the barrier to entry, GATE Cloud also reduces the time required to carry out large-scale NLP experiments by allowing researchers to harness the on-demand compute power of the cloud.

Cloud computing means many things in many contexts. On GATE Cloud it means:

GATE is (and always will be) free, but machine time, training, dedicated support and bespoke development are not. Using GATE Cloud you can rent cloud time to process large batches of documents on vast server farms or academic clusters. You can push a terabyte of annotated data into an index server and replicate the data across the world. Or you can simply purchase training services and support for the various tools in the GATE family.

24.1 GATE Cloud services: an overview

GATE Cloud offers several types of services:

For an up-to-date list of available services see https://cloud.gate.ac.uk/shopfront.

24.2 Using GATE Cloud services

The GATE Cloud platform is designed to make it easy for you to explore the available resources and experiment with them to find one (or more) that suits your needs. You can browse the available services and filter the pipelines and dedicated servers by tag. You can "try before you buy" – the detail page for each pipeline has a simple tool to allow you to paste in or upload a small sample of text, run the pipeline over the text, and browse the resulting annotations.

Once you have found a pipeline of interest you can use the online REST API to process documents free of charge. The basic quota allows you to process 1,200 documents per day at an average rate of 2 per second, but higher quotas are available for research users or by commercial arrangement with the GATE team. To use the API, first sign up for an account on GATE Cloud, then visit your account management page to generate an API key. There are links to client libraries and API documentation on the GATE Cloud site.
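The basic quota figures above imply some simple client-side arithmetic when planning a run against the REST API. The following is a minimal sketch, not part of any GATE Cloud client library: the function names are hypothetical, and it assumes that the daily quota and the average rate are the only limits that apply.

```python
import math

DAILY_QUOTA = 1200      # documents per day (basic quota stated above)
MAX_RATE_PER_SEC = 2    # average documents per second (basic quota stated above)


def min_interval_seconds(rate_per_sec=MAX_RATE_PER_SEC):
    """Smallest average gap between requests that stays within the rate."""
    return 1.0 / rate_per_sec


def days_needed(n_documents, daily_quota=DAILY_QUOTA):
    """Number of days needed to process n_documents under the daily quota."""
    return math.ceil(n_documents / daily_quota)


def seconds_to_exhaust_quota(daily_quota=DAILY_QUOTA,
                             rate_per_sec=MAX_RATE_PER_SEC):
    """Time to use up one day's quota when sending at the maximum average rate."""
    return daily_quota / rate_per_sec
```

Under these assumptions a client sending at the full average rate would use up the basic daily quota in ten minutes, and a collection of 5,000 documents would take five days, which is one reason larger collections may be better served by a higher quota or by the batch service.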

For other services – batch processing with one of the standard pipelines or one of your own, and dedicated Twitter collection or Mímir servers – you will need to buy credit vouchers from the University of Sheffield online shop. Vouchers are available in any multiple of £5, and you can buy additional vouchers at any time. Note that you must use exactly the same email address on the University shop as on your GATE Cloud account, so that we can match up your purchases and apply the credit to your account automatically. With the batch mode service there are no limits on the number or size of documents you can process; you simply pay for the amount of processing time you use and the amount of data you want to store, with a simple and transparent pricing structure.

As with the free quotas, we can offer discounts on the price of paid services for research users – contact us for more details.

24.3 Annotation Jobs on GATE Cloud

GATE Cloud annotation jobs provide a way to quickly process large numbers of documents using a GATE application, with the results exported to files in GATE XML, JSON, or XCES format, and/or sent to a Mímir server for indexing. Annotation jobs are optimized for the processing of large batches of documents (tens of thousands or more) rather than processing a small number of documents on the fly (GATE Developer is best suited for the latter).

To submit an annotation job you first choose which GATE application you want to run. GATE Cloud provides some standard pre-packaged applications (e.g., ANNIE, TwitIE), or you can provide your own application. You then upload the documents you wish to process packaged up into ZIP or (optionally compressed) TAR archives, Twitter JSON bundles or ARC/WARC files (as produced by the Heritrix web crawler), and decide which annotations you would like returned as output, and in what format.

When the job is started, GATE Cloud takes the document archives you provided and divides them up into manageable-sized batches of up to 15,000 documents. Each batch is then processed using the GATE paralleliser and the generated output files are packaged up and made available for you to download from the GATE Cloud site when the job has completed.
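The batching step can be pictured with a short sketch. This is an illustration only: the text does not say how GATE Cloud chooses batch boundaries, so the code assumes simple consecutive slicing, and `split_into_batches` is a hypothetical helper rather than a GATE Cloud function.

```python
def split_into_batches(documents, max_batch_size=15000):
    """Split a list of documents into consecutive batches.

    Every batch holds at most max_batch_size documents; only the last
    batch may be smaller. The 15,000 default is the upper limit quoted
    in the text; the slicing strategy itself is an assumption.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [documents[i:i + max_batch_size]
            for i in range(0, len(documents), max_batch_size)]


# Example: 40,000 document identifiers end up in three batches.
batches = split_into_batches(list(range(40000)))
batch_sizes = [len(batch) for batch in batches]
```

With 40,000 input documents this yields two full batches of 15,000 and a final batch of 10,000. Each batch is an independent unit of work, which is what allows several batches to be processed in parallel.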

24.3.1 The Annotation Service Charges Explained

GATE Cloud annotation jobs run on a public commercial cloud, which charges us per hour for the processing time we consume. As GATE Cloud allows you to run your own GATE application, and different GATE applications can process radically different numbers of documents in a given amount of time (depending on the complexity of the application), we cannot adopt the "£x per thousand documents" pricing structure used by other similar services. Instead, GATE Cloud passes on to you, the user, the per-hour charges we pay to the cloud provider plus a small mark-up to cover our own costs.

For a given annotation job, we add up the total amount of compute time taken to process all the individual batches of documents that make up your job (counted in seconds), round this number up to the next full hour and multiply this by the hourly price for the particular job type to get the total cost of the job. For example, if your annotation job was priced at £1 per hour and split into three batches that each took 56 minutes of compute time then the total cost of the job would be £3 (168 minutes of compute time, rounded up to 3 hours). However, if each batch took 62 minutes to process then the total cost would be £4 (186 minutes, rounded up to 4 hours). In addition we charge a data storage fee of (currently) £0.04 per GB per month for the data you store within the GATE Cloud platform. Data charges accrue pro-rata on a daily basis, so 2GB stored for half a month will cost the same as 1GB stored for a whole month.
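The pricing rule can be written out as a small calculation. The sketch below is a reading of the rule as described here, not GATE Cloud's actual billing code: the function names are hypothetical, a total that is already a whole number of hours is assumed not to be rounded up further, and a 30-day month is assumed for the pro-rata storage fee.

```python
SECONDS_PER_HOUR = 3600


def compute_cost(batch_seconds, hourly_price):
    """Cost of a job: total batch compute time, rounded up to whole hours."""
    total_seconds = sum(batch_seconds)
    # Integer ceiling division avoids floating-point rounding surprises.
    billed_hours = -(-total_seconds // SECONDS_PER_HOUR)
    return billed_hours * hourly_price


def storage_cost(gigabytes, days, price_per_gb_month=0.04, days_in_month=30):
    """Pro-rata storage fee; days_in_month=30 is an assumption."""
    return gigabytes * (days / days_in_month) * price_per_gb_month


# The two worked examples from the text, at a price of 1 pound per hour.
cost_56 = compute_cost([56 * 60] * 3, 1)   # three batches of 56 minutes
cost_62 = compute_cost([62 * 60] * 3, 1)   # three batches of 62 minutes
```

Here `cost_56` comes to 3 and `cost_62` to 4, matching the totals of £3 and £4 in the example, and `storage_cost(2, 15)` gives the same result as `storage_cost(1, 30)`. Note that the rounding is applied once to the job total, not to each batch separately.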

While the job is running, we apply charges to your account whenever a job has consumed ten CPU hours since the last charge (which takes considerably less than ten real hours as several batches will typically execute in parallel). If your GATE Cloud account runs out of funds at any time, all your currently-executing annotation jobs will be suspended. You will be able to resume the suspended jobs once you have topped up your account to clear the negative balance. Note that it is not possible to download the result files from completed jobs if your GATE Cloud account is overdrawn.

24.3.2 Where to find more details

Detailed documentation on the GATE Cloud platform can be found at https://cloud.gate.ac.uk/info/help.

A Java client library and command-line tool for the REST APIs can be found at https://github.com/GateNLP/cloud-client, with extensive documentation on its own GitHub wiki, along with example code showing how you can call the APIs from other programming languages.

Finally, you can use the GATE-users mailing list if you have any questions not covered by the documentation.

24.4 GATE Cloud Pipeline URLs

Many of the annotation pipelines covered in this guide have a GATE Cloud equivalent. These are listed and linked to below.

Type                            Pipeline
General purpose                 ANNIE
General purpose                 ANNIE+Measurements
General purpose                 POS and Morphology Analyzer
General purpose                 Noun Phrase Chunker
General purpose                 Measurement Annotator
General purpose                 OpenNLP
General purpose                 Custom Annotation Job
General purpose                 Mímir
Domain specific                 TwitIE
Domain specific                 Twitter User Classification
Domain specific                 Language Identification for Tweets
Domain specific                 Part of Speech Tagger for Tweets
Domain specific                 Social Media Tokenizer
Non-English general purpose     German Named Entity Recognizer
Non-English general purpose     French Named Entity Recognizer
Non-English general purpose     Romanian Named Entity Recognizer
Non-English general purpose     Russian Named Entity Recognizer (basic)
Non-English general purpose     Russian Named Entity Recognizer
Non-English general purpose     CYMRIE Welsh Named Entity Recognizer
Non-English domain specific     Social Media Tokenizer French
Non-English domain specific     Social Media Tokenizer German
Non-English domain specific     French NER for Tweets
Non-English domain specific     German NER for Tweets