GATE Cloud
The growth of unstructured content on the internet has resulted in an increased need for researchers in diverse fields to run language processing and text mining on large-scale datasets, many of which are impossible to process in reasonable time on standard desktops. However, in order to take advantage of the on-demand compute power and data storage on the cloud, NLP researchers currently have to rewrite or adapt their algorithms.
Therefore, we have now adapted the GATE infrastructure (and its JAPE rule-based and machine learning engines) to the cloud and thus enabled researchers to run their GATE applications without significant overhead. In addition to lowering the barrier to entry, GATE Cloud also reduces the time required to carry out large-scale NLP experiments by allowing researchers to harness the on-demand compute power of the cloud.
Cloud computing means many things in many contexts. On GATE Cloud it means:
zero fixed costs: you don’t buy software licences or server hardware, just pay for the compute time that you use.
near zero startup time: in a matter of minutes you can specify, provision and deploy the type of computation that used to take months of planning.
easy in, easy out: if you try it and don’t like it, go elsewhere! You can even take the software with you; it’s all open-source.
cloud providers’ data center managers (we use Amazon Inc.) make sure the hardware and operating platform for your work are scalable, reliable and cheap.
GATE is (and always will be) free, but machine time, training, dedicated support and bespoke development are not. Using GATE Cloud you can rent cloud time to process large batches of documents on vast server farms or academic clusters. You can push a terabyte of annotated data into an index server and replicate the data across the world. Or just purchase training services and support for the various tools in the GATE family.
GATE Cloud offers several types of services:
Run a pre-packaged annotation pipeline such as ANNIE or TwitIE. Individual documents can be processed free of charge using a REST API (rate limits apply) or larger batches of documents can be processed using the paid service described below.
Uniquely among online text-mining platforms, the batch-mode service also allows you to build your own custom pipeline in GATE Developer and upload it to run on the cloud infrastructure.
Rent a dedicated server to index your documents using GATE Mímir (chapter 26), or to collect social media data via Twitter’s streaming APIs.
For an up-to-date list of available services see https://cloud.gate.ac.uk/shopfront.
24.2 Using GATE Cloud services
The GATE Cloud platform is designed to make it easy for you to explore the available resources and experiment with them to find one (or more) that suits your needs. You can browse the available services and filter the pipelines and dedicated servers by tag. You can "try before you buy" – the detail page for each pipeline has a simple tool to allow you to paste in or upload a small sample of text, run the pipeline over the text, and browse the resulting annotations.
Once you have found a pipeline of interest you can use the on-line REST API to process documents free of charge. The basic quota allows you to process 1,200 documents per day at an average rate of 2 per second, but higher quotas are available for research users or by commercial arrangement with the GATE team. To use the API, first sign up for an account on GATE Cloud, then visit your account management page to generate an API key. There are links to client libraries and API documentation on the GATE Cloud site.
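A client that wants to stay within a quota like the one above can pace its own requests. The following is a minimal sketch and not part of any GATE Cloud client library: the class name `QuotaPacer` and its parameters are hypothetical, it makes no network calls, and it approximates the "average rate" by enforcing a fixed minimum gap between requests, which may be stricter than the service actually requires. How the service itself counts a "day" is also an assumption here (a fixed 24-hour window).

```python
from dataclasses import dataclass


@dataclass
class QuotaPacer:
    """Client-side bookkeeping for a daily document quota and a per-second rate.

    Hypothetical helper: the defaults mirror the basic quota quoted in the
    text (1,200 documents per day, 2 per second). The server-side
    enforcement rules are not known and are not modelled here.
    """
    daily_limit: int = 1200
    per_second: float = 2.0
    _day: int = -1                       # index of the current 24-hour window
    _used_today: int = 0                 # documents sent in the current window
    _last_sent: float = float("-inf")    # time of the last permitted request

    def try_acquire(self, now: float) -> bool:
        """Return True if one more document may be sent at time `now` (seconds)."""
        day = int(now // 86400)
        if day != self._day:             # new 24-hour window: reset the counter
            self._day = day
            self._used_today = 0
        if self._used_today >= self.daily_limit:
            return False                 # daily quota exhausted
        if now - self._last_sent < 1.0 / self.per_second:
            return False                 # too soon after the previous request
        self._used_today += 1
        self._last_sent = now
        return True
```

A caller would pass a timestamp such as `time.time()` before each request and wait briefly whenever `try_acquire` returns `False`. Taking the time as an argument, rather than reading the clock inside the class, keeps the logic easy to test.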
For other services – batch processing with one of the standard pipelines or one of your own, and dedicated Twitter collection or Mímir servers – you will need to buy credit vouchers from the University of Sheffield online shop. Vouchers are available in any multiple of £5, and you can buy additional vouchers at any time. Note that you must use exactly the same email address on the University shop as on your GATE Cloud account, in order for us to be able to match up your purchases and apply the credit to your account automatically. With the batch mode service there are no limits on the number or size of documents you can process: you simply pay for the amount of processing time you use and the amount of data you want to store, with a simple and transparent pricing structure.
As with the free quotas, we can offer discounts on the price of paid services for research users – contact us for more details.
24.3 Annotation Jobs on GATE Cloud
GATE Cloud annotation jobs provide a way to quickly process large numbers of documents using a GATE application, with the results exported to files in GATE XML, JSON, or XCES format, and/or sent to a Mímir server for indexing. Annotation jobs are optimized for the processing of large batches of documents (tens of thousands or more) rather than processing a small number of documents on the fly (GATE Developer is best suited for the latter).
To submit an annotation job you first choose which GATE application you want to run. GATE Cloud provides some standard pre-packaged applications (e.g., ANNIE, TwitIE), or you can provide your own application. You then upload the documents you wish to process packaged up into ZIP or (optionally compressed) TAR archives, Twitter JSON bundles or ARC/WARC files (as produced by the Heritrix web crawler), and decide which annotations you would like returned as output, and in what format.
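Packaging a directory of documents into a ZIP archive for upload can be done with standard tools. As an illustration, here is a minimal sketch using Python's standard `zipfile` module. The function name `package_documents` is hypothetical, and the archive layout (entries named by their path relative to the source directory) is an assumption, not a documented GATE Cloud requirement.

```python
import zipfile
from pathlib import Path


def package_documents(source_dir: str, archive_path: str) -> int:
    """Write every regular file under source_dir into a ZIP archive.

    Returns the number of files added. Entry names are paths relative to
    source_dir, written with forward slashes.
    """
    root = Path(source_dir)
    archive = Path(archive_path).resolve()
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself if it happens
            # to live inside the directory being packaged.
            if path.is_file() and path.resolve() != archive:
                zf.write(path, arcname=path.relative_to(root).as_posix())
                count += 1
    return count
```

For example, `package_documents("my_corpus", "my_corpus.zip")` would produce a single archive ready to upload, where `my_corpus` is a placeholder directory name.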
When the job is started, GATE Cloud takes the document archives you provided and divides them up into manageable-sized batches of up to 15,000 documents. Each batch is then processed using the GATE paralleliser and the generated output files are packaged up and made available for you to download from the GATE Cloud site when the job has completed.
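The batching step can be pictured with a short sketch. This is only an illustration of splitting a document list into batches of at most 15,000 items. The function name is hypothetical, and the real platform may also take other factors (such as document size) into account when forming batches.

```python
from typing import List, Sequence

MAX_BATCH_SIZE = 15_000  # upper bound on batch size quoted in the text


def split_into_batches(documents: Sequence[str],
                       max_size: int = MAX_BATCH_SIZE) -> List[List[str]]:
    """Split documents into consecutive batches of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [list(documents[start:start + max_size])
            for start in range(0, len(documents), max_size)]
```

Under this scheme a job of 40,000 documents would be divided into three batches, each of which can then be processed independently and in parallel.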
GATE Cloud annotation jobs run on a public commercial cloud, which charges us per hour for the processing time we consume. As GATE Cloud allows you to run your own GATE application, and different GATE applications can process radically different numbers of documents in a given amount of time (depending on the complexity of the application), we cannot adopt the "£x per thousand documents" pricing structure used by other similar services. Instead, GATE Cloud passes on to you, the user, the per-hour charges we pay to the cloud provider plus a small mark-up to cover our own costs.
For a given annotation job, we add up the total amount of compute time taken to process all the individual batches of documents that make up your job (counted in seconds), round this number up to the next full hour and multiply this by the hourly price for the particular job type to get the total cost of the job. For example, if your annotation job was priced at £1 per hour and split into three batches that each took 56 minutes of compute time then the total cost of the job would be £3 (168 minutes of compute time, rounded up to 3 hours). However, if each batch took 62 minutes to process then the total cost would be £4 (186 minutes, rounded up to 4 hours). In addition we charge a data storage fee of (currently) £0.04 per GB per month for the data you store within the GATE Cloud platform. Data charges accrue pro-rata on a daily basis, so 2GB stored for half a month will cost the same as 1GB stored for a whole month.
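The pricing rule can be written out as a short calculation. The sketch below is illustrative only. The function names are hypothetical, a 30-day month is assumed for the pro-rata storage fee, and a total that is already an exact number of hours is assumed not to be rounded up further.

```python
from decimal import Decimal
from typing import Iterable, Union

SECONDS_PER_HOUR = 3600


def compute_cost(batch_seconds: Iterable[int], hourly_price: Decimal) -> Decimal:
    """Sum the per-batch compute time, round up to whole hours, apply the price."""
    total_seconds = sum(batch_seconds)
    billed_hours = -(-total_seconds // SECONDS_PER_HOUR)  # ceiling division
    return billed_hours * hourly_price


def storage_cost(gigabytes: Union[int, Decimal],
                 days_stored: int,
                 days_in_month: int = 30,  # assumption, for illustration
                 price_per_gb_month: Decimal = Decimal("0.04")) -> Decimal:
    """Pro-rata daily storage fee: price x GB x (days stored / days in month)."""
    return (price_per_gb_month * Decimal(gigabytes)
            * Decimal(days_stored) / Decimal(days_in_month))
```

With an hourly price of `Decimal("1")`, three 56-minute batches (`[56 * 60] * 3`) are billed as 3 hours and three 62-minute batches as 4 hours, matching the worked example. Likewise `storage_cost(2, 15)` equals `storage_cost(1, 30)`. `Decimal` is used instead of floating point so that currency amounts stay exact.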
While the job is running, we apply charges to your account whenever a job has consumed ten CPU hours since the last charge (which takes considerably less than ten real hours as several batches will typically execute in parallel). If your GATE Cloud account runs out of funds at any time, all your currently-executing annotation jobs will be suspended. You will be able to resume the suspended jobs once you have topped up your account to clear the negative balance. Note that it is not possible to download the result files from completed jobs if your GATE Cloud account is overdrawn.
24.3.2 Where to find more details
Detailed documentation on the GATE Cloud platform can be found at https://cloud.gate.ac.uk/info/help, including
Documentation for the various REST APIs
Details of how to prepare your own custom pipeline to run as a batch job
A Java client library and command-line tool for the REST APIs can be found at https://github.com/GateNLP/cloud-client, with extensive documentation on its own GitHub wiki, along with example code showing how you can call the APIs from other programming languages.
Finally, you can use the GATE-users mailing list if you have any questions not covered by the documentation.
24.4 GATE Cloud Pipeline URLs
Many of the annotation pipelines covered in this guide have a GATE Cloud equivalent. These are listed and linked to below.