GATE Cloud Paralleliser
NOTE: this is a component of the GATE Cloud elastic text mining facility which is available from http://gatecloud.net/. This component currently has some dependencies on a specific file format from a previous project; contact us for details.
The GATE Cloud Paralelliser, the current version of which is codenamed A3, is a software tool aimed at parallel execution of automatic [semantic] annotation processes. It is intended to process batches comprising large numbers of documents in a robust, long-running process. We currently apply it to datasets in the 10s of millions of (long) documents. (For 100s of millions of documents we combine A3 with cloud or grid distribution mechanisms.) Its main characteristics are:
Scalability: It uses the appropriate number of parallel execution threads to consume 100% of the CPU capacity of dedicated servers. The number of threads used is configurable at start-up, and can be scaled up arbitrarily, according to the hardware. We recommend using a load factor of 1.5 threads per CPU core.
Flexibility: the A3 tool has a set of parameters that can be used to configure its behaviour. These can be used to select the GATE application being executed, the input protocol used for reading documents, the output protocol used for exporting the resulting annotations, the number of parallel threads to be used, the amount of RAM to be used.
Robustness: A3 is designed to run for extended periods of time (in fact it is designed to support a continuous running mode). To support this, it has been thoroughly tested and profiled, so as to avoid any possible memory leaks. Any errors and exceptions that may occur during processing are trapped and reported, without affecting the entire process. If the A3 process crashes for whatever reason (e.g. hardware failure, or power cut), the process can be restarted using exactly the same mechanism that was used to launch it originally. A3 will automatically identify the previous incomplete run, will parse the partial execution report file to find which documents were already processed successfully, and will resume execution from the point where the previous run stopped.
Low operating effort: A3 is designed as a batch process that does not require operator oversight or manual interventions. The only interaction with the system consists of (1) queuing batches for execution, and (2) analysing the execution reports to check the success state.
2. Installing A3
A3 is a Java application, and requires a Java 6 (or later) virtual machine to run. Due to the large memory requirements of complex GATE processing components over large amounts of data, a 64-bit JVM is recommended. A3 has been tested with 64-bit Java 6 from Sun on Linux and Apple on Mac OS X.
Installing A3 requires the following steps:
- Install Java (version 6, 64 bit recommended)
- set JAVA_HOME to the top level directory of your Java installation.
- The bash shell must be available on your PATH for the shell script used to run A3 (which uses some bash-specific syntax not available in the basic Bourne shell sh).
- Unpack the A3 software itself (provided as a zip file) to some location on disk. The A3 software distribution can be obtained on request from the Sheffield GATE team.
3. Running A3
In order to run A3, what is required is a valid install of the software (see installation instructions, above), access to the input documents (e.g. a directory on disk, or on a network share, containing the documents to be processed), and a batch specification file (see description, below).
The function of A3 is to process batches of documents by annotating them and saving the resulting annotations. A batch is a list of documents that originate from the same source, need to be processed with the same semantic annotation application, and should be saved to the same destination.
When launching the A3 process, the user needs to specify at least one batch to be processed. The operation of A3 comprises the following principal steps:
- for each batch queued for processing
- load and initialise the GATE application, creating one copy for each execution thread;
- if a partial report file exists, parse it, identify which documents have already been processed, and remove them from the current batch;
- otherwise create a new report file
- for each document in the current batch
- load the document from the source;
- process the document in the first available execution thread;
- dump the resulting annotations;
- add an element to the report file detailing the final state (success or failure), with some statistics in the case of success, and some details about the problem encountered in the case of failure.
- The existence of a partial report file signifies that the same batch was executed previously and that the execution has been interrupted.
- The formats for the batch specification and report files are described below.
- The output of the A3 process typically involves saving the created annotations in some format. Other output steps can added via configuration in the batch specification file (e.g. sending the annotated documents to the Mimir indexing engine).
3.1. Command line interface
A3 is currently being controlled using a command line interface. This may change in the future due to developments in the GATE cloud project.
The command line interface used by A3 is very simple and it consists of a single command with a small set of options. After changing to the directory where A3 is installed, the user can run A3 in one of two ways:
1) ./run-a3.sh [-t n] [-m nG] [-Dproperty=value] batch1.xml batch2.xml ... 2) ./run-a3.sh [-t n] [-m nG] [-Dproperty=value] -d inputDir outputDir
- [ and ] are used to mark optional parameters.
- -t can be used to specify the number of parallel execution threads to be used. This should be correlated with the number of CPU cores available to A3. We recommend using 1.5 times the number of cores. For example, on a 4 core machine, one would start 6 parallel execution threads using the command line option -t 6 .
- -m can be used to specify the amount of memory the A3 process can use for its heap allocation. We recommend allocating as much memory as possible to allow for optimal operation. Due to the use of parallel execution threads, A3 requires large amounts of RAM. The exact amounts of RAM needed depends on the complexity of the annotation application being executed and the maximum size of the input documents. For fairly complex processing, on large documents, we recommend allocating 2GB of RAM for each execution thread (i.e using the option -m 12G when running 6 threads). It is not recommended to allocate all the memory available in the system to the A3 heap, at least 2GB should be left for the operating system and other parts of the Java virtual machine. (For users familiar with Java, the -m option is passed as -Xmx to the Java virtual machine).
- Java system property settings can be supplied using -D options, e.g. -Djava.io.tmpdir=/data/tmp, which are passed through to the Java virtual machine unchanged. Note that there is no space between the -D and the property name.
In case 1, the rest of the command line consists of a list of one or more batch specification files (see description below), which will be executed in parallel, sharing a single thread pool (whose size is given by the -t option).
Example: a full invocation, with the recommended options, would then be:
./run-a3.sh -t 6 -m 12G batch1.xml
In case 2 (referred to as daemon mode), there should be exactly two arguments following the -d switch, which should refer to directories (inputDir and outputDir). In this mode, a3 scans the inputDir directory for any files with the extension .xml, which it assumes are batch specifications. If any are found it chooses one of them and runs the batch it defines. If the batch completes successfully the batch file is moved to the outputDir directory. This cycle repeats indefinitely - if there are no xml files in the input directory the process simply waits and checks again a few seconds later. Thus in this mode A3 acts as a daemon, processing batches as they appear in the input directory. To stop the process, create a file named shutdown.a3 in the input directory, which will cause the process to exit at the end of the current batch.
3.2. Batch definition file format
A3 gets its instructions about which annotation process to perform, on which documents, etc. from the batch specification file. This is an XML file, an example of which is included below:
<?xml version="1.0" encoding="UTF-8"?> <batch id="batch-0001" maxThreads="6"> <!-- There can be only one input specification --> <input documentRoot="/path/to/doc/root/directory" fileExtension=".xml.gz" mimeType="text/xml" encoding="UTF-8" compression="gzip">gate.cloud.io.FileInputHandler</input> <!-- There can be multiple output specifications, be one for each output file --> <output format="XCES" protocol="FILE_SYSTEM" dir="/data/a3/batches/batch-us-1k/out" fileExtension=".Measurement.xml" saveText="yes"> <!-- Which annotations to save to this file --> <annotationSet name="sam"> <annotationType name="Measurement"/> </annotationSet> </output> <!-- An example specification for outputting annotations in GATE XML format --> <output format="GATE_STANDOFF" protocol="FILE_SYSTEM" dir="/data/a3/batches/batch-us-1k/out" fileExtension=".gate.xml.gz" compress="yes"> <!-- all annotations from Original markups --> <annotationSet name="Original markups" /> <!-- specific types of annotations from the "sam" set --> <annotationSet name="sam"> <annotationType name="Measurement" /> <annotationType name="Section" /> <annotationType name="Reference" /> </annotationSet> <!-- All annotations from the default set (which does not have a name) --> <annotationSet /> </output> <!-- More output specifications here... --> <!-- Which GATE application to execute --> <application file="../../gate/applications/sam2-full.xgapp"/> <!-- Where to write the report file --> <report file="/data/a3/batches/batch-us-1k/report.xml"/> <!-- Which documents to process --> <documents> <ucid>US-20050205757-A1</ucid> <!-- More document IDs here... --> </documents> </batch>
The main elements of the batch specification file are:
<batch>: The root element. It has the attributes: id (an arbitrary string identifying the batch), and maxThreads (how many parallel threads can be used for this batch). The threads available to A3 (set using the command line interface) are shared by all batches, so threads are allocated to batches according to their availability. For example, if the first batch uses up all the available threads, then subsequent batches will have to wait for the completion of the first batch until they can get threads allocated to them.
<input>: Specifies how the documents to be annotated are obtained. The text of the element is the class name of an implementation of the gate.cloud.io.InputHandler interface. The attributes are converted into a map of parameters that are sent to the input handler class on construction. The example above uses the FileInputHandler, which accepts the following parameters:
- documentRoot: the top level directory containing the input documents;
- fileExtension: the extension of the files to be used for input;
- mimeType (optional): if provided, will be passed on to the GATE document loading routine;
- encoding (optional): if provided, will be passed on to the GATE document loading routine;
- compression (optional; permitted values "none", and "gzip"; defaults to "none"): are the input files compressed?
<output>: Multiple output elements are permitted. Each output element saves a set of annotations to a given file. The values currently supported for the protocol attribute are FILE_SYSTEM (which save to a file) and MIMIR_RPC (which sends the annotations to be indexed by the Mimir indexing engine running on a server). If the FILE_SYSTEM protocol is used, then the dir and fileExtension attributes should also be set (with the obvious semantics). The actual file name is derived from the UCID of each document; the location for the resulting output file in the sub-directory hierarchy is mimicking the input directory structure. If the compress attribute is set to yes then the output files will be compressed using the gzip algorithm (in which case you should ensure that the fileExtension ends with .gz).
<annotationSet> and <annotationType> sub-elements are used to select which annotations should be saved. To specify the default annotation set, omit the name attribute from the <annotationSet> element. An <annotationSet> element with no <annotationType> children indicates that all annotations in the specified set should be saved. For the GATE_STANDOFF format only, if there are no <annotationSet> elements then all annotations in all sets will be saved. For XCES you must specify at least one <annotationSet>.
<application>: Is used to specify which GATE application should be executed for the documents in the batch. The file attribute should point to a saved GATE application using .xgapp GATE-native format.
<report>: is used to specify the location of the report file produced for this batch.
<documents>: contains a list of <ucid> elements, each containing the UCID (Unique Character ID) of a document that is part of the current batch.
3.3. Report file format
During the processing of batches, A3 prints status information to a report file, specified in the batch definition.
An example report file, for a small batch of 6 documents is included below:
<?xml version='1.0' encoding='UTF-8'?> <cloudReport> <documents> <processResult ucid="US-20030121727-A1" returnCode="SUCCESS"> <fileSize>5194</fileSize> <executionTime>1886</executionTime> <statistics> <annotationCount type="sam:Reference" count="4" /> <annotationCount type="sam:Measurement" count="2" /> <annotationCount type="Original markups" count="144" /> </statistics> </processResult> <processResult ucid="US-20030091799-A1" returnCode="SUCCESS"> <fileSize>4620</fileSize> <executionTime>2323</executionTime> <statistics> <annotationCount type="sam:Reference" count="4" /> <annotationCount type="sam:Measurement" count="2" /> <annotationCount type="Original markups" count="82" /> </statistics> </processResult> <processResult ucid="US-20040109849-A1" returnCode="SUCCESS"> <fileSize>4376</fileSize> <executionTime>918</executionTime> <statistics> <annotationCount type="sam:Reference" count="3" /> <annotationCount type="sam:Measurement" count="2" /> <annotationCount type="Original markups" count="90" /> </statistics> </processResult> <processResult ucid="US-20050045426-A1" returnCode="SUCCESS"> <fileSize>3190</fileSize> <executionTime>639</executionTime> <statistics> <annotationCount type="sam:Reference" count="3" /> <annotationCount type="sam:Measurement" count="2" /> <annotationCount type="Original markups" count="68" /> </statistics> </processResult> <processResult ucid="US-20020074190-A1" returnCode="SUCCESS"> <fileSize>5911</fileSize> <executionTime>695</executionTime> <statistics> <annotationCount type="sam:Reference" count="5" /> <annotationCount type="sam:Measurement" count="2" /> <annotationCount type="Original markups" count="118" /> </statistics> </processResult> <processResult ucid="US-20010044522-A1" returnCode="SUCCESS"> <fileSize>4963</fileSize> <executionTime>616</executionTime> <statistics> <annotationCount type="sam:Reference" count="5" /> <annotationCount type="sam:Measurement" count="2" /> <annotationCount type="Original markups" count="105" /> </statistics> </processResult> </documents> <!--This shows the overall execution time for the whole batch. This value only includes the execution time of the last run (so if the batch was restarted, it will only include those documents that were processed last!--> <batchReport> <finalBatchState>FINISHED</finalBatchState> <totalDocuments>6</totalDocuments> <successfullyProcessed>6</successfullyProcessed> <withError>0</withError> <executionTime>10027</executionTime> </batchReport> </cloudReport>
The report includes a <documents> element, which contains a line with a <processResult> element for each document in the batch. The values for the returnCode attribute are SUCCESS or FAIL. In the case of success, some statistics are included, in the case of failure, a message describing the error encountered is included.
Finally, upon completing the execution for a whole batch, A3 will write some statistics describing the overall process, including the total execution time in milliseconds.