GATE hints and tips
1. Using the real-time corpus controller
Consider a corpus containing thousands of documents. The application may process half of them successfully and then fail on a single document. If the application terminates at this point, you first have to work out which documents were processed successfully, remove them from the corpus manually, and then restart processing on the remaining documents. This becomes tedious when the corpus contains more than one faulty document.
Real-time corpus controllers are useful in such a situation: if processing fails on one document, the controller skips that document and continues with the others. You may also want to limit the time allowed for processing a single document. The real-time corpus controllers take an init-time parameter called timeout for exactly this purpose. A document is therefore skipped not only when an exception is thrown while processing it, but also when the application takes longer than the specified time to process it.
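The skip-on-exception and skip-on-timeout behaviour described above can be illustrated with a minimal, library-free Java sketch. It is not GATE's implementation and uses no GATE classes: the DocumentTask interface, the processAll method and the document names are hypothetical, and the sketch only assumes that each document can be processed as an independent task with a time limit in milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SkipOnFailureSketch {

    /** Hypothetical stand-in for running the whole pipeline over one document. */
    interface DocumentTask {
        void process(String documentName) throws Exception;
    }

    /**
     * Processes every document in turn and returns the names of those that
     * were skipped, either because processing threw an exception or because
     * it exceeded timeoutMillis.
     */
    static List<String> processAll(List<String> documents, DocumentTask task, long timeoutMillis) {
        List<String> skipped = new ArrayList<>();
        for (String doc : documents) {
            // One executor per document, so a stuck task cannot block the next one.
            ExecutorService executor = Executors.newSingleThreadExecutor();
            Future<Object> future = executor.submit(() -> {
                task.process(doc);
                return null;
            });
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                skipped.add(doc);            // took too long: skip and move on
            } catch (ExecutionException e) {
                skipped.add(doc);            // processing threw: skip and move on
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                skipped.add(doc);
                return skipped;              // the caller was interrupted: stop early
            } finally {
                executor.shutdownNow();      // interrupts the worker if it is still running
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("ok.txt", "bad.txt", "slow.txt", "ok2.txt");
        List<String> skipped = processAll(docs, name -> {
            if (name.startsWith("bad")) throw new IllegalStateException("cannot parse " + name);
            if (name.startsWith("slow")) Thread.sleep(5000);
        }, 200);
        System.out.println("Skipped: " + skipped);
    }
}
```

In this sketch the loop always moves on to the next document, so a single faulty or slow document costs at most the timeout rather than the whole run.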
When a real-time corpus controller is used from the GUI, information about skipped documents is shown in the Messages tab. In other scenarios (e.g. batch processing), you may want to add a PR, such as a JAPE grammar, at the end of the application that adds a feature to the document or creates a special annotation marking successful processing. If something went wrong and the document was skipped, that annotation will not appear in the document.
Below, we provide an example JAPE grammar that can be used for such a purpose.
Phase: TheEnd
Input: Lookup Token
Options: control = once // important that it fires only once.

Rule: TheEnd
Priority: 100
(
  {Token}
):the_end
-->
:the_end.TheEnd = {rule = "the_end"}
After the pipeline has run, all you need to check is whether an annotation of type TheEnd is present in the document. If it is, everything went well; if not, the document was skipped, either because of an exception or because processing timed out.
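For batch runs, this check can be automated. The following is a minimal sketch, assuming the processed documents were saved in GATE's XML serialisation and that each annotation appears there as an Annotation element with a Type attribute. The class name TheEndCheck and the method hasAnnotation are hypothetical, and only the standard Java XML parser is used, not the GATE API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TheEndCheck {

    /**
     * Returns true if the given XML contains at least one Annotation element
     * whose Type attribute equals annotationType.
     * Assumption: annotations are serialised as <Annotation Type="..."> elements.
     */
    static boolean hasAnnotation(String gateXml, String annotationType) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(gateXml.getBytes(StandardCharsets.UTF_8)));
            NodeList annotations = dom.getElementsByTagName("Annotation");
            for (int i = 0; i < annotations.getLength(); i++) {
                Element annotation = (Element) annotations.item(i);
                if (annotationType.equals(annotation.getAttribute("Type"))) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            // Unparseable input is reported to the caller rather than treated as "skipped".
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<GateDocument><AnnotationSet>"
                + "<Annotation Id=\"1\" Type=\"TheEnd\" StartNode=\"0\" EndNode=\"3\"/>"
                + "</AnnotationSet></GateDocument>";
        System.out.println(hasAnnotation(xml, "TheEnd") ? "processed" : "skipped");
    }
}
```

Documents for which hasAnnotation returns false are the ones to reprocess or inspect by hand.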