GATE.ac.uk - releases/gate-8.0-build4825-ALL/doc/tao/splitch15.html

Chapter 15
Non-English Language Support

Plugins are available for processing the following languages: French, German, Italian, Chinese, Arabic, Romanian, Hindi, Russian, Bulgarian, and Cebuano. Some of the applications are quite basic and contain just a few useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are closer to toy systems, built as part of an exercise in language portability.

Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager in GATE Developer, and the relevant resources will be available in the Processing Resources set.

Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.

In most cases the relevant plugin directory also contains a directory called data, which holds some sample texts (in some cases annotated with NEs).

There are also a number of plugins, documented elsewhere in this manual, that default to processing English but can be configured to support other languages. These include the TaggerFramework (Section 23.3), the OpenNLP plugin (Section 23.25), the Numbers Tagger (Section 23.8.1), and the Snowball-based stemmer (Section 23.11). The LingPipe POS Tagger PR (Section 23.24.3) now includes two models for Bulgarian.

15.1 Language Identification

A common problem when handling multiple languages is determining the language of a document or of a section of a document. For example, patent documents often contain the abstract in more than one language. In such cases you may want to process only those sections written in English, or you may want to run different processing resources over the different sections depending on the language they are written in. Once documents or sections are annotated with their language, it is easy to apply different processing resources to the different sections using either a Conditional Corpus Pipeline or the Section-By-Section PR (Section 20.2.10). The problem is, of course, identifying the language.

The Language_Identification plugin contains a TextCat-based PR for performing language identification. The choice of languages used for categorization is specified through a configuration file, the URL of which is the PR's only initialization parameter.

The PR has the following runtime parameters.

annotationType: If this is supplied, the PR classifies the text underlying each annotation of the specified type and stores the result as a feature on that annotation. If this is left blank (null or empty), the PR classifies the text of each document and stores the result as a document feature.
annotationSetName: The annotation set used for input and output; ignored if annotationType is blank.
languageFeatureName: The name of the document or annotation feature used to store the results.

Unlike most other PRs (which produce annotations), this one adds either document features or annotation features. (To classify both whole documents and spans within them, use two instances of this PR.) Note that classification accuracy is better over long spans of text (paragraphs rather than sentences, for example).
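
As a rough illustration of how TextCat-style identification works, the sketch below ranks character n-grams by frequency and picks the language whose fingerprint has the smallest "out-of-place" distance from the text being classified. It is plain Python and independent of GATE; the function names, the toy sample texts, and the n-gram length and profile size are assumptions made for illustration, not the PR's actual implementation or settings.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get the maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))


def classify(text, fingerprints):
    """Return the language whose fingerprint is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(fingerprints, key=lambda lang: out_of_place(doc, fingerprints[lang]))


# Toy training texts (hypothetical); real fingerprints are built from much larger corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sits on the mat",
    "german": "der schnelle braune fuchs springt und die katze sitzt auf der matte",
}
FINGERPRINTS = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(classify("the dog and the cat are in the house", FINGERPRINTS))
```

Because the profile of a short text contains few n-grams with reliable ranks, distances computed over sentences are noisier than those computed over paragraphs, which is consistent with the accuracy note above.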

Note that an alternative language identification PR is available in the LingPipe plugin, which is documented in Section 23.24.5.

15.1.1 Fingerprint Generation

Whilst the TextCat-based PR supports a number of languages (not all of which are enabled in the default configuration file), there may be occasions where you need to support a new language, or where the language of domain-specific documents affects the classification. In these situations you can use the Fingerprint Generation PR included in the Language_Identification plugin to build new fingerprints from a corpus of documents.

The PR has no initialization parameters and is configured through the following runtime parameters:

annotationType: If this is supplied, the PR uses only the text underlying each annotation of the specified type to build the language fingerprint. If this is left blank (null or empty), the PR will instead use the whole of each document to create the fingerprint.
annotationSetName: The annotation set used for input; ignored if annotationType is blank.
fingerprintURL: The URL of the file in which the fingerprint should be stored. Note that this must be a file URL.
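
The effect of annotationType on fingerprint building can be sketched as follows. This is a minimal illustration in plain Python, not the PR's code: annotations are modelled as hypothetical (start, end) character offsets, and the fingerprint is assumed to be a frequency-ranked list of character n-grams, as in TextCat-style classifiers.

```python
from collections import Counter


def covered_text(doc_text, spans=None):
    """Text to fingerprint: the whole document if spans is empty, else only the covered spans."""
    if not spans:
        return doc_text
    return " ".join(doc_text[start:end] for start, end in spans)


def build_fingerprint(corpus, max_n=3, top_k=300):
    """Build one fingerprint from an iterable of (doc_text, spans) pairs."""
    counts = Counter()
    for doc_text, spans in corpus:
        for word in covered_text(doc_text, spans).lower().split():
            padded = f"_{word}_"  # mark word boundaries
            for n in range(1, max_n + 1):
                counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Most frequent n-grams first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked][:top_k]


# Only the French sentence (offsets 7 to 23) contributes to the fingerprint.
doc = "Title. Bonjour le monde."
fingerprint = build_fingerprint([(doc, [(7, 23)])])
print(fingerprint[:5])
```

Restricting the input to annotated spans is useful when documents mix languages or contain boilerplate that would otherwise dilute the fingerprint.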

15.2 French Plugin

The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp), and one which does not (french.gapp). Simply load the required application from the plugins/Lang_French directory. You do not need to load the plugin itself from GATE Developer’s Plugin Management Console. Note that the TreeTagger must first be installed and set up correctly (see Section 23.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. Both applications contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results; they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/Lang_French/data directory.

15.3 German Plugin

The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp), and one which does not (german.gapp). Simply load the required application from the plugins/Lang_German/resources directory. You do not need to load the plugin itself from GATE Developer’s Plugin Management Console. Note that the TreeTagger must first be installed and set up correctly (see Section 23.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. Both applications contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/Lang_German/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.

15.4 Romanian Plugin

The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/Lang_Romanian/resources directory. You do not need to load the plugin itself from the GATE Developer’s Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.

15.5 Arabic Plugin

The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/Lang_Arabic/resources directory. You do not need to load the plugin itself from GATE Developer’s Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Two types of gazetteer are used in this application: one derived automatically from training data (the Arabic inferred gazetteer), and one created manually. Some other applications are also included which perform quite specific tasks and can generally be ignored. For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 13.7.

15.6 Chinese Plugin

The Chinese plugin contains two components: a simple application for Chinese NE recognition (chinese.gapp) and a component called “Chinese Segmenter”.

In order to use the former, simply load the application from the plugins/Lang_Chinese/resources directory. You do not need to load the plugin itself from GATE Developer’s Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other applications to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).

15.6.1 Chinese Word Segmentation

Unlike English, Chinese text does not use a symbol (or delimiter) such as a blank space to explicitly separate a word from the surrounding words. Automatic processing of Chinese text may therefore need a system to recognise the words in the text, a problem known as Chinese word segmentation. The plugin described in this section performs this task. It is based on our work using the Perceptron learning algorithm for the Chinese word segmentation task of Sighan 2005¹ [Li et al. 05c]. Our Perceptron-based system achieved very good performance in the Sighan-05 task.

The plugin is called Lang_Chinese and is available in the GATE distribution. The corresponding processing resource’s name is Chinese Segmenter PR. Once you load the PR into GATE, you may put it into a Pipeline application. Note that it does not process a corpus of documents, but a directory of documents provided as a parameter (see the description of the parameters below). The plugin can be used to learn a model from segmented Chinese text as training data. It can also use the learned model to segment Chinese text. The plugin can use different learning algorithms to learn different models. It can deal with different character encodings for Chinese text, such as UTF-8, GB2312 or BIG5. These options can be selected by setting the run-time parameters of the plugin.

The plugin has five run-time parameters, which are described below.

learningAlg is a String variable which specifies the learning algorithm used for producing the model. Currently it has two values, PAUM and SVM, representing the two popular learning algorithms Perceptron and SVM, respectively. The default value is PAUM.
Generally speaking, SVM may perform better than Perceptron, in particular for small training sets. On the other hand, Perceptron learning is much faster than SVM learning. Hence, if you have a small training set, you may want to use SVM to obtain a better model. However, if you have a big training set, which is typical for the Chinese word segmentation task, you may want to use Perceptron, because SVM learning may take too long. In addition, with a big training set the performance of the Perceptron model is quite similar to that of the SVM model. See [Li et al. 05c] for an experimental comparison of SVM and Perceptron on Chinese word segmentation.
learningMode selects one of the two modes of using the plugin: learning a model from training data, or applying a learned model to segment Chinese text. Accordingly it has two values, SEGMENTING and LEARNING. The default value is SEGMENTING, meaning segmenting the Chinese text.
Note that you first need to learn a model before you can use it to segment text. Several models learned from the training data of the Sighan-05 bakeoff are available for this plugin, which you can use to segment your Chinese text. The provided models are described below.
modelURL specifies a URL referring to a directory containing the model. If the plugin is in the LEARNING runmode, the learned model will be put into the directory. If it is in the SEGMENTING runmode, the plugin will use the model stored in the directory to segment the text. The models learned from the Sighan-05 bakeoff training data are discussed below.
textCode specifies the encoding of the text used. For example it can be UTF-8, BIG5, GB2312 or any other encoding for Chinese text. Note that when you segment Chinese text using a learned model, the text should use the same encoding as the training text from which the model was obtained.
textFilesURL specifies a URL referring to a directory containing the Chinese documents. All the documents contained in this directory (but not those contained in its sub-directories, if any) will be used as input data. In the LEARNING runmode, those documents contain the segmented Chinese text used as training data. In the SEGMENTING runmode, the text in those documents will be segmented. The segmented text will be stored in the corresponding documents in the sub-directory called segmented.
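
If your documents are not in the encoding a model was trained on, one option is to re-encode them before segmentation. The sketch below is a standalone Python helper, not part of GATE; the function name and directory layout are hypothetical. It re-encodes only the files directly inside a directory, mirroring the way textFilesURL ignores sub-directories.

```python
import tempfile
from pathlib import Path


def transcode_top_level(src_dir, dst_dir, source_encoding, target_encoding):
    """Re-encode each file directly inside src_dir; sub-directories are skipped."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file():
            text = path.read_text(encoding=source_encoding)
            (dst / path.name).write_text(text, encoding=target_encoding)
            written.append(path.name)
    return written


# Demonstration in a temporary directory: one top-level file, one nested file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "gb"
    (src / "nested").mkdir(parents=True)
    (src / "a.txt").write_text("中文", encoding="gb2312")
    (src / "nested" / "b.txt").write_text("中文", encoding="gb2312")
    written = transcode_top_level(src, Path(tmp) / "utf8", "gb2312", "utf-8")
    round_trip = (Path(tmp) / "utf8" / "a.txt").read_bytes()

print(written)
```

The output directory can then be supplied as textFilesURL, with textCode set to the target encoding.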

The following PAUM models are distributed with the plugin and are available as compressed zip files under the plugins/Lang_Chinese/resources/models directory. Please unzip them before use. The models were learned using the PAUM learning algorithm from the corpora provided by the Sighan-05 bakeoff task.

The model learned from the PKU training data in UTF-8 encoding is available as model-paum-pku-utf8.zip.
The model learned from the PKU training data in GB2312 encoding is available as model-paum-pku-gb.zip.
The model learned from the AS training data in UTF-8 encoding is available as model-as-utf8.zip.
The model learned from the AS training data in BIG5 encoding is available as model-as-big5.zip.

As you can see, the models were learned from different training data and from different encodings of the same training data. The PKU training data are news articles published in mainland China and use simplified Chinese, while the AS training data are news articles published in Taiwan and use traditional Chinese. If your text is in simplified Chinese, you can use the models trained on the PKU data. If your text is in traditional Chinese, you need to use the models trained on the AS data. If your data are in GB2312 encoding or any compatible encoding, you need to use the model trained on the corpus in GB2312 encoding.
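
The selection rule above can be written down as a small lookup. This is only an illustrative Python sketch: the file names are those listed above, while the function name and the "simplified"/"traditional" labels are assumptions introduced here.

```python
# Distributed PAUM models, keyed by (script, encoding).
MODELS = {
    ("simplified", "utf-8"): "model-paum-pku-utf8.zip",
    ("simplified", "gb2312"): "model-paum-pku-gb.zip",
    ("traditional", "utf-8"): "model-as-utf8.zip",
    ("traditional", "big5"): "model-as-big5.zip",
}


def choose_model(script, encoding):
    """Return the zip file name of the distributed model matching script and encoding."""
    key = (script.lower(), encoding.lower())
    if key not in MODELS:
        raise ValueError(f"no distributed model for {script} text in {encoding}")
    return MODELS[key]


print(choose_model("simplified", "GB2312"))
```

Combinations that are not distributed (for example simplified Chinese in BIG5) raise an error here; in practice you would either re-encode the text or learn a new model in LEARNING mode.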

Note that the segmented Chinese text (whether used as training data or produced by this plugin) uses a blank space to separate each word from its surrounding words. Hence, if your data are in a Unicode encoding such as UTF-8, you can use the GATE Unicode Tokeniser to process the segmented text and add Token annotations representing the Chinese words. Once you have annotations for all the Chinese words, you can perform further processing such as POS tagging and named entity recognition.
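
To make the output format concrete, the following sketch turns space-separated segmented text into word spans with character offsets, which is roughly the information a tokeniser would record as Token annotations. It is plain Python for illustration only; the function name is hypothetical and this is not how the GATE Unicode Tokeniser is implemented.

```python
import re


def word_spans(segmented):
    """Return (start, end, word) for each whitespace-delimited word in segmented text."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", segmented)]


for start, end, word in word_spans("我 爱 北京"):
    print(start, end, word)
```

Note that the offsets refer to the segmented text, which contains the inserted spaces, not to the original unsegmented document.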

15.7 Hindi Plugin

The Hindi plugin (‘Lang_Hindi’) contains a set of resources for basic Hindi NE recognition which mirror the ANNIE resources but are customised to the Hindi language. You need to have the ANNIE plugin loaded first in order to load any of these PRs. With the Hindi plugin, you can create an application similar to ANNIE but replacing the ANNIE PRs with the default PRs from the plugin.

15.8 Russian Plugin

The Russian plugin (Lang_Russian) contains a set of resources for a Russian IE application which mirrors the construction of ANNIE. This includes custom components for part-of-speech tagging, morphological analysis and gazetteer lookup. A number of ready-made applications are also available which combine these resources in various ways.

15.9 Bulgarian Plugin

The Bulgarian plugin (Lang_Bulgarian) contains a GATE PR which integrates the BulStem stemmer into GATE. Currently no other Bulgarian-specific PRs are available, so the stemmer should be used with the Unicode tokenizer and a sentence splitter to process Bulgarian language documents.

¹See http://www.sighan.org/bakeoff2005/ for the Sighan-05 task.
