GATE.ac.uk - releases/gate-8.4-build5748-ALL/doc/tao/splitch18.html

Chapter 18
Parsers

18.1 RASP Parser

RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English, developed by the Natural Language and Computational Linguistics group at the University of Sussex.

This plugin, ‘Parser_RASP’, developed by DigitalPebble, provides four wrapper PRs that call the RASP modules as external programs, as well as a JAPE component that translates the output of the ANNIE POS Tagger (Section 6.6).

RASP2 Tokenizer: This PR requires Sentence annotations and creates Token annotations with a string feature. Note that sentence-splitting must be carried out before tokenization; the RegEx Sentence Splitter (see Section 6.5) is suitable for this. (Alternatively, you can use the ANNIE Tokenizer (Section 6.2) and then the ANNIE Sentence Splitter (Section 6.4); their output is compatible with the other PRs in this plugin.)
RASP2 POS Tagger: This requires Token annotations and creates WordForm annotations with pos, probability, and string features.
RASP2 Morphological Analyser: This requires WordForm annotations (from the POS Tagger) and adds lemma and suffix features.
RASP2 Parser: This requires the preceding annotation types and creates multiple Dependency annotations to represent a parse of each sentence.
RASP POS Converter: This PR requires Token annotations with a category feature as produced by the ANNIE POS Tagger (see Section 6.6) and creates WordForm annotations in the RASP format. The ANNIE POS Tagger and this Converter can together be used as a substitute for the RASP2 POS Tagger.

Here are some examples of corpus pipelines that can be correctly constructed with these PRs.

RegEx Sentence Splitter
RASP2 Tokenizer
RASP2 POS Tagger
RASP2 Morphological Analyser
RASP2 Parser

RegEx Sentence Splitter
RASP2 Tokenizer
ANNIE POS Tagger
RASP POS Converter
RASP2 Morphological Analyser
RASP2 Parser

ANNIE Tokenizer
ANNIE Sentence Splitter
RASP2 POS Tagger
RASP2 Morphological Analyser
RASP2 Parser

ANNIE Tokenizer
ANNIE Sentence Splitter
ANNIE POS Tagger
RASP POS Converter
RASP2 Morphological Analyser
RASP2 Parser
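
The ordering constraints behind these pipelines can be checked mechanically. The following is a minimal sketch in Python, not part of the plugin: the table of required and produced annotation types is a simplified reading of the PR descriptions above, the entries for the ANNIE PRs and the exact inputs of the RASP2 Parser are assumptions, and the name missing_inputs is hypothetical.

```python
# Simplified model: each PR maps to (required, produced) annotation types.
# "Token.category" stands for the category feature on Token annotations.
PRS = {
    "RegEx Sentence Splitter": (set(), {"Sentence"}),
    "ANNIE Tokenizer": (set(), {"Token"}),
    # Assumption: the ANNIE splitter and tagger inputs are not stated in this section.
    "ANNIE Sentence Splitter": ({"Token"}, {"Sentence"}),
    "ANNIE POS Tagger": ({"Token", "Sentence"}, {"Token.category"}),
    "RASP2 Tokenizer": ({"Sentence"}, {"Token"}),
    "RASP2 POS Tagger": ({"Token"}, {"WordForm"}),
    "RASP POS Converter": ({"Token.category"}, {"WordForm"}),
    "RASP2 Morphological Analyser": ({"WordForm"}, {"WordForm.lemma", "WordForm.suffix"}),
    # Assumption: "the preceding annotation types" is read as these three.
    "RASP2 Parser": ({"Sentence", "Token", "WordForm"}, {"Dependency"}),
}


def missing_inputs(pipeline):
    """Return (PR name, missing types) for every PR whose inputs are not yet available."""
    available = set()
    problems = []
    for name in pipeline:
        requires, produces = PRS[name]
        lacking = requires - available
        if lacking:
            problems.append((name, sorted(lacking)))
        available |= produces
    return problems


# The second example pipeline from the text above.
pipeline = [
    "RegEx Sentence Splitter",
    "RASP2 Tokenizer",
    "ANNIE POS Tagger",
    "RASP POS Converter",
    "RASP2 Morphological Analyser",
    "RASP2 Parser",
]
print(missing_inputs(pipeline))
```

Under this model all four pipelines listed above come back with no missing inputs, while a pipeline that starts with the RASP2 Tokenizer is reported as lacking Sentence annotations.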

Further documentation is included in the directory gate/plugins/Parser_RASP/doc/.

The RASP package, which provides the external programs, is available from the RASP web page.

RASP is only supported on Linux. Trying to run it on any other operating system will generate an exception with the message: ‘The RASP cannot be run on any other operating systems except Linux.’

It must be correctly installed on the same machine as GATE, and must be installed in a directory whose path does not contain any spaces (this is a requirement of the RASP scripts as well as the wrapper). Before trying to run the scripts for the first time, edit rasp.sh and rasp_parse.sh to set the correct value for the shell variable RASP, which should be the file system pathname where you have installed the RASP tools (for example, RASP=/opt/RASP or RASP=/usr/local/RASP). You will need to enter the same path for the initialization parameter raspHome for the POS Tagger, Morphological Analyser, and Parser PRs.

(On some systems the arch command used in the scripts is not available; a work-around is to comment that line out and add arch='ix86_linux', for example.)
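
The two installation constraints above (Linux only, no spaces in the path) are easy to check before setting raspHome. This is a small sketch under stated assumptions: the function name rasp_install_problems is hypothetical, and the check is independent of the plugin, which raises its own exception.

```python
import platform


def rasp_install_problems(rasp_home, system=None):
    """List the reasons why RASP could not be run from rasp_home on this machine."""
    # platform.system() returns "Linux", "Windows", "Darwin", ...
    system = system or platform.system()
    problems = []
    if system != "Linux":
        problems.append("RASP is only supported on Linux")
    if any(ch.isspace() for ch in rasp_home):
        problems.append("installation path must not contain spaces")
    return problems


print(rasp_install_problems("/opt/RASP", system="Linux"))
print(rasp_install_problems("/opt/my rasp", system="Linux"))
```

An empty list means neither constraint is violated; it does not show that the RASP scripts themselves are configured correctly.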

(The previous version of the RASP plugin can now be found in plugins/Obsolete/rasp.)

18.2 SUPPLE Parser

SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the ‘best’ parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context-free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g., lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE Wrapper stores syntactic information produced by the parser in the GATE document in the form of parse annotations, which contain a bracketed representation of the parse, and semantics annotations, which contain the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see Section 18.2.4).

18.2.1 Requirements

The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licensed SICStus Prolog (http://www.sics.se/sicstus; SUPPLE supports versions 3 and 4), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.

18.2.2 Building SUPPLE

The SUPPLE plugin must be compiled before it can be used, so you will require a suitable Java SDK (GATE itself requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the Parser_SUPPLE directory under plugins, and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the Parser_SUPPLE directory and running:

ant swi

For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.

18.2.3 Running the Parser in GATE

In order to parse a document you will need to construct an application that has:

tokeniser
splitter
POS-tagger
Morphology
SUPPLE Parser with parameters:
    mapping file (config/mapping.config)
    feature table file (config/feature_table.config)
    parser file (supple.plcafe or supple.sicstus or supple.swi)
    prolog implementation (shef.nlp.supple.prolog.PrologCafe, shef.nlp.supple.prolog.SICStusProlog3, shef.nlp.supple.prolog.SICStusProlog4, shef.nlp.supple.prolog.SWIProlog or shef.nlp.supple.prolog.SWIJavaProlog¹)

You can take a look at build.xml to see examples of invocation for the different implementations.

Note that prior to GATE 3.1, the parser file parameter was of type java.io.File. From 3.1 it is of type java.net.URL. If you have a saved application (.gapp file) from before GATE 3.1 which includes SUPPLE, it will need to be updated to work with the new version. Instructions on how to do this can be found in the README file in the SUPPLE plugin directory.

18.2.4 Viewing the Parse Tree

GATE Developer provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded, then open a document in GATE Developer that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select ‘Edit with syntax tree viewer’. This viewer can also be used with the constituency output of the Stanford Parser PR (Section 18.3).

18.2.5 System Properties

The SICStusProlog (3 and 4) and SWIProlog implementations work by calling the native Prolog executable, passing data back and forth in temporary files. The location of the Prolog executable is specified by a system property:

for SICStus: supple.sicstus.executable - default is to look for sicstus.exe (Windows) or sicstus (other platforms) on the PATH.
for SWI: supple.swi.executable - default is to look for plcon.exe (Windows) or swipl (other platforms) on the PATH.

If your Prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI Prolog is typically installed as pl; most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to Prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.

For details of how to pass system properties to GATE, see the end of Section 2.3.
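
The lookup rule described in this section (use the system property if it is set, otherwise fall back to a platform-specific default name and search the PATH) can be sketched as follows. This only mimics the documented behaviour; it is not the wrapper's code, and the function names are hypothetical.

```python
import shutil

# Default executable names, as listed above.
DEFAULTS = {
    "supple.sicstus.executable": {"windows": "sicstus.exe", "other": "sicstus"},
    "supple.swi.executable": {"windows": "plcon.exe", "other": "swipl"},
}


def executable_name(prop, properties, is_windows):
    """Return the name or path to look for: the property value if set, else the default."""
    if prop in properties:
        return properties[prop]
    return DEFAULTS[prop]["windows" if is_windows else "other"]


def resolve(prop, properties, is_windows=False):
    """Search the PATH for the chosen executable; None means it was not found."""
    return shutil.which(executable_name(prop, properties, is_windows))


print(executable_name("supple.swi.executable", {}, is_windows=False))
print(executable_name("supple.swi.executable",
                      {"supple.swi.executable": "/opt/swi-prolog/bin/pl"},
                      is_windows=False))
```

Calling resolve on your own machine is a quick way to see whether the default name would be found before starting GATE.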

18.2.6 Configuration Files

Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file.

Mapping File

The mapping file specifies how annotations produced using GATE are to be passed to the parser. The file is composed of a number of pairs of lines. The first line in a pair specifies a GATE annotation we want to pass to the parser; it includes the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType. The second line of the pair specifies how to encode the GATE annotation in a SUPPLE syntactic category; this line also includes a number of features and values. As an example consider the mapping:

Gate;AnnotationType=Token;category=DT;string=&S
SUPPLE;category=dt;m_root=&S;s_form=&S

It specifies how a determiner (‘DT’) will be translated into a category ‘dt’ for the parser. The construct ‘&S’ is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically, a token like ‘The’ recognised as a DT by the POS tagger will be mapped into the following category:

dt(s_form:'The',m_root:'The',m_affix:'_',text:'_').

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

It specifies that an annotation of type ‘Lookup’ in GATE is mapped into a category ‘list_np’ with specific features and values. More specifically, a token like ‘Mary’ identified in GATE as a Lookup will be mapped into the following SUPPLE category:

list_np(s_form:'Mary',m_root:'_',m_affix:'_',
text:'_',ne_tag:'person',ne_type:'person_first',gender:'female').
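
Each mapping pair can be read as a pattern (the Gate line) and a template (the SUPPLE line). The sketch below illustrates that reading; it is not the wrapper's implementation. It ignores the annotation set, treats any value starting with ‘&’ as a variable, and the function names are hypothetical.

```python
def parse_line(line):
    """Split 'Gate;k=v;...' or 'SUPPLE;k=v;...' into (prefix, {k: v})."""
    prefix, *pairs = line.strip().split(";")
    return prefix, dict(pair.split("=", 1) for pair in pairs)


def apply_mapping(gate_line, supple_line, ann_type, features):
    """Return (category, features) for one annotation, or None if the Gate line does not match."""
    _, pattern = parse_line(gate_line)
    _, template = parse_line(supple_line)
    if pattern.pop("AnnotationType") != ann_type:
        return None
    bindings = {}
    for key, expected in pattern.items():
        actual = features.get(key)
        if actual is None:
            return None
        if expected.startswith("&"):
            bindings[expected] = actual   # variable: bind it to the annotation's value
        elif expected != actual:
            return None                   # constant: must match exactly
    category = template.pop("category")
    # Replace bound variables in the template; constants are copied unchanged.
    return category, {key: bindings.get(value, value) for key, value in template.items()}


result = apply_mapping(
    "Gate;AnnotationType=Token;category=DT;string=&S",
    "SUPPLE;category=dt;m_root=&S;s_form=&S",
    "Token",
    {"category": "DT", "string": "The"},
)
print(result)
```

Features that the template does not mention (m_affix and text in the ‘dt’ output shown earlier) are not produced at this stage; in the output above they appear with the placeholder value ‘_’.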

Feature Table

The feature table file specifies SUPPLE ‘lexical’ categories and their features. As an example, an entry in this file is:

n;s_form;m_root;m_affix;text;person;number

which specifies which features a noun category should be written with, and in which order. In this case:

n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:...).
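
To illustrate how a feature table entry fixes the order of features, the following sketch writes a category term from an entry and a set of feature values. It is an illustration only: the ‘dt’ entry is an assumption inferred from the determiner output shown in the Mapping File section, the quoting follows the examples there, and the function names are hypothetical.

```python
def parse_feature_table(lines):
    """Map each category name to its ordered feature list ('n;s_form;...' style entries)."""
    table = {}
    for line in lines:
        if line.strip():
            category, *features = line.strip().split(";")
            table[category] = features
    return table


def render(category, values, table):
    """Write one category term in table order, using '_' for features without a value."""
    parts = ["%s:'%s'" % (feature, values.get(feature, "_")) for feature in table[category]]
    return "%s(%s)." % (category, ",".join(parts))


table = parse_feature_table([
    "n;s_form;m_root;m_affix;text;person;number",
    "dt;s_form;m_root;m_affix;text",   # assumed entry, inferred from the 'dt' example
])
term = render("dt", {"m_root": "The", "s_form": "The"}, table)
print(term)
```

The order of the keys in the values dictionary does not matter; only the order in the feature table entry does. The sketch does not escape quote characters inside values.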

18.2.7 Parser and Grammar

The parser builds a semantic representation compositionally, and a ‘best parse’ algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature-valued grammar. Each Category entry has the form:

Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type (see Section 5.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.

Syntactic rules are specified in Prolog with the predicate rule(LHS,RHS) where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD ⇒ N (‘a basic noun phrase head is composed of a noun’) is written as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N),
[n(m_root:R,number:N)]).

where the feature ‘sem’ is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.
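
As a worked illustration of how the variables in this rule are instantiated, the sketch below substitutes the values of a single noun category into the left-hand side. It performs plain substitution rather than Prolog unification, the feature values ‘dog’ and ‘sing’ are hypothetical, and the function name is not part of SUPPLE.

```python
def bnp_head_from_noun(noun):
    """Build the bnp_head features that the rule above licenses for one n(...) category."""
    root, number = noun["m_root"], noun["number"]   # R and N in the rule
    # sem:E^[[R,E],[number,E,N]] with R and N bound; E remains a variable.
    sem = "E^[[%s,E],[number,E,%s]]" % (root, number)
    return {"sem": sem, "number": number}


head = bnp_head_from_noun({"m_root": "dog", "number": "sing"})
print(head["sem"])
```

The number value is shared between the noun and the resulting bnp_head because both occurrences use the same variable N in the rule.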

The full grammar of this distribution can be found in the prolog/grammar directory; the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built and the compiled version is used for parsing.

18.2.8 Mapping Named Entities

SUPPLE has a Prolog grammar which deals with named entities; the only information required is the Lookup annotations produced by GATE, which are specified in the mapping file. However, you may want to pass named entities identified with your own JAPE grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between GATE named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:

Gate;AnnotationType=Date;string=&S
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity ‘Date’ into a syntactic category ‘sem_cat’. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule, for example:

rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),[
sem_cat(s_form:F,text:TEXT,type:'Date',kind:KIND,name:NAME)]).

is used to parse a ‘Date’ into a named entity in SUPPLE which in turn will be parsed into a noun phrase.

18.2.9 Upgrading from BuChart to SUPPLE

In theory upgrading from BuChart to SUPPLE should be relatively straightforward. Basically any instance of BuChart needs to be replaced by SUPPLE. Specific changes which must be made are:

The compiled parser files are now supple.swi, supple.sicstus, or supple.plcafe
The GATE wrapper parameter buchartFile is now SUPPLEFile, and it is now of type java.net.URL rather than java.io.File. Details of how to compensate for this in existing saved applications are given in the SUPPLE README file.
The Prolog wrappers now start shef.nlp.supple.prolog instead of shef.nlp.buchart.prolog
The mapping.conf file now has lines starting SUPPLE; instead of Buchart;
Most importantly the main wrapper class is now called shef.nlp.supple.SUPPLE

Making these changes to existing code should be trivial and will allow applications to benefit from future improvements to SUPPLE.
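
The mapping file change in the list above is purely textual, so it can be applied with a few lines of code. This is a minimal sketch, assuming the file has been read into a list of lines; the function name is hypothetical and no such tool ships with the plugin.

```python
def migrate_mapping_lines(lines):
    """Rewrite BuChart-style mapping lines so that they start with 'SUPPLE;'."""
    old, new = "Buchart;", "SUPPLE;"
    return [new + line[len(old):] if line.startswith(old) else line for line in lines]


migrated = migrate_mapping_lines([
    "Gate;AnnotationType=Token;category=DT;string=&S",
    "Buchart;category=dt;m_root=&S;s_form=&S",
])
print("\n".join(migrated))
```

Only the line prefix is changed; lines starting with ‘Gate;’ and all feature values are left as they are.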

18.3 Stanford Parser

The Stanford Parser is a probabilistic parsing system implemented in Java by Stanford University’s Natural Language Processing Group. Data files are available from Stanford for parsing Arabic, Chinese, English, and German.

This PR (gate.stanford.Parser) acts as a wrapper around the Stanford Parser and translates GATE annotations to and from the data structures of the parser itself. The plugin is supplied with the unmodified jar file and one English data file obtained from Stanford. Stanford’s software itself is subject to the full GPL.

The parser itself can be trained on other corpora and languages, as documented on the website, but this plugin does not provide a means of doing so. Trained data files are not necessarily compatible between different versions of the parser.

The current versions of the Stanford parser and this PR are threadsafe. Multiple instances of the PR with the same or different model files can be used simultaneously.

18.3.1 Input Requirements

Documents to be processed by the Parser PR must already have Sentence and Token annotations, such as those produced by either of the sentence splitters (Sections 6.4 and 6.5) and the ANNIE English Tokeniser (Section 6.2).

If the reusePosTags parameter is true, then the Token annotations must have category features with compatible POS tags. The tags produced by the ANNIE POS Tagger are compatible with Stanford’s parser data files for English (which also use the Penn treebank tagset).

18.3.2 Initialization Parameters

parserFile: the path to the trained data file; the default value points to the English data file² included with the GATE distribution. You can also use other files downloaded from the Stanford Parser website or produced by training the parser.
mappingFile: the optional path to a mapping file: a flat, two-column file which the wrapper can use to ‘translate’ tags. A sample file is included.³ By default this value is null and mapping is ignored.
tlppClass: an implementation of TreebankLangParserParams, used by the parser itself to extract the dependency relations from the constituency structures. The default value is compatible with the English data file supplied. Please refer to the Stanford NLP Group’s documentation and the parser’s javadoc for a further explanation.

18.3.3 Runtime Parameters

annotationSetName: the name of the annotationSet used for input (Token and Sentence annotations) and output (SyntaxTreeNode and Dependency annotations, and category and dependencies features added to Tokens).
debug: a boolean value which controls the verbosity of the wrapper’s output.
reusePosTags: if true, the wrapper will read category features (produced by an earlier POS-tagging PR) from the Token annotations and force the parser to use them.
useMapping: if this is true and a mapping file was loaded when the PR was initialized, the POS and syntactic tags produced by the parser will be translated using that file. If no mapping file was loaded, this parameter is ignored.
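
The interaction between mappingFile and useMapping can be sketched as follows. The file format details are assumptions: columns are taken to be whitespace-separated, tags without an entry are left unchanged, and the tag pairs in the example are hypothetical rather than taken from the sample file. The function names are not part of the plugin.

```python
def load_tag_map(lines):
    """Read a two-column mapping; lines that do not have exactly two columns are skipped."""
    mapping = {}
    for line in lines:
        columns = line.split()
        if len(columns) == 2:
            mapping[columns[0]] = columns[1]
    return mapping


def translate(tag, mapping, use_mapping=True):
    """Translate a tag if mapping is enabled and a mapping was loaded; otherwise return it unchanged."""
    if use_mapping and mapping:
        return mapping.get(tag, tag)
    return tag


tag_map = load_tag_map(["NNP PROPN", "", "VBZ VERB"])   # hypothetical tag pairs
print(translate("NNP", tag_map))
print(translate("NNP", tag_map, use_mapping=False))
```

With an empty mapping the use_mapping flag has no effect, which mirrors the statement above that useMapping is ignored when no mapping file was loaded.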

The following boolean parameters switch on and off the various types of output that the parser can produce. Any or all of them can be true, but if all are false the PR will simply print a warning instead of running the parser, to save time.

addPosTags: if this is true, the wrapper will add category features to the Token annotations.
addConstituentAnnotations: if true, the wrapper will mark the syntactic constituents with SyntaxTreeNode annotations that are compatible with the Syntax Tree Viewer (see Section 18.2.4).
addDependencyAnnotations: if true, the wrapper will add Dependency annotations to indicate the dependency relations in the sentence.
addDependencyFeatures: if true, the wrapper will add dependencies features to the Token annotations to indicate the dependency relations in the sentence.

The parser will derive the dependency structures only if at least one of the dependency output options is enabled, so if you do not need the dependency analysis, set both of them to false so the PR will run faster.
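
The effect of the four output switches can be summarised as a small decision function. This is a sketch of the behaviour described above, not the PR's code; the function name and the snake_case argument names are hypothetical stand-ins for the runtime parameters.

```python
def plan_run(add_pos_tags, add_constituent_annotations,
             add_dependency_annotations, add_dependency_features):
    """Return (run_parser, derive_dependencies) for one combination of output switches."""
    # Dependencies are derived only if at least one dependency output is requested.
    derive_dependencies = add_dependency_annotations or add_dependency_features
    # With every switch off there is nothing to produce, so the parser is not run.
    run_parser = add_pos_tags or add_constituent_annotations or derive_dependencies
    return run_parser, derive_dependencies


print(plan_run(True, True, False, False))    # constituency output only
print(plan_run(False, False, False, False))  # nothing requested
```

The first call corresponds to the faster configuration recommended above when the dependency analysis is not needed.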

The following parameters control the Stanford parser’s options for processing dependencies; please refer to the Stanford Dependencies Manual⁴ for details. These parameters are ignored unless at least one of the dependency-related parameters above is true. The default values (Typed and false) correspond to the behaviour of previous versions of this PR.

dependencyMode

One of the following values:

Mode                Equivalent command-line option
Typed               -basic
AllTyped            -nonCollapsed
TypedCollapsed      -collapsed
TypedCCprocessed    -CCprocessed

includeExtraDependencies

This has no effect with the AllTyped mode; for the others, it determines whether to include “extras” such as control dependencies; if they are included, the complete set of dependencies may not follow a tree structure.
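
The two dependency parameters can be combined in a short lookup, shown here as a sketch. The mode names and options are copied from the table above; the function name is hypothetical and the code does not call the Stanford parser.

```python
MODE_TO_OPTION = {
    "Typed": "-basic",
    "AllTyped": "-nonCollapsed",
    "TypedCollapsed": "-collapsed",
    "TypedCCprocessed": "-CCprocessed",
}


def dependency_settings(mode, include_extra_dependencies):
    """Return (equivalent command-line option, whether extras are actually included)."""
    option = MODE_TO_OPTION[mode]
    # includeExtraDependencies has no effect in AllTyped mode.
    extras = include_extra_dependencies and mode != "AllTyped"
    return option, extras


print(dependency_settings("Typed", False))   # the documented defaults
print(dependency_settings("AllTyped", True))
```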

Two sample GATE applications for English are included in the plugins/Parser_Stanford directory: sample_parser_en.gapp runs the Regex Sentence Splitter and ANNIE Tokenizer and then uses this PR to annotate POS tags and constituency and dependency structures, whereas sample_pos+parser_en.gapp also runs the ANNIE POS Tagger and makes the parser re-use its POS tags.

¹shef.nlp.supple.prolog.SICStusProlog exists for backwards compatibility and behaves the same as SICStusProlog3.

²resources/englishPCFG.ser.gz

³resources/english-tag-map.txt

⁴http://nlp.stanford.edu/software/parser-faq.shtml
