Chapter 18
Parsers

18.1 SUPPLE Parser

SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the ‘best’ parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context-free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g. chase(e1), run(e2)) and binary predicates for properties (e.g. lsubj(e1,e2)). Constants (e.g. e1, e2) are used to represent entity and event identifiers.

The GATE SUPPLE Wrapper stores the syntactic information produced by the parser in the GATE document in the form of parse annotations, which contain a bracketed representation of the parse, and semantics annotations, which contain the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see Section 18.1.4).

SUPPLE must be downloaded and installed manually; it does not appear in the GATE Developer plugin manager by default. See the “Building SUPPLE” section below for more details.

18.1.1 Requirements

The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licensed SICStus Prolog (http://www.sics.se/sicstus; SUPPLE supports versions 3 and 4), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.

18.1.2 Building SUPPLE

SUPPLE is not distributed via Maven repositories; it must be downloaded separately. Release versions are available from the GitHub releases page, and the latest snapshot build is available from our snapshot repository. Download the ZIP file of the relevant version and unpack it to create a new directory gateplugin-Parser_SUPPLE-version. Alternatively, you can clone the source code from the GitHub repository.

The binary distribution is ready to run using the PrologCafe interpreter, but the plugin must be rebuilt from source to use SWI or SICStus Prolog. Building from source requires a suitable Java JDK (GATE itself technically requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the SUPPLE distribution and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the directory where SUPPLE was unpacked, and running:

ant swi

For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.

The plugin must be rebuilt following any change to the Prolog sources.

18.1.3 Running the Parser in GATE

The SUPPLE plugin does not appear in the GATE Developer plugin manager by default. To load the plugin, open the plugin manager, click the “+” button at the top left, switch to the “directory URL” tab, and select the plugin directory you unpacked or cloned. This will add the plugin to the known plugins list and you can then select “load now” and/or “load always” as appropriate. Loading the SUPPLE plugin will also load the Tools and ANNIE plugins automatically.

In order to parse a document you will need to construct an application that has:

tokeniser
splitter
POS-tagger
Morphology
SUPPLE Parser with parameters:
    mapping file (config/mapping.config)
    feature table file (config/feature_table.config)
    parser file (supple.plcafe or supple.sicstus or supple.swi)
    prolog implementation (shef.nlp.supple.prolog.PrologCafe, shef.nlp.supple.prolog.SICStusProlog3, shef.nlp.supple.prolog.SICStusProlog4, shef.nlp.supple.prolog.SWIProlog or shef.nlp.supple.prolog.SWIJavaProlog¹)

You can take a look at build.xml to see examples of invocation for the different implementations.

18.1.4 Viewing the Parse Tree

GATE Developer provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded (this should happen automatically when SUPPLE is loaded), then open a document in GATE Developer that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select ‘Edit with syntax tree viewer’. This viewer can also be used with the constituency output of the Stanford Parser PR (Section 18.2).

18.1.5 System Properties

The SICStusProlog (3 and 4) and SWIProlog implementations work by calling the native Prolog executable, passing data back and forth in temporary files. The location of the Prolog executable is specified by a system property:

for SICStus: supple.sicstus.executable - default is to look for sicstus.exe (Windows) or sicstus (other platforms) on the PATH.
for SWI: supple.swi.executable - default is to look for plcon.exe (Windows) or swipl (other platforms) on the PATH.

If your Prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI Prolog is typically installed as pl; most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to Prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.

For details of how to pass system properties to GATE, see the end of Section 2.3.
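
The lookup described in this section can be sketched as follows. This is a minimal illustration and not the wrapper's actual code: the function name, the dictionary layout and the `properties` argument (standing in for the system properties passed to GATE) are assumptions; only the property names and default executable names come from the text above.

```python
import sys

# Default executable names taken from the list above; the dictionary layout
# and the function name are illustrative, not part of SUPPLE.
DEFAULT_EXECUTABLES = {
    "supple.sicstus.executable": {"windows": "sicstus.exe", "other": "sicstus"},
    "supple.swi.executable": {"windows": "plcon.exe", "other": "swipl"},
}


def resolve_prolog_executable(prop, properties, platform=sys.platform):
    """Return the executable name or path that would be looked up for `prop`."""
    configured = properties.get(prop)
    if configured:
        # An explicit property value wins; it may be a bare name or a full path.
        return configured
    key = "windows" if platform.startswith("win") else "other"
    return DEFAULT_EXECUTABLES[prop][key]
```

For instance, calling `resolve_prolog_executable("supple.swi.executable", {"supple.swi.executable": "/opt/swi-prolog/bin/pl"})` returns the configured path, while an empty `properties` dictionary falls back to the platform default.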

18.1.6 Configuration Files

Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file.

Mapping File

The mapping file specifies how annotations produced using GATE are to be passed to the parser. The file is composed of a number of pairs of lines. The first line in a pair specifies a GATE annotation we want to pass to the parser. It includes the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType. The second line of the pair specifies how to encode the GATE annotation in a SUPPLE syntactic category; this line also includes a number of features and values. As an example consider the mapping:

Gate;AnnotationType=Token;category=DT;string=&S
SUPPLE;category=dt;m_root=&S;s_form=&S

It specifies how a determiner (‘DT’) will be translated into a category ‘dt’ for the parser. The construct ‘&S’ is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically, a token like ‘The’ recognised as a DT by the POS tagger will be mapped into the following category:

dt(s_form:'The',m_root:'The',m_affix:'_',text:'_').

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

It specifies that an annotation of type ‘Lookup’ in GATE is mapped into a category ‘list_np’ with specific features and values. More specifically, a token like ‘Mary’ identified in GATE as a Lookup will be mapped into the following SUPPLE category:

list_np(s_form:'Mary',m_root:'_',m_affix:'_',
text:'_',ne_tag:'person',ne_type:'person_first',gender:'female').
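
The matching and variable substitution described above can be sketched in a few lines. This is an illustration under stated assumptions, not SUPPLE's own implementation: the function names are hypothetical, the first field of each line (Gate or SUPPLE) is simply skipped, and an annotation is represented as a type plus a plain feature dictionary.

```python
def parse_mapping_line(line):
    """Split 'Head;key=value;...' into (head, {key: value})."""
    head, *fields = line.strip().split(";")
    return head, dict(field.split("=", 1) for field in fields)


def apply_mapping(gate_line, supple_line, ann_type, features):
    """Return (category, feature dict) if the annotation matches, else None."""
    _, pattern = parse_mapping_line(gate_line)
    _, template = parse_mapping_line(supple_line)
    if pattern.pop("AnnotationType", None) != ann_type:
        return None
    bindings = {}
    for key, expected in pattern.items():
        actual = features.get(key)
        if actual is None:
            return None
        if expected.startswith("&"):
            # A variable such as &S is bound to the annotation's feature value.
            bindings[expected] = actual
        elif expected != actual:
            return None
    category = template.pop("category")
    # Variables in the SUPPLE line are replaced; constants are kept as they are.
    return category, {k: bindings.get(v, v) for k, v in template.items()}
```

Applied to the first mapping pair and a Token with category DT and string ‘The’, this sketch yields the category dt with m_root and s_form both set to ‘The’. The remaining features in the output shown above (m_affix, text) are filled in from the feature table.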

Feature Table

The feature table file specifies SUPPLE ‘lexical’ categories and their features. As an example, an entry in this file is:

n;s_form;m_root;m_affix;text;person;number

which specifies which features a noun category has and the order in which they should be written. In this case:

n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:...).
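
As a rough sketch of how such an entry determines the written form of a category, the following reads feature table lines and renders a category with its features in table order. The function names are hypothetical, and filling unset features with '_' is inferred from the dt and list_np examples in the Mapping File section rather than from the parser's source.

```python
def parse_feature_table(lines):
    """Map each category name to its ordered list of feature names."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        category, *features = line.split(";")
        table[category] = features
    return table


def render_category(category, values, table):
    """Write a category in feature-table order, using '_' for unset features."""
    parts = [f"{name}:'{values.get(name, '_')}'" for name in table[category]]
    return f"{category}({','.join(parts)})."
```

With the entry above, `render_category("n", {"s_form": "dogs", "m_root": "dog"}, table)` would list all six noun features in the order given, with '_' for those not supplied (the values ‘dogs’ and ‘dog’ are illustrative).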

18.1.7 Parser and Grammar

The parser builds a semantic representation compositionally, and a ‘best parse’ algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature-valued grammar. Each Category entry has the form:

Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type (see Section 5.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.

Syntactic rules are specified in Prolog with the predicate rule(LHS,RHS) where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD ⇒ N (‘a basic noun phrase head is composed of a noun’) is written as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N),
[n(m_root:R,number:N)]).

where the feature ‘sem’ is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.
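
To make the role of the variables concrete, here is a small sketch of what this one rule computes for a single noun. It is an interpretation, not a translation of the parser: it reads E^[...] as a semantics that is still waiting for a value of E, models that as a function, and uses plain dictionaries in place of Prolog categories. The function name and the sample values are hypothetical.

```python
def bnp_head_from_noun(noun):
    """Mirror rule(bnp_head(...), [n(...)]) for a single noun category."""
    root = noun["m_root"]      # R in the rule
    number = noun["number"]    # N in the rule
    return {
        "number": number,
        # E^[[R,E],[number,E,N]]: the semantics still awaits a value for E.
        "sem": lambda entity: [[root, entity], ["number", entity, number]],
    }
```

Given a noun with m_root ‘dog’ and number ‘sing’, supplying the identifier e1 for E produces the two predicates [dog, e1] and [number, e1, sing].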

The full grammar of this distribution can be found in the prolog/grammar directory; the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built and the compiled version is used for parsing.

18.1.8 Mapping Named Entities

SUPPLE has a Prolog grammar which deals with named entities; the only information required is the Lookup annotations produced by GATE, which are specified in the mapping file. However, you may want to pass named entities identified with your own JAPE grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between GATE named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:

Gate;AnnotationType=Date;string=&S
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity ‘Date’ into a syntactic category ‘sem_cat’. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule for example:

rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),[
sem_cat(s_form:F,text:TEXT,type:'Date',kind:KIND,name:NAME)]).

is used to parse a ‘Date’ into a named entity in SUPPLE which in turn will be parsed into a noun phrase.

18.2 Stanford Parser

The Stanford Parser is a probabilistic parsing system implemented in Java by Stanford University’s Natural Language Processing Group. Data files are available from Stanford for parsing Arabic, Chinese, English, and German.

This PR (gate.stanford.Parser) acts as a wrapper around the Stanford Parser and translates GATE annotations to and from the data structures of the parser itself. The plugin is supplied with the unmodified jar file and one English data file obtained from Stanford. Stanford’s software itself is subject to the full GPL.

The parser itself can be trained on other corpora and languages, as documented on the website, but this plugin does not provide a means of doing so. Trained data files are not necessarily compatible between different versions of the parser.

The current versions of the Stanford parser and this PR are threadsafe. Multiple instances of the PR with the same or different model files can be used simultaneously.

18.2.1 Input Requirements

Documents to be processed by the Parser PR must already have Sentence and Token annotations, such as those produced by the ANNIE Sentence Splitter (Sections 6.4 and 6.5) and the ANNIE English Tokeniser (Section 6.2).

If the reusePosTags parameter is true, then the Token annotations must have category features with compatible POS tags. The tags produced by the ANNIE POS Tagger are compatible with Stanford’s parser data files for English (which also use the Penn treebank tagset).

18.2.2 Initialization Parameters

parserFile: the path to the trained data file; the default value points to the English data file² included with the GATE distribution. You can also use other files downloaded from the Stanford Parser website or produced by training the parser.
mappingFile: the optional path to a mapping file: a flat, two-column file which the wrapper can use to ‘translate’ tags. A sample file is included.³ By default this value is null and mapping is ignored.
tlppClass: an implementation of TreebankLangParserParams, used by the parser itself to extract the dependency relations from the constituency structures. The default value is compatible with the English data file supplied. Please refer to the Stanford NLP Group’s documentation and the parser’s javadoc for a further explanation.

18.2.3 Runtime Parameters

annotationSetName: the name of the annotationSet used for input (Token and Sentence annotations) and output (SyntaxTreeNode and Dependency annotations, and category and dependencies features added to Tokens).
debug: a boolean value which controls the verbosity of the wrapper’s output.
reusePosTags: if true, the wrapper will read category features (produced by an earlier POS-tagging PR) from the Token annotations and force the parser to use them.
useMapping: if this is true and a mapping file was loaded when the PR was initialized, the POS and syntactic tags produced by the parser will be translated using that file. If no mapping file was loaded, this parameter is ignored.
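
The interaction between mappingFile and useMapping can be sketched as below. Several details are assumptions rather than documented behaviour: that the two columns are separated by whitespace, that lines without exactly two columns are skipped, and that tags missing from the file pass through unchanged. The function names and the sample entry are hypothetical.

```python
def load_tag_map(lines):
    """Read a flat two-column file (assumed whitespace-separated) into a dict."""
    mapping = {}
    for line in lines:
        columns = line.split()
        if len(columns) == 2:
            mapping[columns[0]] = columns[1]
    return mapping


def translate_tag(tag, mapping, use_mapping):
    """Translate a tag only when useMapping is true and a map was loaded."""
    if use_mapping and mapping:
        # Assumption: tags not listed in the file are left unchanged.
        return mapping.get(tag, tag)
    return tag
```

With a hypothetical entry such as `NN noun`, `translate_tag("NN", mapping, True)` gives `noun`, whereas with `use_mapping` false or an empty map the tag is returned as is.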

The following boolean parameters switch on and off the various types of output that the parser can produce. Any or all of them can be true, but if all are false the PR will simply print a warning instead of running the parser, to save time.

addPosTags: if this is true, the wrapper will add category features to the Token annotations.
addConstituentAnnotations: if true, the wrapper will mark the syntactic constituents with SyntaxTreeNode annotations that are compatible with the Syntax Tree Viewer (see Section 18.1.4).
addDependencyAnnotations: if true, the wrapper will add Dependency annotations to indicate the dependency relations in the sentence.
addDependencyFeatures: if true, the wrapper will add dependencies features to the Token annotations to indicate the dependency relations in the sentence.

The parser will derive the dependency structures only if at least one of the dependency output options is enabled, so if you do not need the dependency analysis, set both of them to false so the PR will run faster.
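
The effect of the four output switches can be summarised in a short sketch. The function and key names are hypothetical; the logic only restates the two rules given above (skip the parser when every switch is false, and derive dependencies only when a dependency output is requested).

```python
def plan_parser_run(add_pos_tags, add_constituent_annotations,
                    add_dependency_annotations, add_dependency_features):
    """Decide which work is needed for a given combination of output switches."""
    wants_dependencies = add_dependency_annotations or add_dependency_features
    run_parser = add_pos_tags or add_constituent_annotations or wants_dependencies
    return {
        # False means: print a warning and skip parsing altogether.
        "run_parser": run_parser,
        # Dependency structures are derived only if some output needs them.
        "derive_dependencies": wants_dependencies,
    }
```

For example, requesting only POS tags and constituents gives `{"run_parser": True, "derive_dependencies": False}`, which is the faster configuration mentioned above.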

The following parameters control the Stanford parser’s options for processing dependencies; please refer to the Stanford Dependencies Manual⁴ for details. These parameters are ignored unless at least one of the dependency-related parameters above is true. The default values (Typed and false) correspond to the behaviour of previous versions of this PR.

dependencyMode

One of the following values:

Mode              Equivalent command-line option
Typed             -basic
AllTyped          -nonCollapsed
TypedCollapsed    -collapsed
TypedCCprocessed  -CCprocessed

includeExtraDependencies

This has no effect with the AllTyped mode; for the others, it determines whether to include “extras” such as control dependencies; if they are included, the complete set of dependencies may not follow a tree structure.
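
The two dependency parameters can be restated as a lookup plus one rule. The sketch below uses the mode names and command-line options from the table above; the function name and the returned keys are hypothetical.

```python
# Mode names and equivalent command-line options, as listed in the table above.
DEPENDENCY_MODES = {
    "Typed": "-basic",
    "AllTyped": "-nonCollapsed",
    "TypedCollapsed": "-collapsed",
    "TypedCCprocessed": "-CCprocessed",
}


def describe_dependency_options(mode, include_extra_dependencies):
    """Return the equivalent option and whether the extras setting matters."""
    if mode not in DEPENDENCY_MODES:
        raise ValueError(f"unknown dependencyMode: {mode}")
    return {
        "option": DEPENDENCY_MODES[mode],
        # includeExtraDependencies has no effect in AllTyped mode.
        "extras_included": include_extra_dependencies and mode != "AllTyped",
    }
```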

Two sample GATE applications for English are included in the plugins/Parser_Stanford directory: sample_parser_en.gapp runs the Regex Sentence Splitter and ANNIE Tokenizer and then uses this PR to annotate POS tags and constituency and dependency structures, whereas sample_pos+parser_en.gapp also runs the ANNIE POS Tagger and makes the parser re-use its POS tags.

¹shef.nlp.supple.prolog.SICStusProlog exists for backwards compatibility and behaves the same as SICStusProlog3.

²resources/englishPCFG.ser.gz

³resources/english-tag-map.txt

⁴ http://nlp.stanford.edu/software/parser-faq.shtml
