Chapter 18
Parsers [#]
18.1 RASP Parser [#]
RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English, developed by the Natural Language and Computational Linguistics group at the University of Sussex.
This plugin, ‘Parser_RASP’, developed by DigitalPebble, provides four wrapper PRs that call the RASP modules as external programs, as well as a JAPE component that translates the output of the ANNIE POS Tagger (Section 6.6).
- RASP2 Tokenizer
- This PR requires Sentence annotations and creates Token annotations with a string feature. Note that sentence-splitting must be carried out before tokenization; the the RegEx Sentence Splitter (see Section 6.5) is suitable for this. (Alternatively, you can use the ANNIE Tokenizer (Section 6.2) and then the ANNIE Sentence Splitter (Section 6.4); their output is compatible with the other PRs in this plugin).
- RASP2 POS Tagger
- This requires Token annotations and creates WordForm annotations with pos, probability, and string features.
- RASP2 Morphological Analyser
- This requires WordForm annotations (from the POS Tagger) and adds lemma and suffix features.
- RASP2 Parser
- This requires the preceding annotation types and creates multiple Dependency annotations to represent a parse of each sentence.
- RASP POS Converter
- This PR requires Token annotations with a category feature as produced by the ANNIE POS Tagger (see Section 6.6 and creates WordForm annotations in the RASP Format. The ANNIE POS Tagger and this Converter can together be used as a substitute for the RASP2 POS Tagger.
Here are some examples of corpus pipelines that can be correctly constructed with these PRs.
- RegEx Sentence Splitter
- RASP2 Tokenizer
- RASP2 POS Tagger
- RASP2 Morphological Analyser
- RASP2 Parser
- RegEx Sentence Splitter
- RASP2 Tokenizer
- ANNIE POS Tagger
- RASP POS Converter
- RASP2 Morphological Analyser
- RASP2 Parser
- ANNIE Tokenizer
- ANNIE Sentence Splitter
- RASP2 POS Tagger
- RASP2 Morphological Analyser
- RASP2 Parser
- ANNIE Tokenizer
- ANNIE Sentence Splitter
- ANNIE POS Tagger
- RASP POS Converter
- RASP2 Morphological Analyser
- RASP2 Parser
Further documentation is included in the directory gate/plugins/Parser\_RASP/doc/.
The RASP package, which provides the external programs, is available from the RASP web page.
RASP is only supported for Linux operating systems. Trying to run it on any other operating systems will generate an exception with the message: ‘The RASP cannot be run on any other operating systems except Linux.’
It must be correctly installed on the same machine as GATE, and must be installed in a directory whose path does not contain any spaces (this is a requirement of the RASP scripts as well as the wrapper). Before trying to run scripts for the first time, edit rasp.sh and rasp_parse.sh to set the correct value for the shell variable RASP, which should be the file system pathname where you have installed the RASP tools (for example, RASP=/opt/RASP or RASP=/usr/local/RASP. You will need to enter the same path for the initialization parameter raspHome for the POS Tagger, Morphological Analyser, and Parser PRs.
(On some systems the arch command used in the scripts is not available; a work-around is to comment that line out and add arch=’ix86_linux’, for example.)
(The previous version of the RASP plugin can now be found in plugins/Obsolete/rasp.)
18.2 SUPPLE Parser [#]
SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the ‘best’ parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g. lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE Wrapper stores syntactic information produced by the parser in the gate document in the form of parse annotations containing a bracketed representation of the parse; and semantics annotations that contains the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see Section 18.2.4).
18.2.1 Requirements
The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licenced SICStus prolog (http://www.sics.se/sicstus, SUPPLE supports versions 3 and 4), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.
18.2.2 Building SUPPLE
The SUPPLE plugin must be compiled before it can be used, so you will require a suitable Java SDK (GATE itself requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the Parser_SUPPLE directory under plugins, and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the Parser_SUPPLE directory and running:
For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.
18.2.3 Running the Parser in GATE
In order to parse a document you will need to construct an application that has:
- tokeniser
- splitter
- POS-tagger
- Morphology
- SUPPLE Parser with parameters
mapping file (config/mapping.config)
feature table file (config/feature_table.config)
parser file (supple.plcafe or supple.sicstus or supple.swi)
prolog implementation (shef.nlp.supple.prolog.PrologCafe,
shef.nlp.supple.prolog.SICStusProlog3, shef.nlp.supple.prolog.SICStusProlog4,
shef.nlp.supple.prolog.SWIProlog or shef.nlp.supple.prolog.SWIJavaProlog1).You can take a look at build.xml to see examples of invocation for the different implementations.
Note that prior to GATE 3.1, the parser file parameter was of type java.io.File. From 3.1 it is of type java.net.URL. If you have a saved application (.gapp file) from before GATE 3.1 which includes SUPPLE it will need to be updated to work with the new version. Instructions on how to do this can be found in the README file in the SUPPLE plugin directory.
18.2.4 Viewing the Parse Tree [#]
GATE Developer provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded, then open a document in GATE Developer that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select ‘Edit with syntax tree viewer’. This viewer can also be used with the constituency output of the Stanford Parser PR (Section 18.3).
18.2.5 System Properties [#]
The SICStusProlog (3 and 4) and SWIProlog implementations work by calling the native prolog executable, passing data back and forth in temporary files. The location of the prolog executable is specified by a system property:
- for SICStus: supple.sicstus.executable - default is to look for sicstus.exe (Windows) or sicstus (other platforms) on the PATH.
- for SWI: supple.swi.executable - default is to look for plcon.exe (Windows) or swipl (other platforms) on the PATH.
If your prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI prolog is typically installed as pl, most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.
For details of how to pass system properties to GATE, see the end of Section 2.3.
18.2.6 Configuration Files [#]
Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the
feature table file.
Mapping File
The mapping file specifies how annotations produced using GATE are to be passed to the parser. The file is composed of a number of pairs of lines, the first line in a pair specifies a GATE annotation we want to pass to the parser. It includes the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType. The second line of the pair specifies how to encode the GATE annotation in a SUPPLE syntactic category, this line also includes a number of features and values. As an example consider the mapping:
SUPPLE;category=dt;m_root=&S;s_form=&S
It specifies how a determinant (’DT’) will be translated into a category ‘dt’ for the parser. The construct ‘&S’ is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically a token like ‘The’ recognised as a DT by the POS-tagging will be mapped into the following category:
As another example consider the mapping:
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female
It specified that an annotation of type ‘Lookup’ in GATE is mapped into a category ‘list_np’ with specific features and values. More specifically a token like ‘Mary’ identified in GATE as a Lookup will be mapped into the following SUPPLE category:
text:’_’,ne_tag:’person’,ne_type:’person_first’,gender:’female’).
Feature Table [#]
The feature table file specifies SUPPLE ‘lexical’ categories and its features. As an example an entry in this file is:
which specifies which features and in which order a noun category should be written. In this case:
18.2.7 Parser and Grammar [#]
The parser builds a semantic representation compositionally, and a ‘best parse’ algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature valued grammar. Each Category entry has the form:
where the number and type of features is dependent on the category type (see Section 5.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.
Syntactic rules are specified in Prolog with the predicate rule(LHS,RHS) where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD ⇒ N (‘a basic noun phrase head is composed of a noun’) is written as follows:
[n(m_root:R,number:N)]).
where the feature ‘sem’ is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.
The full grammar of this distribution can be found in the prolog/grammar directory, the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built and the compiled version is used for parsing.
18.2.8 Mapping Named Entities
SUPPLE has a prolog grammar which deals with named entities, the only information required is the Lookup annotations produced by Gate, which are specified in the mapping file. However, you may want to pass named entities identified with your own Jape grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between Gate named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S
which maps a named entity ‘Date’ into a syntactic category ’sem_cat’. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule for example:
sem_cat(s_form:F,text:TEXT,type:’Date’,kind:KIND,name:NAME)]).
is used to parse a ‘Date’ into a named entity in SUPPLE which in turn will be parsed into a noun phrase.
18.2.9 Upgrading from BuChart to SUPPLE
In theory upgrading from BuChart to SUPPLE should be relatively straightforward. Basically any instance of BuChart needs to be replaced by SUPPLE. Specific changes which must be made are:
- The compiled parser files are now supple.swi, supple.sicstus, or supple.plcafe
- The GATE wrapper parameter buchartFile is now SUPPLEFile, and it is now of type java.net.URL rather than java.io.File. Details of how to compensate for this in existing saved applications are given in the SUPPLE README file.
- The Prolog wrappers now start shef.nlp.supple.prolog instead of shef.nlp.buchart.prolog
- The mapping.conf file now has lines starting SUPPLE; instead of Buchart;
- Most importantly the main wrapper class is now called nlp.shef.supple.SUPPLE
Making these changes to existing code should be trivial and allow application to benefit from future improvements to SUPPLE.
18.3 Stanford Parser [#]
The Stanford Parser is a probabilistic parsing system implemented in Java by Stanford University’s Natural Language Processing Group. Data files are available from Stanford for parsing Arabic, Chinese, English, and German.
This PR (gate.stanford.Parser) acts as a wrapper around the Stanford Parser and translates GATE annotations to and from the data structures of the parser itself. The plugin is supplied with the unmodified jar file and one English data file obtained from Stanford. Stanford’s software itself is subject to the full GPL.
The parser itself can be trained on other corpora and languages, as documented on the website, but this plugin does not provide a means of doing so. Trained data files are not necessarily compatible between different versions of the parser.
The current versions of the Stanford parser and this PR are threadsafe. Multiple instances of the PR with the same or different model files can be used simultaneously.
18.3.1 Input Requirements
Documents to be processed by the Parser PR must already have Sentence and Token annotations, such as those produced by either ANNIE Sentence Splitter (Sections 6.4 and 6.5) and the ANNIE English Tokeniser (Section 6.2).
If the reusePosTags parameter is true, then the Token annotations must have category features with compatible POS tags. The tags produced by the ANNIE POS Tagger are compatible with Stanford’s parser data files for English (which also use the Penn treebank tagset).
18.3.2 Initialization Parameters
- parserFile
- the path to the trained data file; the default value points to the English data file2 included with the GATE distribution. You can also use other files downloaded from the Stanford Parser website or produced by training the parser.
- mappingFile
- the optional path to a mapping file: a flat, two-column file which the wrapper can use to ‘translate’ tags. A sample file is included.3 By default this value is null and mapping is ignored.
- tlppClass
- an implementation of TreebankLangParserParams, used by the parser itself to extract the dependency relations from the constituency structures. The default value is compatible with the English data file supplied. Please refer to the Stanford NLP Group’s documentation and the parser’s javadoc for a further explanation.
18.3.3 Runtime Parameters
- annotationSetName
- the name of the annotationSet used for input (Token and Sentence annotations) and output (SyntaxTreeNode and Dependency annotations, and category and dependencies features added to Tokens).
- debug
- a boolean value which controls the verbosity of the wrapper’s output.
- reusePosTags
- if true, the wrapper will read category features (produced by an earlier POS-tagging PR) from the Token annotations and force the parser to use them.
- useMapping
- if this is true and a mapping file was loaded when the PR was initialized, the POS and syntactic tags produced by the parser will be translated using that file. If no mapping file was loaded, this parameter is ignored.
The following boolean parameters switch on and off the various types of output that the parser can produce. Any or all of them can be true, but if all are false the PR will simply print a warning to save time (instead of running the parser).
- addPosTags
- if this is true, the wrapper will add category features to the Token annotations.
- addConstituentAnnotations
- if true, the wrapper will mark the syntactic constituents with SyntaxTreeNode annotations that are compatible with the Syntax Tree Viewer (see Section 18.2.4).
- addDependencyAnnotations
- if true, the wrapper will add Dependency annotations to indicate the dependency relations in the sentence.
- addDependencyFeatures
- if true, the wrapper will add dependencies features to the Token annotations to indicate the dependency relations in the sentence.
The parser will derive the dependency structures only if at least one of the dependency output options is enabled, so if you do not need the dependency analysis, set both of them to false so the PR will run faster.
The following parameters control the Stanford parser’s options for processing dependencies; please refer to the Stanford Dependencies Manual4 for details. These parameters are ignored unless at least one of the dependency-related parameters above is true. The default values (Typed and false) correspond to the behaviour of previous version of this PR.
- dependencyMode
- One of the following values:
Mode equivalent command-line option Typed -basic AllTyped -nonCollapsed TypedCollapsed -collapsed TypedCCprocessed -CCprocessed - includeExtraDependencies
- This has no effect with the AllTyped mode; for the others, it determines whether to include “extras” such as control dependencies; if they are included, the complete set of dependencies may not follow a tree structure.
Two sample GATE applications for English are included in the plugins/Parser_Stanford directory: sample_parser_en.gapp runs the Regex Sentence Splitter and ANNIE Tokenizer and then uses this PR to annotate POS tags and constituency and dependency structures, whereas sample_pos+parser_en.gapp also runs the ANNIE POS Tagger and makes the parser re-use its POS tags.
1shef.nlp.supple.prolog.SICStusProlog exists for backwards compatibility and behaves the same as SICStusProlog3.
2resources/englishPCFG.ser.gz
3resources/english-tag-map.txt