
RASP2 plugin for GATE 


RASP2 modules

The RASP2 plugin for GATE provides the Processing Resources described below; each one requires the annotation types produced by the preceding ones.
A description of the original RASP modules is available at http://www.informatics.sussex.ac.uk/research/groups/nlp/rasp/offline-demo.html.
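
The ordering constraint between resources can be pictured as a simple check. What follows is a minimal sketch in Python, not the GATE API: the `REQUIRES` and `PRODUCES` tables are assumptions read off the module descriptions on this page, the resource names are plain labels rather than GATE class names, and `check_pipeline` is a hypothetical helper.

```python
# Annotation types each resource needs and creates, as described on this page.
# Names are illustrative labels, not GATE class names.
REQUIRES = {
    "Tokenizer": {"Sentence"},
    "POS Tagger": {"Sentence", "Token"},
    "Morphological Analyser": {"WordForm"},
    "Parser": {"Sentence", "WordForm"},
}
PRODUCES = {
    "Tokenizer": {"Token"},
    "POS Tagger": {"WordForm"},
    "Morphological Analyser": set(),  # only adds a lemma to existing WordForms
    "Parser": {"Dependency"},
}


def check_pipeline(resources, available):
    """Return (resource, missing annotation types) pairs for a proposed ordering."""
    available = set(available)
    problems = []
    for name in resources:
        missing = REQUIRES[name] - available
        if missing:
            problems.append((name, sorted(missing)))
        # Whatever this resource creates becomes available to later ones.
        available |= PRODUCES[name]
    return problems


# A sentence splitter has already run, so Sentence annotations exist.
ok = check_pipeline(
    ["Tokenizer", "POS Tagger", "Morphological Analyser", "Parser"],
    {"Sentence"},
)
# No sentence splitter: both resources report the missing Sentence type.
bad = check_pipeline(["Tokenizer", "POS Tagger"], set())
```

In this sketch an empty result means every resource finds its inputs; a non-empty result names the first thing to fix, typically a missing sentence splitter at the front of the pipeline.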

Tokenizer

Creates annotations of type Token from the Sentence annotations. A Token is a simple annotation with a single feature, 'string' (the default GATE tokenizer records more information). Unlike its GATE equivalent, this Tokenizer requires annotations of type Sentence and therefore cannot be the first element of a pipeline. The Toolbox available from the resources pages of the DigitalPebble website contains a Sentence Splitter. The original GATE tokenizer can be used instead of the RASP one.

Runtime Parameters

inputASName: AnnotationSet where the Sentences are taken from
outputASName: AnnotationSet where the Tokens are generated
debug: Keeps the file returned by the Tokenizer in the /tmp directory
charset: Specifies the charset to use for the communication with the Tokenizer

POS Tagger

The Part of Speech Tagger generates annotations of type WordForms. The separation between Tokens and WordForms is based on the MAF ISO proposal.
The tagset is close to CLAWS C7 (see e.g. Appendix C of Jurafsky, D. and Martin, J., Speech and Language Processing, Prentice-Hall, 2000, for more details), although it is in fact a cut-down version of the CLAWS C2 tagset.

Each WordForm has a POS attribute, which is a simple String, and a probability.

Init Parameter

raspHome: Directory where RASP is installed

Runtime Parameters

inputASName: AnnotationSet where Sentences and Tokens are taken from
outputASName: AnnotationSet where the WordForms are generated
debug: Keeps the file returned by the POSTagger in the /tmp directory
charset: Specifies the charset to use for the communication with the POSTagger
generateMultipleTags: Generates one or more WordForms per Token
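
The effect of generateMultipleTags can be illustrated with a small Python sketch. It is not the plugin's code: the `WordForm` class and `word_forms` function are hypothetical, the probabilities are made up, and keeping only the most probable tag when the option is off is an assumption about the plugin's behaviour.

```python
from dataclasses import dataclass


@dataclass
class WordForm:
    token: str          # surface string of the Token this WordForm belongs to
    pos: str            # POS tag, a simple string
    probability: float  # probability attached to that tag


def word_forms(token, candidates, generate_multiple_tags):
    """Build WordForms for one token from (tag, probability) candidates."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not generate_multiple_tags:
        # Assumption: with the option off, only the best tag is kept.
        ranked = ranked[:1]
    return [WordForm(token, tag, prob) for tag, prob in ranked]


# Made-up candidate tags for the ambiguous token "saw".
candidates = [("NN1", 0.3), ("VVD", 0.7)]
single = word_forms("saw", candidates, generate_multiple_tags=False)
multiple = word_forms("saw", candidates, generate_multiple_tags=True)
```

Here `single` holds one WordForm and `multiple` holds one per candidate tag, which is the "one or more WordForms per Token" case.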

POS Converter

The Part of Speech Converter provides functionality similar to that of the POS Tagger, but instead of obtaining the POS tags from the original RASP component, it converts the tags generated by the default GATE POS tagger (Penn Treebank tagset) into the tagset used by RASP. This speeds up processing at the cost of possible inaccuracies in the conversion.

Runtime Parameters

inputASName: AnnotationSet where WordForms are taken from
outputASName: AnnotationSet where the WordForms are generated
grammarURL: URL to the JAPE grammar file
encoding: The encoding used for reading the grammar
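
To see where conversion inaccuracies can come from, consider a tag-to-tag lookup. This is a Python sketch under stated assumptions: the `PENN_TO_CLAWS` table is a small illustrative subset written for this example and is not the mapping contained in the plugin's JAPE grammar, and `convert_tag` is a hypothetical helper.

```python
# Illustrative subset only; the plugin's real mapping lives in its JAPE grammar.
PENN_TO_CLAWS = {
    "NN": "NN1",   # singular common noun
    "NNS": "NN2",  # plural common noun
    "NNP": "NP1",  # singular proper noun
    "JJ": "JJ",    # adjective
    "VB": "VV0",   # base form of a verb
    "VBD": "VVD",  # past tense
    "VBG": "VVG",  # -ing form
    "VBZ": "VVZ",  # third person singular present
}


def convert_tag(penn_tag):
    """Map a Penn Treebank tag to a CLAWS-style tag, or None if unmapped."""
    return PENN_TO_CLAWS.get(penn_tag)


tagged = [("cats", "NNS"), ("slept", "VBD"), ("was", "VBD")]
converted = [(word, convert_tag(tag)) for word, tag in tagged]
```

In this sketch both "slept" and "was" come out as VVD, because the Penn Treebank tag VBD does not separate them. CLAWS-style tagsets give forms of "be" their own tags (VBDZ for "was" in C7), so a lookup on the tag alone cannot recover that distinction.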

Morphological Analyser

The tagger output is then lemmatized, based on the tags assigned to word tokens. See Briscoe and Carroll (2002) for further details and a reference to a detailed paper describing this module. The Morphological Analyser adds a lemma attribute to the WordForms found in the input AnnotationSet.
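
The reason lemmatization depends on the tag can be shown with a toy Python example. It is only a sketch: the `LEMMAS` table is a hypothetical hand-written lookup used for this illustration, the dictionary keys `string`, `pos` and `lemma` are stand-ins for annotation features, and none of it reflects how the RASP module is implemented.

```python
# Hypothetical lookup keyed on (lower-cased word, POS tag), for illustration only.
LEMMAS = {
    ("saw", "VVD"): "see",  # past tense verb
    ("saw", "NN1"): "saw",  # singular noun
    ("cats", "NN2"): "cat",
}


def add_lemma(word_form):
    """Add a 'lemma' entry to a WordForm-like dict, using its tag to disambiguate."""
    word = word_form["string"].lower()
    # Fall back to the word itself when the pair is not in the toy table.
    word_form["lemma"] = LEMMAS.get((word, word_form["pos"]), word)
    return word_form


verb = add_lemma({"string": "saw", "pos": "VVD"})
noun = add_lemma({"string": "saw", "pos": "NN1"})
```

The same surface string receives a different lemma depending on its tag, which is why this step runs after tagging.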

Init Parameter

raspHome: Directory where RASP is installed

Runtime Parameters

inputASName: AnnotationSet where WordForms are taken from
debug: Keeps the file returned by the Morpher in the /tmp directory
charset: Specifies the charset to use for the communication with the Morpher

Parser

The probabilistic parser analyses the PoS tag sequence (or a chart of the initially more probable tags) and generates a parse forest representation containing all possible subanalyses with their associated probabilities. From this representation it can construct the n-best (weighted) grammatical relations.

The parser generates annotations of type Dependency. A Dependency has a type and a subtype, and links two WordForms as head and dependent. This implementation does not generate annotations for Clauses.

Note that the Parser requires a recent machine with at least 1.5 GB of RAM.
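
The shape of a Dependency can be sketched as a small data structure. This is a Python illustration under assumptions, not the plugin's representation: the class and field names are hypothetical, and the relations for the example sentence are written by hand rather than taken from parser output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordForm:
    string: str
    pos: str


@dataclass(frozen=True)
class Dependency:
    type: str               # relation name
    subtype: Optional[str]  # empty when the relation has no subtype
    head: WordForm
    dependent: WordForm


def dependents_of(head, dependencies):
    """Return the WordForms linked to the given head."""
    return [d.dependent for d in dependencies if d.head == head]


# Hand-written relations for "The cat sleeps" (not real parser output).
the = WordForm("The", "AT")
cat = WordForm("cat", "NN1")
sleeps = WordForm("sleeps", "VVZ")
deps = [
    Dependency("ncsubj", None, head=sleeps, dependent=cat),
    Dependency("det", None, head=cat, dependent=the),
]
subject_side = dependents_of(sleeps, deps)
```

Walking from a head to its dependents in this way is the typical use of such annotations, for example to find the subject of a verb.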

Init Parameter

raspHome: Directory where RASP is installed

Runtime Parameters

inputASName: AnnotationSet where WordForms and Sentences are taken from
outputASName: AnnotationSet where the Dependencies are generated
debug: Keeps the file returned by the Parser in the /tmp directory
charset: Specifies the charset to use for the communication with the Parser
subcategorisation: Turns subcategorisation on or off
phrasalVerbs: Turns the use of phrasal verbs on or off
outputFormat: Format returned by the Parser. Authorized values are "-og", "-ogio" and "-ogw". See the RASP documentation for more information.