
RASP2 plugin for GATE
RASP2 modules
The RASP2 plugin for GATE provides 4 Processing Resources, each of which requires the annotation types produced by the previous ones. See http://www.informatics.sussex.ac.uk/research/groups/nlp/rasp/offline-demo.html for a description of the original modules in RASP.
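The dependency between modules can be made explicit with a small ordering check. The following is a minimal sketch in plain Java, not part of the plugin: the class `RaspPipelineOrder` and its `Module` type are hypothetical, and the required and produced annotation types are taken from the parameter descriptions on this page. "Sentence Splitter" stands for any component that creates Sentence annotations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: checks that every module finds the annotation types it needs. */
class RaspPipelineOrder {

    /** Hypothetical description of a module: what it needs and what it creates. */
    static final class Module {
        final String name;
        final List<String> requires;
        final List<String> produces;

        Module(String name, List<String> requires, List<String> produces) {
            this.name = name;
            this.requires = requires;
            this.produces = produces;
        }
    }

    static final Module SPLITTER = new Module("Sentence Splitter",
            Arrays.asList(), Arrays.asList("Sentence"));
    static final Module TOKENIZER = new Module("Tokenizer",
            Arrays.asList("Sentence"), Arrays.asList("Token"));
    static final Module TAGGER = new Module("POS Tagger",
            Arrays.asList("Sentence", "Token"), Arrays.asList("WordForm"));
    static final Module MORPH = new Module("Morphological Analyser",
            Arrays.asList("WordForm"), Arrays.asList());
    static final Module PARSER = new Module("Parser",
            Arrays.asList("WordForm", "Sentence"), Arrays.asList("Dependency"));

    /** Returns one message per unmet requirement; an empty list means the order is valid. */
    static List<String> validate(List<Module> pipeline) {
        Set<String> available = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (Module m : pipeline) {
            for (String type : m.requires) {
                if (!available.contains(type)) {
                    problems.add(m.name + " needs " + type);
                }
            }
            available.addAll(m.produces);
        }
        return problems;
    }

    public static void main(String[] args) {
        // A complete pipeline: no problems reported.
        System.out.println(validate(Arrays.asList(SPLITTER, TOKENIZER, TAGGER, MORPH, PARSER)));
        // Tokenizer placed first: the missing Sentence annotations are reported.
        System.out.println(validate(Arrays.asList(TOKENIZER, TAGGER)));
    }
}
```

A check of this kind only mirrors the documentation; inside GATE the modules simply find no input annotations when the order is wrong.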
Tokenizer
Creates annotations of type Token, using the information about Sentences. A Token is a simple annotation with a single feature, 'string' (the default GATE tokenizer records more information). Unlike its GATE equivalent, the Tokenizer requires annotations of type Sentence and therefore cannot be the first element of a pipeline. The Toolbox on the resources pages of the DigitalPebble website contains a Sentence Splitter. The original GATE tokenizer can be used instead of the RASP one.
Runtime Parameters
inputASName | AnnotationSet where the Sentences are taken from |
outputASName | AnnotationSet where the Tokens are generated |
debug | Keeps the file returned by the Tokenizer in the /tmp directory |
charset | Specifies the charset to use for the communication with the Tokenizer |
POS Tagger
The Part of Speech Tagger generates annotations of type WordForm. This separation between Tokens and WordForms is based on the MAF ISO proposal. The tagset is close to CLAWS C7 (see e.g. Appendix C of Jurafsky, D. and Martin, J., Speech and Language Processing, Prentice-Hall, 2000 for more details), although it is in fact a cut-down version of the CLAWS C2 tagset.
Each WordForm gets a POS attribute, which is a simple String, and a probability.
Init Parameter
raspHome | Directory where RASP is installed |
Runtime Parameters
inputASName | AnnotationSet where Sentences and Tokens are taken from |
outputASName | AnnotationSet where the WordForms are generated |
debug | Keeps the file returned by the POSTagger in the /tmp directory |
charset | Specifies the charset to use for the communication with the POSTagger |
generateMultipleTags | Generates one or more WordForms per Token |
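When generateMultipleTags produces several WordForms for one Token, downstream code may want only the most probable one. The following is a minimal sketch of that selection in plain Java. The `WordForm` class here is a hypothetical stand-in for the annotation, not the plugin's own type, and the tags and probabilities in `main` are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: picks the most probable of several WordForm candidates for one Token. */
class WordFormSketch {

    /** Hypothetical stand-in for a WordForm annotation: a POS tag and its probability. */
    static final class WordForm {
        final String pos;
        final double probability;

        WordForm(String pos, double probability) {
            this.pos = pos;
            this.probability = probability;
        }
    }

    /** Returns the candidate with the highest probability; rejects an empty list. */
    static WordForm mostProbable(List<WordForm> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble((WordForm w) -> w.probability))
                .orElseThrow(() -> new IllegalArgumentException("no WordForm candidates"));
    }

    public static void main(String[] args) {
        // Invented candidates for a single ambiguous token.
        List<WordForm> candidates = Arrays.asList(
                new WordForm("NN1", 0.7),
                new WordForm("VV0", 0.3));
        System.out.println(mostProbable(candidates).pos);
    }
}
```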
POS Converter
The Part of Speech Converter provides functionality similar to the POS Tagger, but instead of getting the POS tags from the original RASP component, it converts the tags generated by the default GATE POS tagger (Penn Treebank tagset) into the tagset used by RASP. This speeds up processing at the cost of possible inaccuracies in the conversion.
Runtime Parameters
inputASName | AnnotationSet where WordForms are taken from |
outputASName | AnnotationSet where the WordForms are generated |
grammarURL | URL to the JAPE grammar file |
encoding | The encoding used for reading the grammar |
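To illustrate where conversion inaccuracies can come from, here is a minimal table-driven sketch in plain Java. It is not the JAPE grammar shipped with the plugin. The class `PennToClawsSketch` is hypothetical, and the CLAWS-style tags on the right-hand side are assumptions based on the usual CLAWS naming, so they should be checked against the RASP tagset documentation before any real use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch: converts a few Penn Treebank tags into CLAWS-style tags by table lookup. */
class PennToClawsSketch {

    // Illustrative subset only; the target tags are assumptions, not the plugin's mapping.
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("NN", "NN1");   // singular common noun
        TABLE.put("NNS", "NN2");  // plural common noun
        TABLE.put("VB", "VV0");   // base form of a verb
        TABLE.put("VBP", "VV0");  // non-3rd-person present, collapses onto the same tag
        TABLE.put("VBD", "VVD");  // past tense
        TABLE.put("VBG", "VVG");  // -ing form
        TABLE.put("VBN", "VVN");  // past participle
        TABLE.put("VBZ", "VVZ");  // 3rd-person singular present
        TABLE.put("JJ", "JJ");    // adjective
    }

    /** Returns the converted tag, or empty when a plain lookup cannot decide. */
    static Optional<String> convert(String pennTag) {
        return Optional.ofNullable(TABLE.get(pennTag));
    }

    public static void main(String[] args) {
        System.out.println(convert("NNS"));
        // Two different source tags end up with the same target tag.
        System.out.println(convert("VB").equals(convert("VBP")));
        // A tag with no one-to-one counterpart in this table yields an empty result.
        System.out.println(convert("DT"));
    }
}
```

A lookup keyed on the tag alone cannot use the word itself, which is one plausible source of inaccuracy: CLAWS-style tagsets give forms of be, have and do their own verb tags, whereas the Penn Treebank tagset does not.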
Morphological Analyser
The tagger output is then lemmatized, based on the tags assigned to word tokens. See Briscoe and Carroll (2002) for further details and a reference to a detailed paper describing this module. The Morphological Analyser adds a lemma attribute to the WordForms found in the input AnnotationSet.
Init Parameter
raspHome | Directory where RASP is installed |
Runtime Parameters
inputASName | AnnotationSet where WordForms are taken from |
debug | Keeps the file returned by the Morpher in the /tmp directory |
charset | Specifies the charset to use for the communication with the Morpher |
Parser
The probabilistic parser analyses the PoS tag sequence, or a chart of the initial more probable tags, and generates a parse forest representation containing all possible subanalyses with associated probabilities. From this representation it is able to construct the n-best (weighted) grammatical relations.
The parser generates annotations of type Dependency. Dependencies have a type and subtype and link to two WordForms as head and dependency. This implementation does not generate annotations for Clauses.
Note that the Parser requires a recent machine with at least 1.5 GB of RAM.
Init Parameter
raspHome | Directory where RASP is installed |
Runtime Parameters
inputASName | AnnotationSet where WordForms and Sentences are taken from |
outputASName | AnnotationSet where the Dependencies are generated |
debug | Keeps the file returned by the Parser in the /tmp directory |
charset | Specifies the charset to use for the communication with the Parser |
subcategorisation | Turns subcategorisation on or off |
phrasalVerbs | Turns the use of phrasal verbs on or off |
outputFormat | Format returned by the Parser. Allowed values are "-og", "-ogio" and "-ogw". See the RASP documentation for more information. |
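The runtime parameters above can be collected and checked before they are handed to the Parser. The following is a minimal sketch in plain Java: the class `ParserParamsSketch` and the AnnotationSet name "rasp" are hypothetical, and the Boolean type used for subcategorisation and phrasalVerbs is an assumption, since this page only says that they turn features on or off.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: builds a map of Parser runtime parameters and validates outputFormat. */
class ParserParamsSketch {

    // The three values listed for outputFormat on this page.
    static final List<String> OUTPUT_FORMATS = Arrays.asList("-og", "-ogio", "-ogw");

    static Map<String, Object> build(String inputAS, String outputAS, String outputFormat) {
        if (!OUTPUT_FORMATS.contains(outputFormat)) {
            throw new IllegalArgumentException("outputFormat must be one of " + OUTPUT_FORMATS);
        }
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("inputASName", inputAS);
        params.put("outputASName", outputAS);
        params.put("outputFormat", outputFormat);
        // Assumed value types: the page does not state how these switches are typed.
        params.put("subcategorisation", Boolean.TRUE);
        params.put("phrasalVerbs", Boolean.FALSE);
        return params;
    }

    public static void main(String[] args) {
        // "rasp" is a made-up AnnotationSet name.
        System.out.println(build("rasp", "rasp", "-og"));
    }
}
```

In GATE Embedded, each entry of such a map would typically be applied to the loaded Processing Resource with `setParameterValue(name, value)`.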