TermRaider

TermRaider automatically provides domain-specific term candidates. It is implemented as part of the GATE Web Services plugin in the NeOn toolkit.

TermRaider is an English term extraction tool that produces noun phrase term candidates from a text corpus, together with a statistically derived termhood score. It relies on linguistic pre-processing performed in GATE. First, tokenization and sentence splitting divide the text into manageable units. Part-of-speech tagging and lemmatization then make morpho-syntactic information available to the analysis. TermRaider next selects candidate terms by means of a multi-word-unit grammar that defines the sequences of part-of-speech tags that can constitute noun phrases. Term frequency/inverse document frequency (TF/IDF) [2] [3], a technique widely used in information retrieval and text mining, is then computed. It combines the frequency of a term in a document with the number of documents in the collection that contain the term, and yields a score that indicates the salience of each term candidate for each document in the corpus. All term candidates with a TF/IDF score higher than a manually determined threshold are then selected and presented as an OWL ontology.
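The candidate selection and scoring steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not TermRaider's actual implementation: the tag pattern (a run of adjectives and nouns that ends in a noun, using Penn Treebank tag names), the weighting variant tf * log(N / df), and the function names noun_phrase_candidates, tfidf_scores and select_terms are all hypothetical.

```python
import math
from collections import Counter


def noun_phrase_candidates(tagged):
    """Return candidate terms from a list of (lemma, pos_tag) pairs.

    Assumed grammar (not TermRaider's real one): a maximal run of
    adjectives (JJ*) and nouns (NN*) whose last token is a noun.
    """
    candidates = []
    run = []  # current run of adjective/noun tokens
    # A sentinel token closes any run that reaches the end of the input.
    for lemma, tag in list(tagged) + [("", "END")]:
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((lemma, tag))
            continue
        # The run has ended: drop trailing adjectives so it ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            candidates.append(" ".join(lemma for lemma, _ in run))
        run = []
    return candidates


def tfidf_scores(doc_terms):
    """Map (document, term) to tf * log(N / df).

    doc_terms: dict from document name to its list of candidate terms.
    """
    n_docs = len(doc_terms)
    doc_freq = Counter()
    for terms in doc_terms.values():
        doc_freq.update(set(terms))  # count each term once per document
    scores = {}
    for doc, terms in doc_terms.items():
        for term, count in Counter(terms).items():
            scores[(doc, term)] = count * math.log(n_docs / doc_freq[term])
    return scores


def select_terms(scores, threshold):
    """Keep only the candidates whose score exceeds the chosen threshold."""
    return {key: value for key, value in scores.items() if value > threshold}


if __name__ == "__main__":
    # Hypothetical pre-tagged input; in TermRaider this comes from GATE.
    corpus = {
        "a.txt": [("the", "DT"), ("large", "JJ"), ("corpus", "NN"),
                  ("contain", "VBZ"), ("term", "NN"), ("candidate", "NNS")],
        "b.txt": [("a", "DT"), ("large", "JJ"), ("corpus", "NN")],
    }
    doc_terms = {doc: noun_phrase_candidates(t) for doc, t in corpus.items()}
    for key, score in select_terms(tfidf_scores(doc_terms), 0.5).items():
        print(key, round(score, 3))
```

With this weighting, a candidate that occurs in every document of the corpus scores zero, so only candidates that distinguish one document from the others pass the threshold.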

The ontology has two classes, Token and MultiWord, which contain instances of single nouns and multiword noun phrases respectively. Each instance has at least one FoundInDocument attribute, which contains the name of a source file. In addition, each instance has a single termhood score attribute, tfIdfScore.
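To make this structure concrete, the following Python sketch writes one term instance in Turtle syntax. It is only an illustration: the class and attribute names come from the description above, but the namespace http://example.org/termraider#, the ex: prefix, the way instance names are built from the term, and the function name term_to_turtle are assumptions, and the ontologies that TermRaider actually produces may be serialized differently.

```python
# Hypothetical namespace; the real TermRaider namespace is not given here.
PREFIXES = (
    "@prefix ex: <http://example.org/termraider#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def term_to_turtle(term, score, documents):
    """Serialize one term candidate as a Turtle instance description.

    Assumes the term consists of plain alphabetic words separated by
    single spaces, so that it can be turned into a simple local name.
    """
    # Single nouns are instances of Token, longer phrases of MultiWord.
    cls = "MultiWord" if " " in term else "Token"
    local_name = term.replace(" ", "_")
    lines = [f"ex:{local_name} a ex:{cls} ;"]
    # One FoundInDocument value per source file in which the term occurs.
    for doc in documents:
        lines.append(f'    ex:FoundInDocument "{doc}" ;')
    lines.append(f'    ex:tfIdfScore "{score}"^^xsd:double .')
    return "\n".join(lines)


if __name__ == "__main__":
    print(PREFIXES)
    print(term_to_turtle("term candidate", 1.5, ["a.txt", "b.txt"]))
    print(term_to_turtle("corpus", 0.7, ["a.txt"]))
```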

Example ontologies resulting from TermRaider can be downloaded here and here.

Like all other GATE web service plugins, TermRaider runs over a text corpus whose documents can be in a wide variety of formats: plain text, HTML, SGML, XML, RTF, and most varieties of PDF and Microsoft Word.

References

1. Cunningham, H., Maynard, D., Bontcheva, K. and Tablan, V.: GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02) (2002)

2. Salton, G. and Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing and Management, vol. 24, no. 5, pp. 513-523 (1988)

3. Maynard, D., Li, Y. and Peters, W.: NLP techniques for term extraction and ontology population. In: Buitelaar, P. and Cimiano, P. (eds.), Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 171-199, IOS Press, Amsterdam (2008)