TermRaider is a term extraction tool developed as part of GATE, within the NeOn and Arcomem projects. It produces noun phrase term candidates from a text corpus, together with a statistically derived termhood score.
The original version, developed within NeOn, used only a basic tf.idf score to calculate termhood and was designed to work on regular kinds of text such as news articles. Unlike the current tool, it did not make use of stopwords and had very limited GUI functionality. The current version has been adapted within the Arcomem project to work additionally on social media and on more degraded forms of text.
TermRaider identifies term candidates (TCs) in the documents and scores them using three different calculations:
- basic tf.idf [Buckley88]: (1 + log(tf)) * log(n / df), where tf is the frequency of the TC in the corpus, df is the TC's document frequency (the number of documents in which it occurs), and n is the number of documents in the corpus;
- augmented tf.idf (aug.tf.idf): for each occurrence of a TC, a local augmented tf.idf is computed as that TC's tf.idf score plus the tf.idf scores of all hyponymous TCs found surrounding that occurrence; the TC's augmented tf.idf is the maximum of these local values;
- Kyoto domain relevance score [Bosma2010]: df * (1 + nh), where df is a TC's document frequency and nh is the number of its distinct hyponymous TCs found in the corpus.
All scores are normalized to values between 0 and 100.
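The three calculations above can be sketched in a few lines of Python. This is an illustrative sketch only, not TermRaider's actual implementation: the function names are hypothetical, the natural logarithm is an assumption (the base is not stated here), and the normalization shown, a linear rescaling so that the highest score becomes 100, is likewise an assumption, since the text only says that scores end up between 0 and 100.

```python
import math


def tf_idf(tf, df, n):
    """Basic tf.idf: (1 + log(tf)) * log(n / df).

    tf: frequency of the term candidate in the corpus (>= 1)
    df: number of documents containing it (1 <= df <= n)
    n:  number of documents in the corpus
    The natural logarithm is an assumption.
    """
    return (1 + math.log(tf)) * math.log(n / df)


def augmented_tf_idf(term, occurrences, base):
    """Maximum over occurrences of the local augmented tf.idf.

    occurrences: one list per occurrence of `term`, holding the
                 hyponymous term candidates found around that occurrence
    base:        mapping from term candidate to its basic tf.idf score
    """
    return max(
        base[term] + sum(base[h] for h in nearby)
        for nearby in occurrences
    )


def kyoto(df, nh):
    """Kyoto domain relevance: df * (1 + nh)."""
    return df * (1 + nh)


def normalize(scores):
    """Rescale scores linearly so the top score is 100 (assumed scheme)."""
    top = max(scores.values())
    if top <= 0:
        return {t: 0.0 for t in scores}
    return {t: 100.0 * s / top for t, s in scores.items()}


if __name__ == "__main__":
    n = 10  # toy corpus of 10 documents
    counts = {  # term candidate -> (tf, df), invented toy figures
        "energy": (20, 5),
        "solar energy": (6, 2),
        "wind energy": (3, 1),
    }
    base = {t: tf_idf(tf, df, n) for t, (tf, df) in counts.items()}
    # "energy" occurs three times; hyponymous candidates seen nearby each time
    aug_energy = augmented_tf_idf(
        "energy", [["solar energy"], ["solar energy", "wind energy"], []], base
    )
    print(normalize(base))
    print(aug_energy, kyoto(df=5, nh=2))
```

In this toy setup a general term such as "energy" gains score from the more specific candidates that occur near it, which is the intended effect of both the augmented tf.idf and the Kyoto score.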
GATE's linguistic processing components identify nouns and noun phrases (NPs) as TCs, but exclude any that are contained within or align exactly with named entities (NEs) identified by ANNIE. TC identification and exclusion are carried out before scoring. After scoring, TCs whose normalized augmented tf.idf score is at or above 45 are treated as instances of Term. This threshold can be adjusted empirically, and the removal of NEs can also be suppressed if necessary.
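The filtering and thresholding steps can be illustrated as follows. This is a minimal sketch under stated assumptions, not GATE's API: annotations are modelled as plain character-offset spans, and `Span`, `drop_ne_candidates`, `select_terms`, and the `remove_nes` flag are hypothetical names introduced for the example. Only the default threshold of 45 comes from the description above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A character-offset span; a stand-in for a GATE annotation."""
    start: int
    end: int


def inside_any(tc, nes):
    """True if tc lies within, or aligns exactly with, some NE span."""
    return any(ne.start <= tc.start and tc.end <= ne.end for ne in nes)


def drop_ne_candidates(candidates, nes, remove_nes=True):
    """Remove term candidates covered by named entities.

    remove_nes=False models suppressing the NE removal step.
    """
    if not remove_nes:
        return list(candidates)
    return [tc for tc in candidates if not inside_any(tc, nes)]


def select_terms(normalized_scores, threshold=45.0):
    """Keep candidates whose normalized score is at or above the threshold."""
    return {t for t, s in normalized_scores.items() if s >= threshold}


if __name__ == "__main__":
    nes = [Span(0, 14)]  # one named entity covering offsets 0..14
    candidates = [Span(0, 14), Span(5, 14), Span(10, 18), Span(20, 30)]
    # The first two candidates are covered by the NE; the third only overlaps it.
    print(drop_ne_candidates(candidates, nes))
    print(select_terms({"solar energy": 80.0, "energy": 45.0, "thing": 12.5}))
```

Note that a candidate which merely overlaps a named entity, without being contained in it, is kept in this sketch, following the wording "contained within or align exactly with".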
A version of TermRaider has also been developed for German, using the German Named Entity Recognition plugin in GATE.