TermRaider is a term extraction tool developed as part of GATE, within the NeOn and Arcomem projects. It produces noun phrase term candidates from a text corpus, together with a statistically derived termhood score.

The original version, developed within NeOn, used only a basic tf.idf score to calculate termhood and was designed for regular kinds of text such as news articles. It did not make use of stopwords and had very limited GUI functionality. The current tool addresses these limitations and has been adapted within the Arcomem project to work additionally on social media and on more degraded forms of text.

TermRaider identifies term candidates (TCs) in the documents and scores them using three different calculations:

All scores are normalized to values between 0 and 100.

GATE's linguistic processing components identify nouns and noun phrases (NPs) as TCs, but exclude any that are contained within, or align exactly with, named entities (NEs) identified by ANNIE. TC identification and exclusion are carried out before scoring. After scoring, TCs whose normalized augmented tf.idf score is at or above 45 are treated as instances of Term. This threshold can be adjusted empirically, and the removal of NEs can be suppressed if necessary.
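
The selection logic described above can be illustrated with a minimal Python sketch. It is not TermRaider's actual implementation, which runs as GATE processing resources. The names `Span`, `is_covered_by`, `select_terms` and the demo scores are all hypothetical, and the `score` callable is only a stand-in for the normalized augmented tf.idf calculation, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    """A text annotation with character offsets (hypothetical structure)."""
    start: int
    end: int
    text: str


def is_covered_by(candidate: Span, entity: Span) -> bool:
    """True if the candidate lies within, or aligns exactly with, the entity."""
    return entity.start <= candidate.start and candidate.end <= entity.end


def select_terms(
    candidates: List[Span],
    entities: List[Span],
    score: Callable[[str], float],
    threshold: float = 45.0,
    remove_entities: bool = True,
) -> List[Span]:
    """Exclude NE-covered candidates first, then keep those scoring >= threshold."""
    if remove_entities:
        candidates = [
            tc for tc in candidates
            if not any(is_covered_by(tc, ne) for ne in entities)
        ]
    # Scoring happens only after exclusion, mirroring the order described above.
    return [tc for tc in candidates if score(tc.text) >= threshold]


# Illustrative data for the sentence "Alan Turing studied machine intelligence".
candidates = [
    Span(0, 11, "Alan Turing"),
    Span(5, 11, "Turing"),
    Span(20, 40, "machine intelligence"),
    Span(28, 40, "intelligence"),
]
entities = [Span(0, 11, "Alan Turing")]

# Made-up normalized scores (0 to 100); a real run would compute these.
demo_scores = {
    "Alan Turing": 90.0,
    "Turing": 88.0,
    "machine intelligence": 72.0,
    "intelligence": 30.0,
}

terms = select_terms(candidates, entities, lambda text: demo_scores[text])
print([t.text for t in terms])
```

With the default settings only "machine intelligence" survives: the two candidates covered by the named entity are dropped before scoring, and "intelligence" falls below the threshold. Passing `remove_entities=False` corresponds to suppressing NE removal, and `threshold` corresponds to the empirically adjustable cut-off.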

A version of TermRaider has also been developed for German, using the German Named Entity Recognition plugin in GATE.