TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

1. Introduction

Natural language processing (NLP) over social media text is hard. Content is often brief, error-prone, lacking in context, and uncurated, which makes it very different from the well-formed news text that NLP tools are typically built for.

TwitIE is a GATE pipeline for Information Extraction over tweets, one of the noisiest forms of social media text.

2. Resources provided

TwitIE is a full GATE pipeline, containing the following customised components:

Language identification for social media data;
Twitter tokeniser, for handling smileys, user names, URLs and so on;
Twitter part-of-speech tagger, which is also available on its own;
Text normalisation PR (processing resource).
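
To give an idea of what a Twitter-aware tokeniser has to cope with, here is a minimal Python sketch that keeps URLs, user names and a handful of smileys together as single tokens. It is an illustration only, not the TwitIE tokeniser (which is a GATE component); the pattern, the smiley inventory, the hashtag handling and the function name tokenise are all assumptions made for this example.

```python
import re

# Illustrative pattern only: the alternatives and their order are assumptions,
# not the rules used by TwitIE. Earlier alternatives win over later ones.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+              # URLs, kept whole
  | [@#]\w+                   # user names (@name) and hashtags (#tag)
  | [:;=][-^']?[)(DPp/\\|]    # a few western-style smileys such as :-) or ;P
  | \w+(?:'\w+)?              # ordinary words, with an optional apostrophe part
  | [^\w\s]                   # any other single punctuation character
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split a tweet into tokens without breaking up URLs, user names or smileys."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenise("@gate loves tweets :-) http://gate.ac.uk #nlp"))
```

A general-purpose tokeniser would typically split ":-)" into three punctuation tokens and cut the URL at its punctuation, which is why tweets benefit from a dedicated tokeniser.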

TwitIE is available as part of the GATE Twitter plugin (release 8.1 onwards, in SVN or via a nightly build of GATE).

To use TwitIE, load both the Twitter and Tagger_Stanford plugins, and then choose TwitIE from the Ready Made Applications menu.

Support is available via the gate-users mailing list, which you may also search to see if your problem has been addressed before.

3. Documentation

The paper describing TwitIE, which includes friendly screenshots and an idea of the flow of the application, can be found here: twitie-ranlp2013.pdf.

For the end user, TwitIE works in a very similar way to ANNIE, which is documented in the main GATE user guide; see the chapter "ANNIE: a Nearly-New Information Extraction System".

We also describe our named entity recognition performance and analysis in an earlier paper, presented at Hypertext 2013 - ner_issues.pdf. That paper investigates the difficulties of named entity recognition and named entity disambiguation on Twitter. Based on its findings, the TwitIE pipeline was augmented, alongside improvements made as part of the ARCOMEM project. The resulting Twitter pipeline reached an F1-measure of 80% on a common tweet NER dataset (from Ritter).

Correction of slang and misspellings is done with a normalisation PR. In English tweets, the majority of non-dictionary words (80% of occurrences) are made up of just a few variants, so we include a direct mapping that fixes most of these. For the remaining 20% of unrecognised words, we try to find a dictionary entry that is only a few letters different (for typos) or that sounds similar (for things like "yeeeaaaaah!" compared to "yeah!"). This gives us fast, effective normalisation.
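
The two-stage lookup described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the normalisation PR itself: SLANG_MAP and DICTIONARY are tiny hypothetical stand-ins for the real resources, collapsing repeated letters is used here as a simple substitute for sounds-similar matching, and difflib's similarity ratio stands in for a "few letters different" test.

```python
import difflib
import re

# Hypothetical stand-ins for the real resources (assumptions for this sketch).
SLANG_MAP = {"u": "you", "2moro": "tomorrow", "gr8": "great", "thx": "thanks"}
DICTIONARY = {"definitely", "great", "hello", "see", "thanks", "tomorrow", "yeah", "you"}


def squeeze_repeats(word):
    """Collapse each run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def normalise(token):
    """Return a dictionary form for token, or token unchanged if none is found."""
    word = token.lower()
    if word in DICTIONARY:
        return word
    # Stage 1: direct mapping for the frequent variants.
    if word in SLANG_MAP:
        return SLANG_MAP[word]
    # Stage 2a: elongated words, compared after collapsing repeated letters
    # (a crude substitute for phonetic matching).
    squeezed = squeeze_repeats(word)
    for entry in sorted(DICTIONARY):
        if squeeze_repeats(entry) == squeezed:
            return entry
    # Stage 2b: typos, using the closest dictionary entry above a similarity cutoff.
    close = difflib.get_close_matches(word, sorted(DICTIONARY), n=1, cutoff=0.8)
    return close[0] if close else token


if __name__ == "__main__":
    for tok in ["2moro", "yeeeaaaaah", "definately", "xyzzy"]:
        print(tok, "->", normalise(tok))
```

Checking the direct mapping first keeps the common case to a single lookup; the slower approximate search runs only for the remaining unrecognised words.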

You can find alternative PoS-tagger models that work with TwitIE's Tagger_Twitter via the GATE Twitter PoS tagger page.

4. How to cite

Please acknowledge TwitIE if you have found it useful in your work. Use this reference:

K. Bontcheva, L. Derczynski, A. Funk, M.A. Greenwood, D. Maynard and N. Aswani. 2013. "TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text". In Proceedings of the International Conference on Recent Advances in Natural Language Processing, ACL.

Or, if you prefer BibTeX:

@inproceedings{twitie,
    title = {{TwitIE}: An Open-Source Information Extraction Pipeline for Microblog Text},
    author = {Bontcheva, Kalina and Derczynski, Leon and Funk, Adam and Greenwood, Mark A. and Maynard, Diana and Aswani, Niraj},
    year = {2013},
    booktitle = {Proceedings of the International Conference on Recent Advances in Natural Language Processing},
    publisher = {Association for Computational Linguistics}
}

5. Acknowledgements

Research supported by UK EPSRC grants Nos. EP/I004327/1 and EP/K017896/1 uComp and by the European Union under grant agreements No. 611233 PHEME, No. 610829 DecarboNet, No. 270239 Arcomem, and No. 687847 COMRADES.

6. Release history

2013-09-05 Initial packaging and upload