GATE Twitter part-of-speech tagger
1. Introduction
Part-of-speech tagging tweets is hard. This is our state-of-the-art tagger. The tagger achieves competitive accuracy, and uses the Penn Treebank tagset, so that all your other tools should integrate seamlessly.
The tagger is an adapted and augmented version of a leading CRF-based tagger, customised for English tweets. It is released both as a GATE PR and as a standalone command-line tool (written in Java, so it runs on any operating system). It achieves 91% token accuracy on our evaluation set, which is very high for this genre. Importantly, it also has relatively high whole-sentence accuracy: getting every tag in a sentence right is crucial for downstream tasks like dependency parsing and event extraction.
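To see why whole-sentence accuracy is a much harder target than token accuracy, here is a rough back-of-the-envelope sketch. It assumes tagging errors are independent across tokens, which is a simplification of ours and not a claim from the paper; the function name and the tweet lengths are illustrative only.

```python
def sentence_accuracy(token_accuracy: float, n_tokens: int) -> float:
    """Chance that all n_tokens in a sentence are tagged correctly.

    Assumes errors are independent across tokens (a simplification;
    real tagging errors tend to cluster).
    """
    return token_accuracy ** n_tokens


# Print the estimate for a few hypothetical tweet lengths at 91% token accuracy.
for n in (5, 10, 15, 20):
    print(n, round(sentence_accuracy(0.91, n), 3))
```

Under this assumption, a 15-token tweet is tagged entirely correctly only about a quarter of the time, which is why whole-sentence performance is worth reporting separately from token accuracy.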
2. Resources provided
The GATE Twitter PoS tagger is distributed in a number of ways - choose whichever suits your needs best.
- First, as part of the Twitter plugin for GATE (currently available via SVN or the nightly builds);
- Second, as a standalone Java program, again with all features, as well as a demo and test dataset - twitie-tagger.zip;
- Third, as a model for the Stanford tagger v3.3.1 and v3.4, distributed as a single file, for use in existing applications (this excludes handling of slang and prior probabilities) - gate-EN-twitter.model.
The default model is included with these. Use is detailed in the README. As with the rest of GATE, the tagger is licensed under the LGPLv3, and contains an instance of the Stanford Tagger.
A high-speed model file is also available, which trades about 2.5% token accuracy for roughly double the tagging speed - gate-EN-twitter-fast.model
We also provide the bootstrapped corpus, allowing replication of our results with this large, high-confidence dataset (97.5% accuracy).
- Bootstrapped PoS-tagged corpus (one sentence per line, space tokenised, PTB tagset) - twitter_bootstrap_corpus.tar.gz
- Script used to create the bootstrapped corpus from a set of tweets labeled with both ARK roughPOS and PTB tags - corpusdiff.py
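As a minimal sketch of how the bootstrapped corpus might be read, the snippet below splits one line into (token, tag) pairs. The corpus description above says only that it is one sentence per line and space tokenised; the `token_TAG` item format and the `_` separator are assumptions on our part (borrowed from the Stanford tagger's usual convention), so check them against the actual file. The function name is hypothetical.

```python
from typing import List, Tuple


def parse_tagged_line(line: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Split one space-tokenised line into (token, tag) pairs.

    Assumes every item looks like token<sep>TAG. The separator is an
    assumption; verify it against the corpus before relying on this.
    """
    pairs = []
    for item in line.split():
        # Split on the LAST separator so tokens that contain it stay intact,
        # e.g. "user_name_NN" becomes ("user_name", "NN").
        token, found, tag = item.rpartition(sep)
        if not found:
            raise ValueError(f"no {sep!r} separator in item: {item!r}")
        pairs.append((token, tag))
    return pairs
```

For example, `parse_tagged_line("so_RB tired_JJ today_NN")` yields three pairs, one per token. Splitting on the last separator matters for tweets, where usernames and hashtags often contain underscores themselves.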
Finally, the tagger is distributed in our full social media processing pipeline, TwitIE.
3. Documentation
The paper describing the tagger can be found here: twitter_pos.pdf.
Requires Java 1.6.0 or above.
The input file should contain one plaintext tweet per line, with spaces separating tokens. General use is of the form:
java -jar twitie_tag.jar <path to model file> <path to input file>
To run the tagger using the best model reported in the paper, do:
java -jar twitie_tag.jar models/english-twitter.model <input file>
Tagged tokens are output on stdout; status information on stderr. This means that if you want to save the output, simply redirect stdout:
java -jar twitie_tag.jar <path to model file> \
    <path to input file> > <output file>
For example:
java -jar twitie_tag.jar vcboot.1543K-twitter.model \
    corpora/ritter_dev.nolabels > ritter_dev.tagged
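The same invocation can be scripted. The sketch below is one possible Python wrapper, not part of the distribution: it writes tweets in the expected one-per-line format, builds the `java -jar` command shown above, and keeps stdout (tagged tokens) separate from stderr (status information). The function names are hypothetical, and plain whitespace splitting stands in for a real tweet tokeniser.

```python
import subprocess
from typing import Iterable, List


def write_input_file(tweets: Iterable[str], path: str) -> int:
    """Write one tweet per line, tokens separated by single spaces.

    Whitespace splitting is a stand-in for a real tokeniser (assumption).
    Returns the number of tweets written; blank tweets are skipped.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for tweet in tweets:
            tokens = tweet.split()  # also drops embedded newlines and tabs
            if tokens:
                out.write(" ".join(tokens) + "\n")
                count += 1
    return count


def build_command(model: str, input_file: str,
                  jar: str = "twitie_tag.jar") -> List[str]:
    """Return the argument list for: java -jar <jar> <model> <input file>."""
    return ["java", "-jar", jar, model, input_file]


def tag_file(model: str, input_file: str, output_file: str,
             jar: str = "twitie_tag.jar") -> str:
    """Run the tagger, saving stdout to output_file and returning stderr text."""
    with open(output_file, "w", encoding="utf-8") as out:
        result = subprocess.run(
            build_command(model, input_file, jar),
            stdout=out,              # tagged tokens go to the output file
            stderr=subprocess.PIPE,  # status information is captured separately
            text=True,
            check=True,              # raise CalledProcessError on non-zero exit
        )
    return result.stderr
```

A call such as `tag_file("models/english-twitter.model", "tweets.txt", "tweets.tagged")` would then mirror the shell redirect above, assuming Java is on the PATH and the jar is in the working directory.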
There is a known "InvocationTargetException" problem when using Apple's own Java distribution; using the OpenJDK JRE can remedy this.
4. How to cite
Please acknowledge the tagger if you have found it useful in your work. The paper describing the tagger can be found here: twitter_pos.pdf. Use this reference:
L. Derczynski, A. Ritter, S. Clark, and K. Bontcheva. 2013. "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data". In Proceedings of the International Conference on Recent Advances in Natural Language Processing, ACL.
Or, if you prefer BibTeX:
@inproceedings{
  title = {Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data},
  author = {Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  year = {2013},
  booktitle = {Proceedings of the International Conference on Recent Advances in Natural Language Processing},
  publisher = {Association for Computational Linguistics}
}
5. Penn TreeBank part-of-speech standardization
In order to maximize the interoperability of the part-of-speech annotations provided by the Twitter tagger and GATE's ANNIE tagger, a GATE pipeline has been created which adds various notations derived from widely used de facto standards and best-practice descriptors. Standard representations for Penn TreeBank part-of-speech tags will embed these in a larger network of part-of-speech vocabularies, and enable interoperability with a wide range of part-of-speech tag sets.
6. Updates
2014 June 18 - Update to CoreNLP v3.4; fix paths to lookup lists
2014 April 11 - Update models and standalone tagger to v3.3.1 of Stanford CoreNLP; add vote-constrained data from the ARK datasets (40% extra gold data); include examples using variations of hashtag tokenisation
2013 August 28 - Include various instances of contractions in training data, to reduce sensitivity to other tokenisation schemes
2013 July 30 - Updated default model to use all available data (including T-Pos dev. and eval sets)
2013 July 15 - Original release