Annotated Datasets

Here you can find gold-standard datasets created as part of the DecarboNet project. The data was originally downloaded from the Media Watch for Climate Change.

The datasets are made available as dehydrated JSON files, one per corpus, in order to comply with Twitter's restrictions on redistributing tweet content. To rehydrate them, please download the scripts from GitHub and follow the instructions provided there.
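For readers unfamiliar with the idea, rehydration amounts to joining each annotated record back to its tweet text by tweet ID. The following is only a minimal sketch of that join, not the project's rehydration script: the field names `id`, `label` and `text`, the function name `rehydrate`, and the sample records are all assumptions made for illustration, and the step that actually fetches tweet text from Twitter is left out. The project's own scripts remain the supported route.

```python
import json


def rehydrate(dehydrated_records, texts_by_id):
    """Attach tweet text to dehydrated records.

    Returns (hydrated, missing): records that received text, and the IDs
    for which no text was available (for example, deleted tweets).
    """
    hydrated, missing = [], []
    for record in dehydrated_records:
        tweet_id = str(record["id"])  # "id" is an assumed field name
        text = texts_by_id.get(tweet_id)
        if text is None:
            missing.append(tweet_id)
            continue
        # Copy the record and add the text under an assumed "text" key
        hydrated.append({**record, "text": text})
    return hydrated, missing


# Illustrative records only; the real files may be structured differently.
sample = json.loads(
    '[{"id": "1", "label": "Positive"}, {"id": "2", "label": "Neutral"}]'
)
# Stand-in for text retrieved from Twitter, keyed by tweet ID.
texts = {"1": "example tweet text"}

hydrated, missing = rehydrate(sample, texts)
print(len(hydrated), "rehydrated;", len(missing), "unavailable")
```

Tracking the unavailable IDs separately is useful because some tweets are likely to have been deleted or made private since annotation, so a rehydrated corpus may be smaller than the counts given below.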

Licence: the annotations are provided under a CC-BY licence, while Twitter retains the ownership of, and rights to, the content of the tweets.

1. Corpora annotated with Sentiment

Earth Hour 2015 corpus: contains 600 tweets annotated with Sentiment information (Positive, Negative, Neutral). The annotations were crowdsourced, with each tweet annotated three times. For more information, see the following paper:
- D. Maynard and K. Bontcheva. Challenges of Evaluating Sentiment Analysis Tools on Social Media. In Proc. of the Language Resources and Evaluation Conference (LREC), May 2016, Portorož, Slovenia.

2. Corpora annotated with Environmental Terms

Climate change corpus: contains 456 tweets about climate change. The tweets were double-annotated, and conflicts were resolved by a linguist with expertise in term recognition.
Fracking corpus: contains 377 tweets about fracking, the Arctic and drilling. The tweets were double-annotated, and conflicts were resolved by a linguist with expertise in term recognition.
Energy corpus: contains 413 tweets about home energy. The tweets were double-annotated, and conflicts were resolved by a linguist with expertise in term recognition.