
Chapter 17
Tools for Social Media Data

Social media provides data that is highly valuable to many organizations, for example as a way to track public opinion about a company’s products or to discover attitudes towards “hot topics” and breaking news stories. However, processing social media text presents a set of unique challenges, and text processing tools designed to work on longer, better-formed texts such as news articles tend to perform badly on social media. Obtaining reasonable results on such short, inconsistent and ungrammatical texts requires tools that are specifically tuned to deal with them.

This chapter discusses the tools provided by GATE for use with social media data.

17.1 Tools for Twitter

The Twitter tools in GATE are provided in two plugins. The Format_Twitter plugin contains tools to load and save documents in GATE using the JSON format provided by the Twitter APIs, and the Twitter plugin contains a tokeniser and POS tagger tuned to Tweets, a tool to split up multi-word hashtags, and an example named entity recognition application called TwitIE which demonstrates all these components working together. The Twitter plugin makes use of PRs from the Stanford_CoreNLP plugin, which will be loaded automatically when the Twitter plugin is loaded.

The GATE Cloud version of TwitIE can be found here:
https://cloud.gate.ac.uk/shopfront/displayItem/twitie-named-entity-recognizer-for-tweets

17.2 Twitter JSON format

Twitter provides APIs to search for Tweets according to various criteria, and to collect streams of Tweets in real-time. These APIs return the Tweets in a structured JSON format¹, which the Format_Twitter plugin loads into GATE documents.

Loading the Format_Twitter plugin registers the document format with GATE, so that it is automatically associated with files whose names end in “.json”; for other files you need to specify text/x-json-twitter for the document mimeType parameter. This works both when directly creating a single new GATE document and when populating a corpus.

Each top-level tweet is loaded into a GATE document and covered with a Tweet annotation. Each of the tweets it contains (retweets, quoted tweets, etc.) is then added to the document and covered with a TweetSegment annotation². Each TweetSegment annotation has three features: textPath, entitiesPath, and tweetType. The last of these gives the type of tweet (retweet, quoted, etc.), whereas the first two give the dotted path through the JSON object to the fields from which the text and entities were extracted to produce that segment. All the JSON data is added as nested features on the top-level Tweet annotation.
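
As a rough illustration of what a dotted path such as textPath refers to, the Python sketch below follows a path through a nested JSON object. The function name resolve_path, the sample tweet and the path retweeted_status.full_text are assumptions chosen for illustration; they are not taken from the plugin.

```python
import json

def resolve_path(obj, dotted_path):
    """Follow a dotted path (e.g. "a.b.c") through nested JSON objects."""
    current = obj
    for key in dotted_path.split("."):
        current = current[key]
    return current

# Hypothetical tweet containing a nested retweet, for illustration only.
tweet = json.loads("""
{
  "full_text": "RT @some_user: this is a nice #example",
  "retweeted_status": {"full_text": "this is a nice #example"}
}
""")

# The text a segment with this (assumed) textPath would be built from.
segment_text = resolve_path(tweet, "retweeted_status.full_text")
```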

Multiple tweet objects in the same JSON file are separated by blank lines (which are not covered by Tweet annotations). Should you have such files and want to split them into multiple GATE documents, you can do this using the populator provided by the Format: JSON plugin, setting the MIME type to text/x-json-twitter. You can even set the name of each document to the ID of the tweet by setting the document ID parameter in the dialog to /id_str. See Section 23.30 for more details.
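
Outside GATE, the effect of splitting such a file and naming each piece after /id_str can be approximated as follows. This is only a sketch using the Python standard library: it treats /id_str as a lookup of the top-level id_str key, and the helper names and sample data are assumptions, not part of the populator.

```python
import json

def iter_json_objects(text):
    """Yield each JSON object from text holding several objects in a row."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip the newlines or blank lines between objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

def split_by_id(text, id_pointer="/id_str"):
    """Return {document name: tweet object}, named by the given top-level key."""
    key = id_pointer.lstrip("/")
    return {str(obj[key]): obj for obj in iter_json_objects(text)}

# Hypothetical two-tweet file content.
sample = '{"id_str": "1", "full_text": "first"}\n\n{"id_str": "2", "full_text": "second"}\n'
docs = split_by_id(sample)
```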

17.2.1 Entity annotations in JSON

Twitter’s JSON format provides a mechanism to represent annotations over the Tweet text as standoﬀ markup, via a JSON property named “entities”. The value of this property is an object with one property for each entity type, whose value is a list of objects representing the individual annotations. Within each individual entity object, the “indices” property gives start and end character oﬀsets of the annotation within the Tweet text.

{
  ...
  "full_text":"@some_user this is a nice #example",
  "entities":{
    "user_mentions":[
      {
        "indices":[0,10],
        "screen_name":"some_user",
        ...
      }
    ],
    "hashtags":[
      {
        "indices":[26,34],
        "text":"example"
      }
    ]
  }
}

When loaded into GATE the entity type (e.g. user_mentions) becomes the annotation type, the indices property provides the oﬀsets, and the other properties become features of the generated annotation.
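
The mapping just described can be sketched in a few lines of Python. This is not the plugin's code: the function name and the dictionary layout used for an annotation are assumptions, and the sample is the JSON example above with the elided parts left out.

```python
import json

def entities_to_annotations(tweet, text_key="full_text"):
    """Turn the "entities" object into a list of standoff annotations."""
    text = tweet[text_key]
    annotations = []
    for ann_type, items in tweet.get("entities", {}).items():
        for item in items:
            start, end = item["indices"]
            # Every property other than "indices" becomes a feature.
            features = {k: v for k, v in item.items() if k != "indices"}
            annotations.append({
                "type": ann_type,
                "start": start,
                "end": end,
                "features": features,
                "covered": text[start:end],
            })
    return annotations

tweet = json.loads("""
{
  "full_text": "@some_user this is a nice #example",
  "entities": {
    "user_mentions": [{"indices": [0, 10], "screen_name": "some_user"}],
    "hashtags": [{"indices": [26, 34], "text": "example"}]
  }
}
""")

annotations = entities_to_annotations(tweet)
```

Here the user_mentions annotation covers the text “@some_user” and the hashtags annotation covers “#example”.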

By default, the entity annotations are created in the “Original markups” annotation set, as is the usual convention for annotations generated by a document format. However, if the entity type contains a colon character (e.g. "Key:Person":[...]) then the portion before the colon is taken to be an annotation set name and the portion after the colon is the annotation type (in this example, a “Person” annotation in the “Key” annotation set). An empty annotation set name (i.e. ":Person") creates the corresponding annotations in the default annotation set. This scheme is designed to be compatible with the GATE JSON export mechanism described in the next section.
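
The naming rule can be expressed as a small helper. The sketch below is an assumption-laden illustration: the function name is made up, the empty string stands in for the default annotation set, and splitting on the first colon only is a guess about how names containing several colons would be handled.

```python
def split_entity_key(key, fallback_set="Original markups"):
    """Return (annotation set name, annotation type) for an entity type key."""
    if ":" in key:
        # Assumption: only the first colon separates set name from type.
        set_name, ann_type = key.split(":", 1)
        return set_name, ann_type
    return fallback_set, key

examples = {key: split_entity_key(key)
            for key in ("user_mentions", "Key:Person", ":Person")}
```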

17.3 Exporting GATE documents as JSON

Loading the Format_Twitter plugin also adds a “Twitter JSON” option to the “Save as…” right-click menu on documents and corpora, to export GATE documents in the Twitter-style JSON format. This tool can save a document or corpus of documents as a single file where each Tweet in the document or corpus is represented as a JSON object, and the set of objects is represented either as a single top-level JSON array ([{...},{...}]) or simply as one object per line (as per Twitter’s streaming APIs). This exporter can only be used on documents loaded from Twitter JSON (or documents with the same structure), as it relies on the Tweet and TweetSegment annotations to store the information back correctly into the original JSON structure.

The available options for the JSON exporter are:

entitiesAnnotationSetName: the primary annotation set that should be scanned for entity annotations.
annotationTypes: the entity annotation types to output.
exportAsArray: if true, output the objects as a top-level JSON array. If false (the default), output the JSON objects directly at the top level, separated by newlines.

Annotation types to be saved can be speciﬁed in two ways. Plain annotation type names such as “Person” will be taken from the speciﬁed entitiesAnnotationSetName, but if a type name contains a colon character (e.g. “Key:Person”) then the portion before the colon is treated as the annotation set name and the portion after the colon as the annotation type. The full name including the colon will be used as the type label in the “entities” object, so if the resulting JSON were re-loaded into GATE the annotations would be re-created in the same annotation sets they originally came from.
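
To make the two output layouts and the type-name rule concrete, here is a minimal Python sketch. It only mimics the behaviour described above: resolve_type and serialise are hypothetical names, the set name “NE” is an arbitrary example value for entitiesAnnotationSetName, and the tweets are made-up data.

```python
import json

def resolve_type(spec, entities_set):
    """Map an annotationTypes entry to (set name, annotation type, JSON label)."""
    if ":" in spec:
        set_name, ann_type = spec.split(":", 1)
        return set_name, ann_type, spec   # label keeps the colon form
    return entities_set, spec, spec

def serialise(tweets, export_as_array=False):
    """Render tweets as one JSON array, or as one object per line."""
    if export_as_array:
        return json.dumps(tweets)
    return "\n".join(json.dumps(t) for t in tweets)

tweets = [{"full_text": "first"}, {"full_text": "second"}]
as_array = serialise(tweets, export_as_array=True)
as_lines = serialise(tweets)
```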

17.4 Low-level PRs for Tweets

The Twitter plugin provides a number of low-level language processing components that are speciﬁcally tuned to Twitter data.

The “Twitter Tokenizer” PR is a specialization of the ANNIE English Tokeniser for use with Tweets. There are a number of diﬀerences in the way this tokeniser divides up the text compared to the default ANNIE PR:

URLs and abbreviations (such as “gr8” or “2day”) are treated as a single token.
User mentions (@username) are two tokens, one for the @ and one for the username.
Hashtags are likewise two tokens (the hash and the tag), but see below for another component that can split up multi-word hashtags.
“Emoticons” such as :-D can be treated as a single token. This requires a gazetteer of emoticons to be run before the tokeniser; an example gazetteer is provided in the Twitter plugin. This gazetteer also normalises the emoticons to help with classification, machine learning, etc. For example, :-D and 8D are both normalised to :D.
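
The differences listed above can be imitated with a single regular expression. The sketch below is a simplified stand-in written for illustration, not the Twitter Tokenizer itself: the pattern, the function names and the sample URL are assumptions, and the emoticon table contains only the two examples given above plus their normal form.

```python
import re

# Emoticon -> normalised form (only the examples mentioned in the text).
EMOTICONS = {":-D": ":D", "8D": ":D", ":D": ":D"}

_emoticon_alt = "|".join(
    re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)
)

# Order matters: URLs first, then the @ / # markers, then word-like runs
# (which also cover "gr8" and "2day"), then emoticons, then punctuation.
TOKEN_RE = re.compile(
    r"https?://\S+|[@#]|\w+|" + _emoticon_alt + r"|[^\w\s]"
)

def tokenize(text):
    """Split text into tokens following the simplified rules above."""
    return TOKEN_RE.findall(text)

def normalise_emoticon(token):
    """Map a known emoticon to its normalised form; leave other tokens alone."""
    return EMOTICONS.get(token, token)

tokens = tokenize("@some_user c u 2day :-D #example http://t.co/abc")
```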

The “Tweet Normaliser” PR uses a spelling correction dictionary to correct mis-spellings and a Twitter-speciﬁc dictionary to expand common abbreviations and substitutions. It replaces the string feature on matching tokens with the normalised form, preserving the original string value in the origString feature.
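
The effect on a token's features can be pictured as follows. This is a sketch only: the dictionary entries and their expansions are illustrative assumptions rather than the contents of the plugin's dictionaries, and a token is modelled as a plain Python dict of features.

```python
# Illustrative entries; the expansions are assumptions, not the plugin's data.
NORMALISATIONS = {"gr8": "great", "2day": "today"}

def normalise_token(features, dictionary=NORMALISATIONS):
    """Replace "string" with its normalised form, keeping the old value."""
    original = features["string"]
    replacement = dictionary.get(original.lower())
    if replacement is not None and replacement != original:
        features["origString"] = original   # preserve the original form
        features["string"] = replacement
    return features

changed = normalise_token({"string": "2day"})
unchanged = normalise_token({"string": "nice"})
```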

The “Twitter POS Tagger” PR uses the Stanford Tagger (section 23.22) with a model trained on Tweets. The POS tagger can take advantage of expanded strings produced by the normaliser PR.

17.5 Handling multi-word hashtags

When rendering a Tweet on the web, Twitter automatically converts contiguous sequences of alpha-numeric characters following a hash (#) into links to search for other Tweets that include the same string. Thus “hashtags” have rapidly become the de-facto standard way to mark a Tweet as relating to a particular theme, event, brand name, etc. Since hashtags cannot contain white space, it is common for users to form hashtags by running together a number of separate words, sometimes in “camel case” form but sometimes simply all in lower (or upper) case, for example “#worldgonemad” (as search queries on Twitter are not case-sensitive).

The “Hashtag Tokenizer” PR attempts to recover the original discrete words from such multi-word hashtags. It uses a large gazetteer of common English words, organization names, locations, etc. as well as slang words and contractions without the use of apostrophes (since hashtags are alphanumeric, words like “wouldn’t” tend to be expressed as “wouldnt” without the apostrophe). Camel-cased hashtags (#CamelCasedHashtag) are split at case changes.
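
The two strategies (splitting at case changes, and looking words up in a gazetteer) can be sketched in Python as below. This is not the PR's actual algorithm: the tiny word list stands in for the large gazetteer, and preferring the segmentation with the fewest words is an assumption made to keep the example short.

```python
import re

# Tiny illustrative word list standing in for the PR's large gazetteer.
WORDS = {"world", "gone", "mad", "wouldnt"}

def split_camel(tag):
    """Split at case changes, e.g. 'CamelCasedHashtag' -> 3 parts."""
    return re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", tag)

def segment(s, words=WORDS):
    """Return the shortest list of known words that spell s, or None."""
    s = s.lower()
    best = [None] * (len(s) + 1)   # best[i] = segmentation of s[:i]
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if best[j] is not None and s[j:i] in words:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(s)]

def split_hashtag(hashtag):
    """Split a hashtag into words, trying case changes before the word list."""
    tag = hashtag.lstrip("#")
    parts = split_camel(tag)
    if len(parts) > 1:
        return parts
    return segment(tag) or [tag]
```

With this word list, split_hashtag("#worldgonemad") yields the three words world, gone and mad, while a hashtag containing no known words is returned unsplit.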

More details, and an example use case, can be found in [Maynard & Greenwood 14].

The output of the hashtag tokenizer is twofold. Firstly, the Token annotations within the span of the hashtag are modified so as to accurately reflect the words within the hashtag. This allows PRs further down the pipeline to treat the sections of the hashtag as individual words for NE or sentiment analysis etc. Secondly, a tokenized feature is added to each Hashtag annotation. This is a lower-case version of the hashtag with Unicode ‘HAIR SPACE’ (U+200A) characters inserted between the separate tokens. This means that the feature continues, at first glance, to look like a hashtag (i.e. no spaces), but if two hashtags are tokenized differently the spacing becomes more obvious to the human eye. In general, you can therefore use the tokenized feature to group tweets by hashtag in a way that ignores differences in formatting and case, while still keeping hashtags apart when they represent semantically different concepts.
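
A value of this kind can be built as in the Python sketch below. It assumes the words of the hashtag are already known and that the leading # is kept in the feature value; both the function name and that detail are assumptions rather than documented behaviour.

```python
HAIR_SPACE = "\u200a"  # Unicode HAIR SPACE

def tokenized_feature(words):
    """Lower-case the words and join them with hair spaces after a '#'."""
    return "#" + HAIR_SPACE.join(w.lower() for w in words)

# Two spellings of the same hashtag give the same value...
a = tokenized_feature(["World", "Gone", "Mad"])
b = tokenized_feature(["world", "gone", "mad"])
# ...while a different word split of similar letters does not.
c = tokenized_feature(["worldgone", "mad"])
```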

17.6 The TwitIE Pipeline

The Twitter plugin includes a sample ready-made application called TwitIE, which combines the PRs described above with additional resources borrowed from ANNIE and the TextCat language identiﬁcation PR to produce a general-purpose named entity recognition pipeline for use with Tweets. TwitIE includes the following components:

Annotation Set Transfer to transfer Tweet annotations from the Original markups annotation set. For documents loaded using the JSON document format or corpus population logic, this means that each Tweet will be covered by a separate Tweet annotation in the ﬁnal output of TwitIE. Hashtags, URLs, UserMentions, and Symbols appearing in the original JSON are also transferred (and renamed appropriately) into the default set.
Language identiﬁcation PR (see section 15.1) using language models trained on English, French, German, Dutch and Spanish Tweets. This creates a feature lang on each Tweet annotation giving the detected language.
Twitter tokenizer described above, including a gazetteer of emoticons.
Hashtag tokenizer to split up hashtags consisting of multiple words.
The standard ANNIE gazetteer and sentence splitter.
Normaliser and POS tagger described above.
Named entity JAPE grammars, based largely on the ANNIE defaults but with some customizations.