Visualizing the Data Produced by the Political Futures Tracker
In the previous instalments of this series, taking you behind the scenes of the Political Futures Tracker, we looked at how the topic and sentiment analysis was performed, as well as how we are able to process the data from Twitter in real time. In this third part, we focus on the ways in which the Political Futures Tracker turns this raw data into visualizations which allow us to quickly see how the topic and sentiment of online discussion shifts in real time.
In the previous instalment, we explained how the data was captured from Twitter and then indexed and linked to an ontology containing information about candidates and former MPs, which allows us to formulate complex queries such as:
Find all positive sentiment expressions about the "UK economy" theme in tweets written by Labour candidates for constituencies in Greater London
which can be used to produce interesting visualizations. How that happens is the subject of this short article.
Building the visualizations for the Political Futures Tracker consists of two stages. First, queries are developed to extract raw statistical data from the indexed documents. Second, this raw data is used to drive interactive, web-based visualizations.
In general, visualizations are a good way of quickly presenting statistics, rather than raw data. While a single query, such as the one above, returns interesting information, it is designed more for finding specific examples than for visualizing sentiment at a given point in time, or changes over time. It is easy to see, though, how such queries could be generalised to gather statistics. For example, we could issue two queries:
Count all the positive sentiment expressions about the "UK economy" theme in tweets written by Labour candidates for constituencies in Greater London
Count all the negative sentiment expressions about the "UK economy" theme in tweets written by Labour candidates for constituencies in Greater London
We could then use this to determine whether the average sentiment is positive or negative. While this allows us to gather statistics rather than examples, further generalization lets us cover more of the collected tweets and assemble more information within a single visualization. Essentially, we take such a query and turn it into a template:
Count all the <sentiment> expressions about the <theme> in tweets written by <party> candidates for constituencies in <region>
Each of these template slots can take on multiple values:
- sentiment: can be either positive or negative
- theme: we recognise 45 different political themes
- party: we focus on the seven main UK political parties
- region: the UK consists of 12 main regions (known as NUTS 1 regions)
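Expanding such a template into every combination of slot values can be sketched as a simple Cartesian product. The slot values below are placeholders (the article does not list the actual themes, parties or regions), but the counts match those given above:

```python
from itertools import product

# Placeholder slot values; the real system recognises 45 political themes,
# 7 main UK parties and 12 NUTS 1 regions, whose names are not listed here.
sentiments = ["positive", "negative"]
themes = [f"theme_{i}" for i in range(45)]
parties = [f"party_{i}" for i in range(7)]
regions = [f"region_{i}" for i in range(12)]

template = ("Count all the {sentiment} expressions about the {theme} "
            "in tweets written by {party} candidates for "
            "constituencies in {region}")

# One concrete query per combination of slot values.
queries = [template.format(sentiment=s, theme=t, party=p, region=r)
           for s, t, p, r in product(sentiments, themes, parties, regions)]

print(len(queries))  # 2 * 45 * 7 * 12 = 7560
```

Running every one of these queries yields the 7,560 data points discussed below.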
In theory, we could run a query for every combination of values, which would give us 7,560 data points (2 × 45 × 7 × 12) from this single template. We refer to this as "query explosion", as one query can produce a vast number of data points. In practice, many of the themes attract little discussion, so we tended to focus on the top ten themes discussed during the specific time period.
Time is the other aspect of the data that we have not yet discussed. In the run-up to the election, we regularly looked at two kinds of time period. First, we looked at the most recent week or month, which allowed us to see the main themes rise and fall as the different campaigns highlighted different topics. The same approach, albeit on a smaller time scale, was used during the televised debates, where we generated statistics over the last five minutes of tweets to see how the public responded to the different questions and speakers. The main point here is that each visualization concerned data from a single time period. The second approach subdivided a time period into short segments to give a clearer picture of changes in the data over time. These visualizations usually tracked the usage of a hashtag in the run-up to a debate, dividing the day into five-minute blocks. Obviously, the more time periods there are, the more queries are required and the more data is generated.
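The five-minute segmentation described above amounts to rounding each tweet's timestamp down to the start of its block and counting tweets per block. A minimal sketch, using invented timestamps purely for illustration:

```python
from collections import defaultdict
from datetime import datetime

def block_start(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its five-minute block."""
    return ts.replace(minute=(ts.minute // minutes) * minutes,
                      second=0, microsecond=0)

# Hypothetical tweet timestamps during a televised debate.
tweets = [datetime(2015, 4, 2, 20, 1), datetime(2015, 4, 2, 20, 3),
          datetime(2015, 4, 2, 20, 7)]

# Count tweets falling into each five-minute block.
counts = defaultdict(int)
for t in tweets:
    counts[block_start(t)] += 1
```

Each key of `counts` is then one time segment on the visualization's horizontal axis.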
Building the Visualizations
Early in the project, we produced a number of static graphs which helped us to quickly summarise the data being produced. While these static graphs were really helpful, there is a limit to the amount of data they can display. Interactive visualizations are not only more interesting for people to use, but also allow a much larger volume of data to be presented quickly.
We produced a number of interactive visualizations which could be accessed and explored with just a web browser. More information about these can be found on Nesta's blog.