Visualizing the Data Produced by the Political Futures Tracker
In the previous instalments of this series, taking you behind the scenes of the Political Futures Tracker, we looked at how the topic and sentiment analysis was performed, as well as how we are able to process the data from Twitter in real time. In this third part, we focus on the ways in which the Political Futures Tracker turns this raw data into visualizations which allow us to quickly see how the topic and sentiment of online discussion shifts in real time.
In the previous instalment, we explained how the data was captured from Twitter and then indexed and linked to an ontology containing information about candidates and former MPs, which allows us to formulate complex queries such as:
Find all positive sentiment expressions about the "UK economy" theme in tweets written by Labour candidates for constituencies in Greater London
which can be used to produce interesting visualizations. How that happens is the subject of this short article.
Building the visualizations for the Political Futures Tracker consists of two stages. First, queries are developed to extract raw statistical data from the indexed documents. Second, this raw data is used to drive interactive, web-based visualizations.
In general, visualizations are a good way of quickly presenting statistics, rather than raw data. While a single query, such as the one above, returns interesting information, it is designed more for finding specific examples than for visualizing sentiment at a given point in time, or changes over time. It is easy to see, though, how such queries could be generalised to gather statistics. For example, we could issue two queries:
Count all the positive sentiment expressions about the "UK economy" theme in tweets written by Labour candidates for constituencies in Greater London
Count all the negative sentiment expressions about the "UK economy" theme in tweets written by Labour candidates for constituencies in Greater London
We could then use this to determine whether the average sentiment is positive or negative. While this allows us to gather statistics rather than examples, further generalization lets us cover more of the collected tweets and assemble more information within a single visualization. Essentially, we take such a query and turn it into a template:
Count all the <sentiment> expressions about the <theme> in tweets written by <party> candidates for constituencies in <region>
Each of these template slots can take on multiple values:
- sentiment: can be either positive or negative
- theme: we recognise 45 different political themes
- party: we focus on the seven main UK political parties
- region: the UK consists of 12 main regions (known as NUTS 1 regions)
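Expanding such a template into every combination of slot values can be sketched as a simple Cartesian product. The slot values below are placeholders (the article does not list the actual themes, parties or regions), but the counts match those given above:

```python
from itertools import product

# Placeholder slot values; the real system recognises 45 political themes,
# 7 main UK parties and 12 NUTS 1 regions, whose names are not listed here.
sentiments = ["positive", "negative"]
themes = [f"theme_{i}" for i in range(45)]
parties = [f"party_{i}" for i in range(7)]
regions = [f"region_{i}" for i in range(12)]

template = ("Count all the {sentiment} expressions about the {theme} "
            "in tweets written by {party} candidates for "
            "constituencies in {region}")

# One concrete query per combination of slot values.
queries = [template.format(sentiment=s, theme=t, party=p, region=r)
           for s, t, p, r in product(sentiments, themes, parties, regions)]

print(len(queries))  # 2 * 45 * 7 * 12 = 7560
```

Running every one of these queries yields the 7,560 data points discussed below.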
In theory, we could run a query for every combination of values, which would give us 7,560 data points (2 × 45 × 7 × 12) from this single template. We refer to this as "query explosion", as one query can produce a vast number of data points. In practice, many of the themes attract little discussion, so we tended to focus on the top ten themes discussed during the specific time period.
Time is the other aspect of the data that we have not yet discussed. In the run-up to the election, we regularly looked at two kinds of time period. First, we looked at the most recent week or month, which allowed us to see the main themes rise and fall as the different campaigns highlighted different topics. The same approach, albeit on a smaller time scale, was used during the televised debates, where we generated statistics over the last five minutes of tweets to see how the public responded to the different questions and speakers. The main point here is that each visualization concerned data from a single time period. The second approach subdivided a time period into short segments to give a clearer picture of changes in the data over time. These visualizations usually tracked the usage of a hashtag in the run-up to a debate, dividing the day into five-minute blocks. Obviously, the more time periods there are, the more queries are required and the more data is generated.
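The five-minute segmentation described above amounts to rounding each tweet's timestamp down to the start of its block and counting tweets per block. A minimal sketch, using invented timestamps purely for illustration:

```python
from collections import defaultdict
from datetime import datetime

def block_start(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its five-minute block."""
    return ts.replace(minute=(ts.minute // minutes) * minutes,
                      second=0, microsecond=0)

# Hypothetical tweet timestamps during a televised debate.
tweets = [datetime(2015, 4, 2, 20, 1), datetime(2015, 4, 2, 20, 3),
          datetime(2015, 4, 2, 20, 7)]

# Count tweets falling into each five-minute block.
counts = defaultdict(int)
for t in tweets:
    counts[block_start(t)] += 1
```

Each key of `counts` is then one time segment on the visualization's horizontal axis.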
Building the Visualizations
Early in the project, we produced a number of static graphs which helped us to quickly summarise the data being produced. While these static graphs were really helpful, there is a limit to the amount of data they can display. Interactive visualizations are not only more interesting for people to use, but also allow a much larger volume of data to be presented quickly.
We produced a number of interactive visualizations which could be accessed and explored with just a web browser. More information about these can be found on Nesta's blog.