Sentiment analysis in GATE and Voice of the Customer
The UK National Archives (TNA)
TNA hold 90TB of .gov.uk archives dating back to 1997, with around one billion distinct pages stored. Of these, 35 million have been analysed with GATE Embedded on Amazon EC2, annotated relative to a BigOWLIM semantic repository, and indexed with GATE Mímir. We recently carried out a project with TNA to use GATE and Linked Open Data tools from Ontotext to aid access to their government website records. The system annotated common entity types like people, organisation, location, date, etc., and also more general measurement and amount types (which are also normalised), and some specific types like government department, civil service post, politicians, projects, etc. The system is available as additional functionality in the web search pages at TNA (orientated on end users) and via SPARQL and REST APIs (orientated on developers).
- blog post
- D. Maynard and M. Greenwood. Large Scale Semantic Annotation, Indexing and Search at The National Archives. In Proceedings of LREC 2012, May 2012, Istanbul, Turkey.
The BBC is leading the way towards more flexible and intelligent web publishing through Dynamic Semantic Publishing (DSP). The DSP architectural approach now underpins the recently re-launched and refreshed BBC Sports site as well as the BBC's Olympics 2012 online content. The BBC's Future Media department is using GATE alongside other technologies to realise this approach. The system achieves cost savings of ~80% compared to a conventional database-backed web system.
An important part of DSP is allowing journalists to annotate their work with concepts. For this reason, the BBC have created a tool called "Graffiti". This is where natural language processing (NLP) comes in, and where GATE plays an important role. Concepts are identified in text and suggested to the writer for annotation. To find out more, see this page.
The Press Association is also going full speed with similar efforts, following on from their long-running GATE project that processes the captions in their massive image library.
Media is a perfect application area for our text analysis and semantic modelling technology, partly because journalistic language is very well-behaved (relatively speaking!), partly because the content is extremely valuable, and partly because existing classification schemes are typically applied quite rigorously. Contact us to add GATE to your media systems.
Voice of the Customer
Businesses constantly strive to understand what their customers think about their products and services, what features they would like to see, what problems they experience and so on. Doing this by hand is costly and difficult (in some cases impossible: the volume of customer experience relating to popular products now present on the web is beyond the sizes feasible for traditional market research).
Voice of the customer (VoC) applications use sentiment analysis, information extraction and semantic annotation to mine text and speech for their customers' opinions, problems and desires. Typical inputs to the process include:
- blogs, forums, Twitter posts and the like
- customer feedback streams, e.g. text messages, emails, voice mail
GATE-based systems are in use at several VoC suppliers, including a company that analyses customer feedback from some of the largest transportation organisations in the UK, and a New York customer sentiment startup.
GATE has been in use in pharmaceuticals research since the late 1990s. An early system based on GATE at Glaxo allowed navigation around scientific papers according to the chemical and drug terms they contained. At around the same time Merck ran a cluster of 100 machines doing GATE-based annotation of the Medline scientific abstracts database. More recently we've had active users at AstraZeneca, Roche, Eli Lilly, and others.
Ontotext have a suite of products using GATE for Life Sciences applications, including the biggest agglomeration of RDF data derived from bio-science databases.
As part of the LarKC project we recently ran an experiment to try to replicate automatically a result that the WHO's cancer research lab published in Nature last year. This previously published result showed that a particular genetic polymorphism correlates with increased risk of lung cancer. The discovery required a large amount of manual work to examine data from sensor arrays. When analysing this data, the usual statistical techniques need large numbers of samples to make the analysis usable and reliable. In addition, the usual techniques do not make use of any previous knowledge that might have been published about particular genes and the disease.
In our experiment, Bayesian False Discovery Probability (BFDP) was used to take into account prior knowledge about genes. So if, for example, we already know that a gene is expressed in lung tissue, we can allow for this in the BFDP model, when calculating the relevance of sensor data for particular polymorphisms. Prior knowledge about genes is buried in the text of scientific papers, and so to make use of it in BFDP, we used text mining to find those papers that discuss particular genes, diseases, anatomy and so on.
Using BFDP with text mining in an evaluation, we have been able to find genes associated with lung cancer, using half the data that would have been needed by the typical statistical techniques. In terms of the lab work equivalent, the client would have saved around €300k using this technique.
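The role of prior knowledge in BFDP can be illustrated with a minimal sketch. It assumes Wakefield's approximate Bayes factor formulation of BFDP; the text above does not say which form the experiment actually used. All function and parameter names here are hypothetical, and the idea that text mining lowers the prior probability of "no association" for a gene is an illustrative reading of the description, not the project's code.

```python
import math


def approximate_bayes_factor(theta_hat, v, w):
    """Approximate Bayes factor for H0 (no association) versus H1.

    theta_hat: estimated effect size (e.g. a log odds ratio)
    v: variance of that estimate
    w: prior variance of the true effect under H1 (an assumption)
    """
    z_squared = theta_hat ** 2 / v
    shrinkage = w / (v + w)
    return math.sqrt((v + w) / v) * math.exp(-0.5 * z_squared * shrinkage)


def bfdp(theta_hat, v, w, prior_null):
    """Bayesian False Discovery Probability: posterior probability of H0.

    prior_null: prior probability that there is NO association.
    Prior knowledge (for example, papers reporting that the gene is
    expressed in lung tissue) would be encoded by lowering this value.
    """
    prior_odds = prior_null / (1.0 - prior_null)
    weighted = approximate_bayes_factor(theta_hat, v, w) * prior_odds
    return weighted / (1.0 + weighted)


if __name__ == "__main__":
    # Same observed data, two different levels of prior knowledge.
    no_literature = bfdp(theta_hat=0.3, v=0.01, w=0.04, prior_null=0.99)
    with_literature = bfdp(theta_hat=0.3, v=0.01, w=0.04, prior_null=0.90)
    print(no_literature, with_literature)
```

For the same observed effect, a lower `prior_null` always gives a lower BFDP, which is one way to see why literature-derived priors can let an association be reported from less data.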
The Spock people search engine (now acquired by Intelius) and the Garlik personal privacy service both rely on GATE for web mining of personal data. Intelius' Andrew Borthwick is a committer on GATE's core SourceForge project.
A typical problem these days is to interpret the results of focussed web crawls. It costs too much (at present) to extract complex data from the whole web. A focussed crawl makes the data small enough to process automatically, but it is still too large to analyse by hand. GATE's information extraction / semantic annotation capabilities are often used to pull out the entities, relations and events of interest from the web pages returned by the crawler.
Can you read the doctor's handwriting? Many healthcare organisations are now phasing out hand-written material, which makes a whole new set of data available to automatic analysis (or, more often in this context, semi-automatic analysis). GATE customers are working with clinical reports in various settings, for example to improve decision support assistance.
The MedCPU product suite uses innovative text mining to hook up its clinical decision assistance systems to the natural language that doctors find easiest to express themselves in - and not a single unreadable squiggle in sight!
Job seeking used to involve a marker-pen and the local paper; now it means hours hunkered over a hot laptop discovering more about Google tricks than you really wanted to know. Several companies have GATE-based mining of job adverts and CVs, including Innovantage, a well-established company in Bristol who mine the UK's company sites and supply jobs boards and recruitment companies with the results.
- Populate your calendar with GATE on Jussle.
- The Stationery Office use GATE in the London Gazette project and elsewhere.
- The Press Association use GATE for text mining.
- Solcara and others use GATE for Enterprise Search systems.
- Altaplana uses GATE in text analysis consulting projects.
- Sentimetrix use GATE for sentiment analysis.