BBC and GATE
The BBC is leading the way towards more flexible and intelligent web publishing through Dynamic Semantic Publishing (DSP). The DSP architectural approach now underpins the recently relaunched and refreshed BBC Sports site as well as the BBC's Olympics 2012 online content, as described in this blog post by Jem Rayfield, Lead Technical Architect at the BBC's News and Knowledge Core Engineering department. The BBC Future Media department is using GATE alongside other technologies to realise this approach. The system achieves cost savings of around 80% compared to a conventional database-backed web system.
For extensive websites such as the BBC's, DSP makes it possible to author, structure and maintain content more efficiently by basing the site on a deep world model and a better understanding of its content. Metadata-driven page generation frees up journalistic time for more complex content, leaving the computer to link up concepts, organise material and write pages such as competition result listings and timelines.
The approach rests on a model of the domain, known as an ontology. This ontology describes the concepts relevant to the domain and the relationships between them. For example, in the 2012 Olympics, relevant concepts are athletes, teams and events, and relationships between them would indicate that athletes are members of teams and that athletes participate in events. Journalist-authored content such as stories and blogs is also included in the domain model and associated with relevant concepts. All stories with a concept relationship to "decathlon", for example, can therefore be gathered by a dynamic query to build a page of stories related to the decathlon. Basic concepts such as time and location are also required. Once the domain model has been created, web pages can be generated automatically from its content.
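The decathlon example can be made concrete with a minimal sketch. It is plain Python rather than the BBC's actual stack, and every identifier in it (the athlete, the story numbers, the relationship names) is hypothetical. It only shows the idea: stories are tied to concepts by relationships, and a page is the result of a query over those relationships.

```python
# Hypothetical mini domain model held as (subject, relationship, object) triples.
# None of these names come from the BBC ontology; they are illustrative only.
TRIPLES = {
    ("athlete:jane_doe", "memberOf", "team:gb"),
    ("athlete:jane_doe", "participatesIn", "event:decathlon"),
    ("story:101", "about", "event:decathlon"),
    ("story:102", "about", "athlete:jane_doe"),
    ("story:103", "about", "event:marathon"),
}


def match(subject=None, relation=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    )


def stories_about(concept):
    """Stories that are directly associated with a concept."""
    return [s for s, _, _ in match(relation="about", obj=concept)]


def stories_related_to_event(event):
    """Stories about the event itself or about athletes participating in it."""
    related = set(stories_about(event))
    for athlete, _, _ in match(relation="participatesIn", obj=event):
        related.update(stories_about(athlete))
    return sorted(related)


if __name__ == "__main__":
    # The story list for a hypothetical "decathlon" page.
    print(stories_related_to_event("event:decathlon"))
```

The second query follows one extra relationship (athlete participates in event), which is the kind of linking a flat tag list cannot express.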
A large domain model such as that required for the Olympics results in a large and complex site, so robust and scalable technologies are required to support it. Relations are stored as linked open data identifiers. Triples (pairs of concepts together with the relationship between them) are stored using BigOWLIM.
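To illustrate what a triple built from identifiers looks like, the sketch below writes a few hypothetical statements in N-Triples syntax, a standard line-based RDF serialisation. The `http://example.org/` namespace and all local names are placeholders, not BBC identifiers, and the sketch says nothing about how BigOWLIM is actually loaded.

```python
# Placeholder namespace; real linked open data identifiers would be used instead.
BASE = "http://example.org/"


def uri(local_name):
    """Wrap a local name as an N-Triples URI reference."""
    return f"<{BASE}{local_name}>"


def to_ntriples(triples):
    """Serialise (subject, relationship, object) tuples, one statement per line."""
    return "\n".join(f"{uri(s)} {uri(p)} {uri(o)} ." for s, p, o in triples)


# Hypothetical statements mirroring the athlete/team/event example.
EXAMPLE_TRIPLES = [
    ("athlete/jane_doe", "ontology/memberOf", "team/gb"),
    ("athlete/jane_doe", "ontology/participatesIn", "event/decathlon"),
]

if __name__ == "__main__":
    print(to_ntriples(EXAMPLE_TRIPLES))
```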
An important part of integrating journalist-authored content with the domain model is allowing journalists to annotate their work with concepts. For this reason, the BBC have created a tool called "Graffiti". This is where natural language processing (NLP) comes in, and where GATE plays an important role. Concepts are identified in text and suggested to the writer for annotation. The NLP functionality needs to be fully integrated with the domain model in order to suggest the right concepts. In the figure below, taken from Jem's article, you can see where GATE fits in with the other technologies to achieve this:
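The idea of suggesting concepts to a writer can be sketched with a simple dictionary lookup. This is a plain-Python stand-in, not GATE's API and not how Graffiti is implemented; the term list and concept identifiers are hypothetical, and a real system would draw its terms from the domain model and handle ambiguity and name variants.

```python
import re

# Hypothetical lookup list mapping surface forms to concept identifiers.
# In a real system these entries would be derived from the domain model.
GAZETTEER = {
    "decathlon": "event:decathlon",
    "team gb": "team:gb",
    "london": "place:london",
}


def suggest_concepts(text):
    """Return (start, end, matched_text, concept) tuples in document order."""
    suggestions = []
    for label, concept in GAZETTEER.items():
        # Whole-word, case-insensitive match of each known label.
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            suggestions.append((m.start(), m.end(), m.group(0), concept))
    return sorted(suggestions)


if __name__ == "__main__":
    for start, end, matched, concept in suggest_concepts(
        "Team GB wins decathlon gold in London"
    ):
        print(f"{start}-{end}: {matched!r} -> {concept}")
```

Each suggestion carries character offsets so that an editing tool could highlight the span and let the journalist accept or reject the proposed concept.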
The approach proved successful in the BBC World Cup 2010 website (see this blog post for more detail), and its use for the Olympics 2012 website further demonstrates its power in facilitating high-quality web publishing, as well as being part of a movement toward linked open data. Linking concepts in this way offers tremendous possibilities for sharing machine-readable data within and between organisations. GATE has been used successfully in linked open data applications, where it facilitates knowledge extraction from unstructured content and web data and automatically turns the results into RDF/OWL-based metadata, which can in turn be interlinked with Linked Data ontologies and datasets.