GATE Projects

This page lists some of the projects involving GATE which are currently being undertaken, either by our team or outside Sheffield.

(This page is almost always out of date, as new projects start all the time.)

Projects with Sheffield involvement

[Language Technologies/Social Media | Digital Libraries/Corpus Annotation/Crowdsourcing | E-science/Grid/Cloud | Semantic Web/Knowledge Technologies]

***** Language Technologies and Social Media Mining

Political Futures Tracker
Nesta-funded project (Nov 2014 - May 2015)

Analysis of political tweets and other texts in the run-up to the 2015 UK General Election

EC-funded project (Jan 2014 - Dec 2016), coordinated by Kalina Bontcheva from the GATE team

Identifying and tracking rumours as they spread through social media.

DFKI, University of Sheffield, Ontotext AD, University of Southampton, Stichting Internet Memory Foundation, Eurokleis S.R.L., Sora Ogris & Hofinger GMBH, Hardik Fintrade Pvt Ltd.

Innovative, portable open-source real-time methods for cross-lingual mining and summarisation of large-scale stream media.

EPSRC-funded CHIST-ERA project (Nov.2012 - Nov.2015), led by University of Sheffield

Embedded Human Computation for Knowledge Extraction and Evaluation.

EC-funded project (Oct.2013 - Sep.2016), led by the Knowledge Media Institute

We are building a decarbonisation platform for translating collective awareness of climate change into behavioural change. Our role is in mining social media: named entities, events, opinions, and controversies, using Linked Open Data.

University of Sheffield

Named entity recognition from diverse text types and genres.

University of Sheffield

An adaptive Information Extraction tool that uses GATE's open-source machine learning tools and allows users to train the system collaboratively by annotating a shared corpus in a Web browser

University of Sheffield, CNRS-LIMSI, GE Service Centre GMBH, VECSYS, VIEL & CIE, State University of New York, Duke University, GE Research & Development

Building empirically induced dialogue processors to support multilingual human-computer interaction.

University of Sheffield

Armadillo uses multiple strategies, including IE using GATE components, to model a domain by connecting various entities and components, and to build an RDF ontology and knowledge base.

EU Funded eContent Project (Jan 2005 - Jun 2007) led by INRIA-LORIA, France

LIRICS aims to provide ISO-ratified standards for language technology, with an open-source implementation platform and services, to enable the exchange and reuse of multilingual language resources in response to the needs of today's multilingual information and communication society.

JISC sub-award (April 2009 to March 2011) led by King's College, London

EU Funded FP7 Project (Jan 2008 to Dec 2010) led by the University of Utrecht, Netherlands

CLARIN's mission is to create an infrastructure which makes language resources (annotated recordings, texts, lexica, ontologies) and technology (speech recognizers, lemmatizers, parsers, summarizers, information extractors) available and readily usable to scholars of all disciplines, in particular the humanities and social sciences (HSS).

***** Digital Libraries: Corpus annotation and processing

EU-funded Integrated Project (Feb.2013 - Jan.2016)

Concise Preservation by combining Managed Forgetting and Contextualized Remembering

EU-funded Integrated Project (Jan.2011 - Dec.2013), led by University of Sheffield

From archives to community memories.

EPSRC-funded CHIST-ERA project (Nov.2012 - Nov.2015), led by University of Sheffield

Embedded Human Computation for Knowledge Extraction and Evaluation.

The National Archives
Ontotext, University of Sheffield, System Simulation Limited

Bringing semantic annotation to the UK government's web archive.

A JISC-funded project with the British Library and HR Wallingford.

Semantic annotation and search with Linked Open Data, in the domain of environmental science.

University of Sheffield, University of Oxford

The project is building generic tools for linguistic annotation and Web-based analysis of literary Sumerian.

University of Lancaster & University of Sheffield

Building a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK.

University of Sheffield

Named entity recognition on 17th century Old Bailey Court reports.

***** Digital Libraries: Multimedia

EU-funded Integrated Project (Feb.2004 - Jun.2007), led by Institut national de l'audiovisuel (INA), France

The project's objective is to provide technical solutions and integrated systems for a complete digital preservation of all kinds of audio-visual collections. Our role is to develop language technology methods for (semi-)automatic creation of metadata from multimedia content.

CTIT (Netherlands), University of Sheffield, University of Nijmegen (Netherlands), DFKI (Germany), Max Planck Institut für Psycholinguistik (Germany), ESTEAM (Sweden), VDA (Netherlands)

Automatic creation of indexes into multimedia programme material, using data from several sources and several languages, in the domain of football.

University of Sheffield, University of Surrey

Integration of knowledge acquisition, information extraction, image processing and speech recognition technologies in the domain of police crime reports.

University of Sheffield

Summarisation of information from company reports to generate statistics about the level of compliance with Health and Safety recommendations and legislation.

***** E-science, Grid and Cloud Computing

EU Funded Small or medium scale focused research project (STREP)

A collaborative project funded by the EC that aims to deliver an affordable, open marketplace for pay-as-you-go, cloud-based extraction resources and services, in multiple languages.

GATE Cloud Exploratory
University of Sheffield

A small exploratory project funded by JISC and the EPSRC to experiment with various aspects of cloud computing.

EU-funded Integrated Project (Sep.2010 - Aug.2014), led by University of Applied Sciences of Western Switzerland

A knowledge-helper for health information.

University of Manchester, University of Sheffield

An e-science bioinformatics project for biodiversity support.

University of Sheffield, University of Southampton, Open University (KMI), University of Oxford, Guy's Hospital, King's College London

Collaborative problem solving environments in Medical Informatics, using knowledge services provided by the e-Science grid infrastructure.

University of Manchester, University of Newcastle, University of Nottingham, University of Sheffield, University of Southampton, IT Innovation Centre, European Bioinformatics Institute

Extending the GRID framework of distributed computing by producing a virtual laboratory bench that will support the life sciences community and make use of complex distributed resources.

University of Manchester, CHIME/University College London, University of Brighton, University of Sheffield, Cambridge University Health

Building on E-Science technology to embed a full information cycle within practical clinical systems, building tools to integrate patient information from text and images, and linking clinical and genomic research.

***** Semantic Web and Knowledge Technologies

EPSRC-funded CHIST-ERA project (Nov.2012 - Nov.2015), led by University of Sheffield

Embedded Human Computation for Knowledge Extraction and Evaluation.

EC-funded project (Oct.2013 - Sep.2016), led by the Knowledge Media Institute

We are building a decarbonisation platform for translating collective awareness of climate change into behavioural change. Our role is in mining social media: named entities, events, opinions, and controversies, using Linked Open Data.

University of Aberdeen, University of Edinburgh, Open University, University of Sheffield, University of Southampton

Builds new knowledge acquisition, retrieval, and publishing tools based on Language Engineering and using GATE.

EU-funded Integrated Project (Jan.2004 - Dec.2006), led by BT

The vision of SEKT is to develop and exploit the knowledge technologies which underlie Next Generation Knowledge Management. The SEKT strategy is built around the synergy of the complementary know-how in Ontology and Metadata Technology, Knowledge Discovery and Human Language Technology.

EU-funded Network of Excellence (Jan.2004 - Dec.2007), led by University of Innsbruck

The transition of the Semantic Web from an academic adventure into a technology provided by the software industry is still a long way off. The main goal of KnowledgeWeb is to support this process. Our role is in providing expertise on the role of Human Language Technology in ontology-based applications.

University of Surrey, University of Sheffield, Athens Technology Center, University of Innsbruck, Universitat Rovira i Virgili

An IST project developing a knowledge management platform with intelligence and insight capabilities for technology intensive industries. Sheffield provides the platform with a targeted search module to analyse the content of webpages and track interesting instances of concepts over time. 

University of Southampton, University of Sheffield

This informal project, which is part of AKT, produces composite descriptions of cultural artefacts and figures (e.g. Rembrandt) from diverse web pages, using a GATE-based Natural Language Generation system. ArtEquAkt is a collaboration between the Equator wearable computing project and the AKT Knowledge Technologies project.

University of Sheffield, University of Karlsruhe (Germany), Open University, Ontoprise (Germany), Quinary (Italy), ITC-IRST (Italy)

This project aims at defining tools and methodologies for IE-based Knowledge Management, focusing on adaptive IE using Machine Learning, and will be developed using GATE components.

University of Sheffield

An ontology building environment incorporating adaptive IE using GATE components.

University of Sheffield

EU-funded Specific Targeted Research Project (Mar 2006 - Feb 2009), led by University of Sheffield, UK

The project's goal is to show how existing 'legacy' applications can migrate to open, semantic-based Service-Oriented Architectures at acceptable development cost. Sheffield's main role will be on automatic methods for content augmentation and integration.

EU-funded Integrated Project (Apr 2006 - Feb 2010), led by The Open University, UK

NeOn aims to provide a considerable improvement in the level of support available for ontology engineering by developing both a reference architecture and a concrete toolkit supporting the whole ontology engineering life cycle. The University of Sheffield will develop an open-source demonstrator of collaborative semantic annotation with networked ontologies.

EU-funded Integrated Project (Apr 2008 - Sept 2011), led by University of Innsbruck

Current Semantic Web reasoning systems do not scale to the requirements of their hottest applications, such as dealing with terabytes of scientific data. LarKC aims to remove these scalability barriers by using massive distributed incomplete reasoning. The University of Sheffield's main roles are to provide methods for retrieval and selection of data for reasoning, and to demonstrate the platform in support of carcinogenesis research.

EU-funded Specific Targeted Research Project (Mar 2006 - Feb 2009), led by Joanneum Research, Austria

The project's main goal is to automate to a large degree the detection and tracking of media campaigns on television, on the Internet and in the press. MediaCampaign's focus is on discovering, inter-relating and navigating cross-media campaign knowledge. The University of Sheffield would be involved in text analysis (press, Internet, speech transcript), product knowledge interlinking, unification of two or more (partial) descriptions of an instance, and knowledge fusion for campaign discovery.

EU-funded Integrated Project (2006 - 2010) led by Metaware S.p.A., Italy

MUSING will integrate Semantic Web and Human Language technologies and combine declarative rule-based methods and statistical approaches for enhancing the technological foundations of knowledge acquisition and reasoning in BI applications.

EU-funded Project (2008 - 2009) led by CEFRIEL, Italy

Service-Finder aims to develop a platform for service discovery in which Web Services are embedded in a Web 2.0 environment.

Projects Outside Sheffield

[Semantic Web | Digital Libraries / Cultural Heritage | E-science/bio-informatics | Language Technology | Other Applications]

***** Knowledge Management and Semantic Web

Ontotext, Bulgaria

ANNIE-powered Semantic Web annotation as part of their Knowledge and Information Management (KIM) platform.

Knallgrau New Media Solutions GmbH, Germany

The goal of MeManage is to relate personal information on your computer and on your peers' computers to one another. It combines some ideas from the Semantic Web with the ease and simplicity of weblogs and wikis, plus a little social software.


The availability of semantic markup on the web opens the way to novel, sophisticated forms of question answering. While semantic information can be used in several different ways to improve question answering, an important consequence of the availability of semantic markup on the web is that it can be queried directly. AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input and returns answers drawn from one or more knowledge bases (KBs), which instantiate the input ontology with domain-specific information. AquaLog presents an elegant solution in which different strategies are combined. It makes use of the GATE NLP platform as part of the linguistic processing, string metric algorithms, and a learning mechanism, as a solution for managing lexical resources, including domain-dependent lexica and generic resources such as WordNet. AquaLog also makes use of a novel ontology-based relation similarity service to make sense of user queries with respect to the target knowledge base. Contact email: v.lopez@open.ac.uk

Med Dictate
Medwrite Inc, Anaheim, Canada

The automation of customized retrieval of medical transcriptions.

Engineering a Semantic Desktop for Building Historians and Architects

We analyse the requirements for advanced semantic support of users (building historians and architects) of a multi-volume encyclopedia of architecture from the late 19th century. Novel requirements include the integration of content retrieval, content development, and automated content analysis based on natural language processing. We present a system architecture for the detected requirements and its current implementation. A complex scenario demonstrates how a desktop supporting semantic analysis can contribute to specific, relevant user tasks. Email: witte@ipd.uka.de

***** Digital Libraries and Cultural Heritage

University of Waikato, New Zealand

Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source, multilingual software, issued under the terms of the GNU General Public License.

Greenstone uses GATE and ANNIE to enhance digital collections by addition of metadata.


The European Heritage On-Line (ECHO) project is developing a model for European culture on the web. The GATE team is represented on the technical board of ECHO and is working towards the transfer of advanced text processing tools to help produce a new model of richly interlinked shared cultural materials.

Tufts University, Massachusetts, USA

The Perseus digital library, one of the largest and most advanced such projects in the world, uses GATE for corpus annotation and language processing.

***** E-science and bioinformatics

Parallel IE
Merck KGaA, Darmstadt, Germany

Information Extraction on a Linux cluster for bio-medical text mining and indexing.

Medical Informatics
University of Pittsburgh, USA

Annotating surgical pathology reports using UMLS.

Medical Informatics
Institute for Medical Informatics and Biometry, University of Rostock, Germany

Analysing MEDLINE abstracts to extract causal functional relations, which are essential for the construction of genetic networks, as a step towards characterisation of diseases.

University College, London, U.K.

BioRAT is a general-purpose information extraction tool, designed to be used by biologists to mine text from journals. It has been successfully applied to protein-protein interaction discovery, and projects are underway in several other areas. It uses GATE at its core, while also providing tools to design new templates; to edit gazetteers; and to download full-length papers from the web. The software is available for academic use, and is part of a research project, funded by the BBSRC.

Institute for Medical Informatics and Biometry, University of Rostock, Germany

The information extraction for structural biology project aims to extract information from the 'material and method' part of structural biology publications. The purpose is to populate a database. Some of the information for the database is retrieved from structured files in PDB format (see http://www.rcsb.org/pdb/). The materials and methods used for experiments are not recorded in PDB files, so we intend to extract that information from the text of the publications. Contact email: huault@igbmc.u-strasbg.fr.

Visualization of Consumer Health Information
School of Information Systems & Technology, Claremont, USA

Research indicates that the text on many popular web sites is difficult to understand and consumers find that reading documents in electronic format is problematic. Since health information read online influences the patient-doctor relationship - e.g., treatments requested, or perceived patient value from a doctor's visit - it is important that this information be interpreted and remembered as completely and correctly as possible. Misunderstandings in health information may increase the risk of making unwise health decisions, which could lead to poorer health and higher health care costs. The goal of the project is to develop and test new technology that can present online health information that is easier to understand and remember. Prototypes will be developed that will visualize both the structure and content of web pages to increase understanding and retention without oversimplification. A small pilot study has shown positive effects of such a representation. The two prototypes will differ in how much content detail is included in the visualization. They will be evaluated for their effects on understanding and retention of information and compared with currently existing web sites. User behavior and preferences will also be captured and analyzed. Three user groups will participate in the development and evaluation of the prototypes: elderly consumers, Hispanic non-native speakers, and patients. These groups were chosen for their specific characteristics (age related problems, sub-optimal command of English, and patients' stress) that may require improved information presentation. Contact: Gondy Leroy

eHealth GATEway project
University of Leeds, UK

Anonymisation of patient health records with GATE. JISC-funded project (Feb-Aug 2012). Contact: Owen Johnson

HiTEx project
U.S. National Library of Medicine, National Institutes of Health, USA

Health Information Text Extraction (HiTEx) tool based on GATE - a modular system that assembles a different pipeline for extracting specific findings from clinical narratives. Funded by National Library of Medicine, National Institutes of Health, USA. Paper: "What can Natural Language Processing do for Clinical Decision Support?" Contact: Dina Demner-Fushman

***** Human Language Technology

Pieces Evidence
TRW Systems, Colorado Springs, USA

Converting text into pieces of evidence as part of an R&D project for the US government.

IE
Denso IT Laboratory, Japan

Development of IE and other language tools for in-car navigation systems and automobile-related speech and language technology.

Database technology
Birkbeck College, London, U.K.

Using Information Extraction (IE) to enhance the support for text in database technology.

Enactable Models
Middlesex University, U.K.

Building a summarisation system based on discourse structure.

Semantic Analysis Over Sparse Data

A Johns Hopkins 2003 summer workshop at the Center for Language and Speech Processing on learning-based semantic annotation to reduce data sparseness in diverse corpora.

Imperial College, London

Building a summarisation system entered in the Document Understanding Conference (DUC 2002) evaluation.

Named Entity recognition for Machine Translation
University of Leeds, U.K.

Work on improving MT systems using NE recognition in GATE.

University of Edinburgh

Integration of GATE with the GROK/OpenNLP project - a library of NLP components including support for parsing and various pre-processing tasks.

Mission Abstraction

To develop a tool that can produce the synopsis of the text document given to it.

Internal R&D
Linguit Ltd, Edinburgh, UK

At Linguit Ltd., we use GATE for internal research purposes because it allows us to explore new ideas in the area of information extraction rapidly.

Madan Puraskar Pustakalaya, Nepal

Nepali language localization. Contact name: Laxmi Pd Khatiwada.

Building An Information Extraction System

The objective is to extract the required information in structured form from unstructured text. Four tasks contribute to this goal: named entity recognition, coreference, the template element task, and the scenario task.

***** Other Applications

University of Georgia

Analysis of case records of child protective service workers.

Email Summary
CI Secure, Toronto, Canada

We are developing an e-mail summarization service. We aim to capture the essence of an e-mail message and present it as a 1-2 line summary. This will be integrated with our current mail service and will help our clients save time. We are looking for partners experienced with GATE to co-develop this technology. Contact email: gate@ci-secure.com

Sindbad - Knowledge Generation
NetBreeze GmbH, Dübendorf, Switzerland

The generation of industry-specific knowledge from openly accessible internet-sources for direct integration into business processes.

Leads Generator by using IE

Crawls web pages, reads RSS news feeds and so on, and extracts corporate information such as contacts and industry events.


Aims to combine AI and IT experience within the NLP domain with R&D in knowledge management. In addition, it aims to develop and implement a special environment, Ontos WorkGroup, which can communicate with the OntosMiner server and view and edit the Cognitive Maps received, and, in general, to support the technology of ontology-driven content extraction for several domains.

Multi-Lingual Noun Phrase Extractor (NPE)
Universität Karlsruhe, Germany

JAPE-based. Currently supported languages are English, German, and French. It requires a part-of-speech tagger to work (it has been tested with the Hepple tagger for English and the TreeTagger for German and French). One particular feature is that it can use previously detected named entities (like the Person, Organization, ... found by ANNIE) to improve chunking performance.

Università degli Studi di Parma, Italy

SP2A is a thin framework enabling peer-to-peer based Grids. In practice, SP2A is a lightweight Java package allowing the development of service-oriented peers (SOPs). These SOPs can be used to form unstructured supernode networks (USNs), and exchange information about the services they host using P2P message routing algorithms. Each SOP is allowed to explore local services, to publish service advertisements remotely, and to search for remote services. Contact name: Michele Amoretti.


Mining a vita from the web. Contact: vamshi@andrew.cmu.edu