(This page is almost always out of date, as new projects start all the time.)
Projects with Sheffield involvement
|***** Digital Libraries: Corpus annotation and processing|
|Concise Preservation by combining Managed Forgetting and Contextualized Remembering|
|From archives to community memories.|
The National Archives
|Bringing semantic annotation to the UK government's web archive.|
|Semantic annotation and search with Linked Open Data, in the domain of environmental science.|
The project is building generic tools for linguistic annotation and Web-based analysis of literary Sumerian.
Building a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK.
Named entity recognition on 17th century Old Bailey Court reports.
|***** Digital Libraries: Multimedia|
The project's objective is to provide technical solutions and integrated systems for a complete digital preservation of all kinds of audio-visual collections. Our role is to develop language technology methods for (semi-)automatic creation of metadata from multimedia content.
Automatic creation of indexes into multimedia programme material, using data from several sources and several languages, in the domain of football.
Integration of knowledge acquisition, information extraction, image processing and speech recognition technologies in the domain of police crime reports.
Summarisation of information from company reports to generate statistics about the level of compliance with Health and Safety recommendations and legislation.
|***** E-science, Grid and Cloud Computing|
A collaborative project funded by the EC that aims to deliver an affordable, open marketplace for pay-as-you-go, cloud-based extraction resources and services, in multiple languages.
GATE Cloud Exploratory
A small exploratory project funded by JISC and the EPSRC to experiment with various aspects of cloud computing.
|A knowledge-helper for health information.|
An e-science bioinformatics project for biodiversity support.
Collaborative problem solving environments in Medical Informatics, using knowledge services provided by the e-Science grid infrastructure.
Extending the GRID framework of distributed computing by producing a virtual laboratory bench that will support the life sciences community and make use of complex distributed resources.
Building on E-Science technology to embed a full information cycle within practical clinical systems, building tools to integrate patient information from text and images, and linking clinical and genomic research.
|***** Language Technologies and Social Media Mining|
Innovative, portable open-source real-time methods for cross-lingual mining and summarisation of large-scale stream media.
Named entity recognition from diverse text types and genres.
An adaptive Information Extraction tool that uses GATE's open-source machine learning tools and allows users to train the system collaboratively by annotating a shared corpus in a Web browser.
Building empirically induced dialogue processors to support multilingual human-computer interaction.
Armadillo uses multiple strategies, including IE using GATE components, to model a domain by connecting various entities and components, and to build an RDF ontology and knowledge base.
LIRICS aims to provide ISO-ratified standards for language technology, with an open-source implementation platform and services, to enable the exchange and reuse of multilingual language resources in response to the needs of today's multilingual information and communication society.
CLARIN's mission is to create an infrastructure which makes language resources (annotated recordings, texts, lexica, ontologies) and technology (speech recognizers, lemmatizers, parsers, summarizers, information extractors) available and readily usable to scholars of all disciplines, in particular the humanities and social sciences (HSS).
|***** Semantic Web and Knowledge Technologies|
Builds new knowledge acquisition, retrieval, and publishing tools based on Language Engineering and using GATE.
The vision of SEKT is to develop and exploit the knowledge technologies which underlie Next Generation Knowledge Management. The SEKT strategy is built around the synergy of the complementary know-how in Ontology and Metadata Technology, Knowledge Discovery and Human Language Technology.
The transition of the Semantic Web from an academic adventure into a technology provided by the software industry is still a long way ahead. The main goal of KnowledgeWeb is to support this process. Our role is in providing expertise on the role of Human Language Technology in ontology-based applications.
An IST project developing a knowledge management platform with intelligence and insight capabilities for technology intensive industries. Sheffield provides the platform with a targeted search module to analyse the content of webpages and track interesting instances of concepts over time.
This informal project, which is part of AKT, produces composite descriptions of cultural artefacts and figures (e.g. Rembrandt) from diverse web pages, using a GATE-based Natural Language Generation system. ArtEquAkt is a collaboration between the Equator wearable computing project and the AKT Knowledge Technologies project.
This project aims at defining tools and methodologies for IE-based Knowledge Management, focusing on adaptive IE using Machine Learning, and will be developed using GATE components.
An ontology building environment incorporating adaptive IE using GATE components.
An ontology building environment incorporating adaptive IE using GATE components.
The project's goal is to show how existing 'legacy' applications can migrate to open, semantic-based Service-Oriented Architectures at acceptable development cost. Sheffield's main role will be on automatic methods for content augmentation and integration.
NeOn aims to provide a considerable improvement in the level of support available for ontology engineering by developing both a reference architecture and a concrete toolkit supporting the whole ontology engineering life cycle. The University of Sheffield will develop an open-source demonstrator of collaborative semantic annotation with networked ontologies.
Current Semantic Web reasoning systems do not scale to the requirements of their hottest applications, such as dealing with terabytes of scientific data. LarKC aims to remove these scalability barriers by using massive distributed incomplete reasoning. The University of Sheffield's main roles are to provide methods for retrieval and selection of data for reasoning, and to demonstrate the platform in support of carcinogenesis research.
The project's main goal is to automate to a large degree the detection and tracking of media campaigns on television, on the Internet and in the press. MediaCampaign's scope is discovering, inter-relating and navigating cross-media campaign knowledge. The University of Sheffield will be involved in text analysis (press, Internet, speech transcripts), product knowledge interlinking, unification of two or more (partial) descriptions of an instance, and knowledge fusion for campaign discovery.
MUSING will integrate Semantic Web and Human Language technologies and combine declarative rule-based methods and statistical approaches for enhancing the technological foundations of knowledge acquisition and reasoning in BI applications.
Service-Finder aims to develop a platform for service discovery in which Web Services are embedded in a Web 2.0 environment.
Projects Outside Sheffield
|***** Knowledge Management and Semantic Web|
ANNIE-powered Semantic Web annotation as part of their Knowledge and Information Management (KIM) platform.
The goal of MeManage is to relate personal information on your computer and on the computers of your peers to one another. It combines some ideas from the Semantic Web with the ease and simplicity of weblogs and wikis, plus a little social software.
The availability of semantic markup on the web opens the way to novel, sophisticated forms of question answering. While semantic information can be used in several different ways to improve question answering, an important consequence of the availability of semantic markup on the web is that it can be queried directly. AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input and returns answers drawn from one or more knowledge bases (KBs), which instantiate the input ontology with domain-specific information. AquaLog presents an elegant solution in which different strategies are combined. It makes use of the GATE NLP platform as part of the linguistic process, string metrics algorithms, and a learning mechanism, as a solution to manage lexical resources, including domain-dependent lexica and generic resources such as WordNet. AquaLog also makes use of a novel ontology-based relation similarity service to make sense of user queries with respect to the target knowledge base. Contact email: email@example.com
The automation of customized retrieval of medical transcriptions.
We analyse the requirements for advanced semantic support of users (building historians and architects) of a multi-volume encyclopedia of architecture from the late 19th century. Novel requirements include the integration of content retrieval, content development, and automated content analysis based on natural language processing. We present a system architecture for the detected requirements and its current implementation. A complex scenario demonstrates how a desktop supporting semantic analysis can contribute to specific, relevant user tasks. Email: firstname.lastname@example.org
|***** Digital Libraries and Cultural Heritage|
Greenstone is a suite of software for building and
distributing digital library collections. It provides a new way
of organizing information and publishing it on the Internet or
on CD-ROM. Greenstone is produced by the New Zealand Digital
Library Project at the University of Waikato, and developed and
distributed in cooperation with UNESCO and the Human Info NGO.
It is open-source, multilingual software, issued under the terms
of the GNU General Public License.
Greenstone uses GATE and ANNIE to enhance digital collections by addition of metadata.
The European Heritage On-Line (ECHO) project is developing a model for European culture on the web. The GATE team is represented on the technical board of ECHO and is working towards the transfer of advanced text processing tools to help produce a new model of richly interlinked shared cultural materials.
The Perseus digital library, one of the largest and most advanced such projects in the world, uses GATE for corpus annotation and language processing.
|***** E-science and bioinformatics|
Information Extraction on a Linux cluster for bio-medical text mining and indexing.
|Annotating surgical pathology reports using UMLS.|
Analysing MEDLINE abstracts to extract causal functional relations, which are essential for the construction of genetic networks, as a step towards characterisation of diseases.
BioRAT is a general-purpose information extraction tool, designed to be used by biologists to mine text from journals. It has been successfully applied to protein-protein interaction discovery, and projects are underway in several other areas. It uses GATE at its core, while also providing tools to design new templates; to edit gazetteers; and to download full-length papers from the web. The software is available for academic use, and is part of a research project, funded by the BBSRC.
The information extraction for structural biology project aims to extract information from the 'material and method' part of structural biology publications. The purpose is to populate a database. Some of the information for the database is retrieved from structured files named PDB (see http://www.rcsb.org/pdb/). The material and method used for experiments are not in PDB files, so we intend to extract that information from the text of the publications. Contact email: email@example.com.
Visualization of Consumer Health Information
Research indicates that the text on many popular web sites is difficult to understand and consumers find that reading documents in electronic format is problematic. Since health information read online influences the patient-doctor relationship - e.g., treatments requested, or perceived patient value from a doctor's visit - it is important that this information be interpreted and remembered as completely and correctly as possible. Misunderstandings in health information may increase the risk of making unwise health decisions, which could lead to poorer health and higher health care costs. The goal of the project is to develop and test new technology that can present online health information that is easier to understand and remember. Prototypes will be developed that will visualize both the structure and content of web pages to increase understanding and retention without oversimplification. A small pilot study has shown positive effects of such a representation. The two prototypes will differ in how much content detail is included in the visualization. They will be evaluated for their effects on understanding and retention of information and compared with currently existing web sites. User behavior and preferences will also be captured and analyzed. Three user groups will participate in the development and evaluation of the prototypes: elderly consumers, Hispanic non-native speakers, and patients. These groups were chosen for their specific characteristics (age related problems, sub-optimal command of English, and patients' stress) that may require improved information presentation. Contact: Gondy Leroy
eHealth GATEway project
Health Information Text Extraction (HiTEx) tool based on GATE - a modular system that assembles a different pipeline for extracting specific findings from clinical narratives. Funded by National Library of Medicine, National Institutes of Health, USA. Project website. Paper: What can Natural Language Processing do for Clinical Decision Support? Contact: Dina Demner-Fushman
|***** Human Language Technology|
Converting text into pieces of evidence as part of an R&D project for the US government.
IT Laboratory, Japan
Development of IE and other language tools for in-car navigation systems and automobile-related speech and language technology.
Using Information Extraction (IE) to enhance the support for text in database technology.
Building a summarisation system based on discourse structure.
A Johns Hopkins 2003 summer workshop at the Center for Language and Speech Processing on learning-based semantic annotation to reduce data sparseness in diverse corpora.
Building a summarisation system entered in the Document Understanding Conference (DUC 2002) evaluation.
Named Entity recognition for Machine Translation
|Work on improving MT systems using NE recognition in GATE.|
Integration of GATE with the GROK/OpenNLP project - a library of NLP components including support for parsing and various pre-processing tasks.
To develop a tool that can produce a synopsis of a given text document.
At Linguit Ltd., we use GATE for internal research purposes because it allows us to explore new ideas in the area of information extraction rapidly.
Nepali language localization. Contact name: Laxmi Pd Khatiwada.
|Building An Information Extraction System|
The objective is to extract the required information in structured form from unstructured text. Four tasks contribute to this main task: named entity recognition, coreference, the template element task, and the scenario task.
|***** Other Applications|
University of Georgia
Analysis of case records of child protective service workers.
We are developing an e-mail summarization service. We would like to capture the essence of an e-mail message and present it as a one- or two-line summary. This will be integrated with our current mail service, and will help our clients save time. We are looking for partners who are experienced with GATE to co-develop this technology. Contact email: firstname.lastname@example.org
Sindbad - Knowledge
The generation of industry-specific knowledge from openly accessible internet-sources for direct integration into business processes.
Leads Generator by using IE
Crawls web pages, reads RSS news feeds and so on, and extracts information about corporations, such as contacts and industry events.
Aims to combine AI and IT experience within the NLP domain with R&D in knowledge management. In addition, it aims to develop and implement a special environment, Ontos WorkGroup, which communicates with the OntosMiner server and supports viewing and editing of the Cognitive Maps received, and, in general, to support the technology of ontology-driven content extraction for several domains.
Multi-Lingual Noun Phrase Extractor (NPE)
JAPE-based. Currently supported languages are English, German, and French. It requires a part-of-speech tagger to work (it has been tested with the Hepple tagger for English and the TreeTagger for German and French). One particular feature is that it can use previously detected named entities (like the Person, Organization, ... found by ANNIE) to improve chunking performance.
SP2A is a thin framework enabling peer-to-peer based Grids. In practice, SP2A is a lightweight Java package allowing the development of service-oriented peers (SOPs). These SOPs can be used to form unstructured supernode networks (USNs), and exchange information about the services they host using P2P message routing algorithms. Each SOP is allowed to explore local services, to publish service advertisements remotely, and to search for remote services. Contact name: Michele Amoretti.
Mining a vita from the web. Contact: email@example.com