GATE Information Extraction
If information is power and riches, then it is not the amount that gives the value, but access at the right time and in the most suitable form.
Information Extraction (IE) systems analyse unrestricted text in order to extract information about pre-specified types of events, entities or relationships.
GATE has been used for many IE projects in many languages and problem domains, and has competed in the MUC and ACE evaluations. GATE has a built-in IE component set called ANNIE. Below is a short introduction to IE; for a longer introduction see this IE User Guide.
For more information about GATE and IE, contact the GATE team. See also the new edition of the Encyclopaedia of Language and Linguistics survey article on IE. Sheffield and others may be able to provide services to customise GATE to your needs.
(Note: chunks of these pages are derived from a previous version written by Malcolm Crawford.)
Information Extraction is not Information Retrieval
Information Extraction differs from traditional Information Retrieval techniques in that it does not return a subset of a document collection that is hopefully relevant to a query, based on keyword searching (perhaps augmented by a thesaurus). Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about pre-specified types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyse the data for trends, to give a natural language summary, or simply to serve for on-line access.
- Information Retrieval gets sets of relevant documents: you analyse the documents.
- Information Extraction gets facts out of documents: you analyse the facts.
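The contrast can be made concrete with a small sketch. The Python below is illustrative only and does not use GATE or ANNIE: the two-document collection, the function names `retrieve` and `extract`, and the single regular expression are all assumptions made for this example. Retrieval hands back document identifiers, while extraction hands back database-style records.

```python
import re

# Toy collection; the second document is invented for this illustration.
DOCS = {
    "doc1": "BNC Holdings Inc named Ms G Torretta as its new chairman.",
    "doc2": "Shares in several retailers fell sharply on Tuesday.",
}

def retrieve(query, docs):
    """Information Retrieval: return ids of documents containing every keyword."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower() for term in terms)]

# One hand-written pattern for one pre-specified event type (an appointment).
APPOINTMENT = re.compile(
    r"(?P<company>[A-Z][\w ]*? Inc) named "
    r"(?P<person>(?:Ms|Mr)\.? [A-Z][\w ]*?) as its new (?P<post>\w+)"
)

def extract(docs):
    """Information Extraction: return structured facts found in the documents."""
    facts = []
    for doc_id, text in docs.items():
        for match in APPOINTMENT.finditer(text):
            fact = match.groupdict()
            fact["source"] = doc_id  # keep a pointer back to the document
            facts.append(fact)
    return facts

print(retrieve("new chairman", DOCS))  # a list of documents to read
print(extract(DOCS))                   # a list of facts to analyse
```

With retrieval, the reader still has to open `doc1` to find out who was appointed; with extraction, the company, person and post are already separate fields that could be loaded into a database.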
Here are some example applications of IE.
Why is Information Extraction difficult?
There are many ways of expressing the same fact:
- BNC Holdings Inc named Ms G Torretta as its new chairman.
- Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
- Ms. Gina Torretta took the helm at BNC Holdings Inc.
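As a rough sketch of what this variation means for a pattern-based extractor, the Python below uses one hand-written regular expression per phrasing and normalises all three sentences to the same record. It is not how GATE or ANNIE work internally; the pattern list, the `surname` helper, and the decision to read "took the helm" as becoming chairman are assumptions made for illustration.

```python
import re

SENTENCES = [
    "BNC Holdings Inc named Ms G Torretta as its new chairman.",
    "Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.",
    "Ms. Gina Torretta took the helm at BNC Holdings Inc.",
]

# One pattern per way of expressing the fact; each new phrasing needs a new rule.
PATTERNS = [
    re.compile(r"^(?P<company>.+?) named (?P<person>.+?) as its new chairman\.$"),
    re.compile(r"^.+? was succeeded by (?P<person>.+?) as chairman of (?P<company>.+?)\.$"),
    # Assumption: "took the helm" is treated as "became chairman".
    re.compile(r"^(?P<person>.+?) took the helm at (?P<company>.+?)\.$"),
]

def surname(person):
    """Crude normalisation: 'Ms G Torretta' and 'Gina Torretta' both give 'Torretta'."""
    return person.split()[-1]

def extract_new_chairman(sentence):
    """Return a normalised record for the first matching pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return {"company": match.group("company"),
                    "chairman": surname(match.group("person"))}
    return None

records = [extract_new_chairman(s) for s in SENTENCES]
print(records)
print(all(r == records[0] for r in records))  # do all three yield the same record?
```

Three sentences already need three rules plus a name-normalisation step, and a fourth phrasing would be missed entirely, which is one reason the task is hard.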
Information may need to be combined across several sentences:
- After a long boardroom struggle, Mr Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms Torretta.
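A minimal sketch of combining information across sentences follows, again in plain Python rather than GATE. It assumes the text is already split into sentences and resolves "He" or "She" to the person in the most recent resignation, which is a deliberately naive stand-in for real coreference resolution; the pattern names and the record fields are hypothetical.

```python
import re

SENTENCES = [
    "After a long boardroom struggle, Mr Andrews stepped down as chairman of BNC Holdings Inc.",
    "He was succeeded by Ms Torretta.",
]

RESIGNATION = re.compile(
    r"(?P<person>(?:Mr|Ms)\.? [A-Z]\w+) stepped down as (?P<post>\w+) of (?P<company>.+?)\.$"
)
SUCCESSION = re.compile(
    r"^(?:He|She) was succeeded by (?P<person>(?:Mr|Ms)\.? [A-Z]\w+)\.$"
)

def extract_succession(sentences):
    """Merge a resignation and a later succession sentence into one record."""
    facts = []
    last_resignation = None  # context carried from earlier sentences
    for sentence in sentences:
        match = RESIGNATION.search(sentence)
        if match:
            last_resignation = match.groupdict()
            continue
        match = SUCCESSION.search(sentence)
        if match and last_resignation is not None:
            # Naive assumption: the pronoun refers to whoever resigned most recently.
            facts.append({
                "company": last_resignation["company"],
                "post": last_resignation["post"],
                "outgoing": last_resignation["person"],
                "incoming": match.group("person"),
            })
    return facts

print(extract_succession(SENTENCES))
```

Neither sentence alone contains the whole fact: the company and post come from the first, the successor from the second, and the link between them depends on resolving the pronoun.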
You might want to try an Information Extraction task yourself.