Mining Profiles and Definitions with Natural Language Processing

Horacio Saggion
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street
Sheffield S1 4DP, England, United Kingdom
Tel: +44-114-222-1947
Fax: +44-114-222-1810
saggion@dcs.shef.ac.uk

ABSTRACT

Free text is a main repository of human knowledge; methods and techniques for accessing this unstructured source of knowledge are therefore of paramount importance. In this chapter we describe natural language processing technology for the development of question answering and text summarization systems. We focus on applications that mine textual resources to extract knowledge for the automatic creation of definitions and person profiles.

KEYWORDS

Web Resources; Information Capturing; Electronic Texts; Electronic Resources; Information Access; Business Applications; Digital Document; Natural Languages; Natural Language Processors; Data Extraction; Knowledge Discovery; Information Filtering; Text Processing Software; Data Mining Algorithms; Data Mining

INTRODUCTION

Extracting relevant information about people, companies, organisations, locations, and common terms from massive amounts of free text in order to create definitions or profiles is a very challenging problem. It is difficult not only because it is hard to elucidate precisely what type of information about these entities is relevant for a definition or profile, but also because, even when some types of information are known to be relevant, there are many ways of expressing them in natural language texts. As free text is by far the main repository of human knowledge, solutions to the problem of extracting definitional information have many applications in areas of knowledge management and intelligence:

* In intelligence analysis activities, there is a need to access personal information in order to create briefings for meetings, and to track the activities of individuals in time and space;
* In journalism, broadcasting, and news reporting activities, there is a need to find relevant information for writing backgrounds for the main actors of a breaking news story; term definitions also need to be provided to non-specialist audiences (e.g., What is bird flu?);
* In publishing, encyclopaedias and dictionaries need to be updated with new information about people and other entities found in text repositories;
* In knowledge engineering, ontologies and other knowledge repositories need to be populated with instances such as persons and their attributes extracted from text; new terms together with their definitions also need to be identified in texts in order to make informed decisions about their inclusion in these knowledge repositories;
* In business intelligence, information about companies and their key officers is of great relevance for decision-making processes, such as whether or not to extend credit to a company given the profile of a key company director.

Recent natural language processing challenges such as the Document Understanding Conferences (DUC) (http://www-nlpir.nist.gov/projects/duc/) and the Text Retrieval Conference Question Answering track (TREC/QA) evaluations (http://trec.nist.gov/data/qa.html) have focused on this particular problem and are creating useful language resources for studying the problem and measuring technical advances.
For example, in task 5 of DUC 2004, participants had to create summaries from sets of documents answering the question Who is X?, and from 2003 onwards the TREC/QA evaluations have included a task which consists of finding relevant information about a person, an organisation, an event, or a common term in a massive text repository (e.g., What is X?). In the Natural Language Processing Group at the University of Sheffield we have been working on these problems for many years, and we have developed effective tools to address them using GATE, the General Architecture for Text Engineering.

The main purpose of this chapter is to study the problem of mining textual sources in order to find definitions, profiles, and biographies. The chapter first provides an overview of generic techniques in natural language processing and then presents two case studies of the use of natural language technology in DUC and TREC/QA.

NATURAL LANGUAGE PROCESSING TOOLS

The General Architecture for Text Engineering (GATE) is a framework for the development and deployment of large-scale language processing technology (Cunningham, Maynard, Bontcheva, & Tablan, 2002). It provides three types of resources: Language Resources (LRs), which collectively refer to data; Processing Resources (PRs), which refer to algorithms; and Visualisation Resources (VRs), which represent visualisation and editing components. GATE can be used to process documents in different formats including plain text, HTML, XML, RTF, and SGML. When a document is loaded or opened in GATE, a document structure analyser is invoked which is in charge of creating a GATE document: an LR which contains the text of the original document and one or more sets of annotations, one of which holds the document markup (for example, HTML tags). Annotations are generally created and updated by PRs during text analysis, but they can also be created manually during annotation editing in the GATE GUI (shown in Figure 1). Each annotation belongs to an annotation set and has a type, a pair of offsets (delimiting the span of text being annotated), and a set of features and values that encode the information. Features (or attribute names) are strings, and values can be any Java object. Attributes and values can be specified in an annotation schema, which facilitates validation and input during manual annotation. Programmatic access to annotation sets, annotations, features, and values is possible not only through the GATE Application Program Interface but also through the JAPE language (see below).

Figure 1: GATE Graphical User Interface. A document has been annotated with semantic information.
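To make the annotation model concrete, the following minimal GATE Embedded sketch creates a document and adds an annotation programmatically. This is an illustrative sketch only: it assumes a configured GATE installation with gate.jar on the classpath, and the example text, annotation type, and feature values are ours, not part of GATE.

import gate.*;

public class AnnotationExample {
  public static void main(String[] args) throws Exception {
    Gate.init(); // initialise the GATE library (assumes GATE home is configured)

    // Create a GATE document (a Language Resource) from a string
    Document doc = Factory.newDocument("Aaron Copland was born in Brooklyn.");

    // Annotations live in annotation sets; features are attribute/value pairs
    AnnotationSet annots = doc.getAnnotations();
    FeatureMap features = Factory.newFeatureMap();
    features.put("majorType", "person"); // illustrative feature, as in Lookup annotations

    // Annotate the span "Aaron Copland" (offsets 0 to 13) with type "Person"
    annots.add(0L, 13L, "Person", features);
  }
}

The pair of offsets delimits the annotated span in the document content, exactly as described above for annotations produced by Processing Resources.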
Text Processing Tools

After documents are loaded into GATE, one of the first steps in text processing is document tokenisation: segmenting the text of the document into units representing words, punctuation, and other elements. Two kinds of annotation are produced: "Token", for words, numbers, symbols, and punctuation, and "SpaceToken", for spaces and control characters. Features computed during this process are the kind of token (word, punctuation, number, space, control character, etc.), its length, and its orthographic characteristics (all capitals, all lowercase, capital initial, etc.). The tokenisation process can be modified by changing the tokenisation rules of the system.

One important step after tokenisation is sentence identification: the segmentation of the text into sentences. In GATE this is implemented as a cascade of finite state transducers created from a grammar file which can be customised. The process is language and domain dependent and makes use of the annotations produced by tokenisation (i.e., the presence of punctuation marks, control characters, and abbreviations in the input document). This process produces a "Sentence" annotation type, and a default feature indicates whether or not the sentence is a quoted expression.

Text processing usually also requires part-of-speech (POS) tagging: the process of associating with each word form or symbol a tag representing its part of speech. In GATE, it is implemented as a modified version of the Brill tagger (Brill, 1995). The process is language dependent and relies (as the Brill tagger does) on two resources, a lexicon and a set of transformation rules, which are trained over corpora. The default POS tagger in GATE is already trained for the English language. This process does not produce any new annotation type, but enriches the "Token" annotation with a feature category which indicates the part of speech of the word. For example, in the sentence "The company acquired the building for £2M", the words "company" and "building" would be tagged as nouns, and the word "acquired" would be tagged as a verb. Note that the process has to resolve the inherent ambiguities of the language ("building" can be both a noun and a verb).

One way of obtaining canonical forms for each word in the document is to apply a lemmatiser, which analyses each word into its constituent parts and identifies the word root and affixes. This GATE process enriches the token annotation with two features: root (for the word root) and affix (for the word ending).
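As an illustration of how later components consume these features, the following sketch (our own example, assuming the document has already been processed by the tokeniser, POS tagger, and lemmatiser, and that the tagger emits Penn Treebank style tags) collects the roots of all nouns in a document:

import gate.*;
import java.util.ArrayList;
import java.util.List;

public class NounRoots {
  // Collect the lemmas of all nouns from a processed GATE document
  public static List<String> nounRoots(Document doc) {
    List<String> roots = new ArrayList<>();
    for (Annotation token : doc.getAnnotations().get("Token")) {
      String category = (String) token.getFeatures().get("category");
      // noun tags (NN, NNS, NNP, NNPS) all start with "NN"
      if (category != null && category.startsWith("NN")) {
        roots.add((String) token.getFeatures().get("root"));
      }
    }
    return roots;
  }
}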
Semantic Annotation

A process of semantic annotation consists of recognising and classifying a set of entities in the document, a task commonly referred to as named entity (NE) recognition. NE recognition is a key enabler of information extraction: the identification and extraction of key facts from text in specific domains. Today NE recognition is a mature technology which achieves precision and recall above 90% on newswire texts, where the entities of interest are, for example, people, locations, times, and organizations. Much research on NE recognition was carried out in the context of the US-sponsored Message Understanding Conferences (MUC), held from 1987 until 1997 for the research and development of information extraction systems. The ACE programme was an extension of MUC in which the NE recognition task became more complex: it was replaced by an entity detection and tracking task which involved, in addition to recognition, the identification of all mentions of a given entity (Maynard, Bontcheva, & Cunningham, 2003). Other international efforts in NE recognition include the Conference on Computational Natural Language Learning (http://www.cnts.ua.ac.be/conll2003) and the HAREM evaluation for the Portuguese language (http://poloxldb.linguateca.pt/harem.php).

In GATE, NE recognition is carried out with two PRs: a gazetteer lookup process and a pattern matching and annotation process. The goal of the gazetteer lookup module is to identify key words related to particular entity types in a particular domain. This step is particularly useful because certain words can be grouped into classes and subclasses, and this information allows the grammars to be semantically motivated and flexible. Gazetteer lists are plain text files with one entry per line; each list contains words or word sequences representing keywords or terms that form part of the domain knowledge. An index file tells the system which lists to use in a particular application. In order to classify terms into categories, a major type is specified for each list and, optionally, a minor type, using the following format:

Terms.lst : Class : SubClass

where Terms.lst is the name of the list, and Class and SubClass are strings. The index file is used to create an automaton (which, for efficiency, operates on strings instead of on annotations) that recognizes and classifies the keywords. When the finite state automaton matches a string against a term belonging to Terms.lst, a Lookup annotation is produced that spans the matched sequence; the features added to the annotation are majorType with value Class and minorType with value SubClass (when the minor type is omitted, no minorType feature is produced). Grammars then use the information produced by the lookup process in their rules to produce meaningful annotations.
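For illustration, a hypothetical index file entry and gazetteer list (the file names and entries below are invented for this example) might look as follows:

city.lst : location : city

where city.lst contains one entry per line:

Sheffield
Shanghai
New York

With these resources, matching the string Shanghai in a document would produce a Lookup annotation with the feature majorType set to location and the feature minorType set to city.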
In order to identify and classify sequences of tokens in the source document we rely on the Java Annotation Pattern Engine (JAPE), a pattern-matching engine implemented in Java. JAPE uses a compiler that translates grammar rules into Java objects that target the GATE API and a library of regular expressions. JAPE can be used to develop cascades of finite state transducers. A JAPE grammar consists of a set of rules with the following format:

Rule: RuleName
Priority: Integer
LHS --> RHS

The priority is used to control which rule fires when several rules match at the same text offset. The left-hand side (LHS) is a regular expression over annotations; the right-hand side (RHS) either describes the annotation to be assigned to the piece of text matching the LHS or contains Java code to be executed when the regular expression is found in the text. The LHS is specified in terms of annotations already produced by any previous processing stage, including JAPE semantic tagging. The elements used to specify the pattern are groups of constraints expressed as annotation types, or as feature values of annotation types, in the following way:

({AnnotationType.AttributeName == Value, ...})
({AnnotationType})

These elements can be combined into regular expressions using the regular operators |, *, ?, + and the sequence constructor. For example, the following is a valid pattern that will match a numeric token:

({Token.kind == number})

and the following pattern over a sequence of Tokens:

({Token.string == "$"}) ({Token.kind == number})

matches a dollar sign followed by a number. Labels can be associated with these constraint elements in the following way:

(Constraint):label

Such labels are used in the RHS to specify how to annotate the text span(s) matching the pattern. The following syntax is used to specify the annotation to be produced:

:label.AnnotationType = {feature=value, ...}

If Java code is used in the RHS of a rule, the labels associated with constraints in the LHS can be referenced, making it possible to perform operations on the annotations matched by the rule. A grammar can also be a set of sub-grammars (called phases) from which a finite state transducer is created. The phases run sequentially and constitute a cascade of finite state transducers over annotations. GATE comes with a full information extraction system called ANNIE which can be used to detect and extract named entities of different types from English documents.

Coreference Resolution

Coreference resolution, the identification of the referent of anaphoric expressions such as pronouns or definite expressions, is of major importance for natural language applications. It is particularly important for identifying information about people as well as other types of entities. The MUC evaluations also helped fuel research in this area. The current trend in coreference resolution systems is the use of knowledge-poor techniques and corpus-informed methods (Mitkov, 1999). For example, current systems use gender, number, and some semantic information, together with heuristics for restricting the number of candidates to examine. In GATE, two processes enable the identification of coreference in text: an orthographic name matcher and a pronominal coreference algorithm (Dimitrov, Bontcheva, Cunningham, & Maynard, 2004). The orthographic name matcher associates names in text based on a set of rules; typically, the full name of a person is associated with a condensed version of the same name (e.g., R. Rubin and Robert Rubin). The pronominal coreferencer used in this work applies simple heuristics identified from the analysis of a corpus of newspaper articles and broadcast news, and so it is well adapted to our task. The method assigns salience values to candidate antecedents (in a three-sentence window) based on the rules induced from the corpus analysis, and then chooses as the antecedent of an anaphoric expression the candidate with the best value.

Syntactic and Semantic Analysis

Syntactic and semantic analysis are carried out with SUPPLE, a freely available, general purpose parser that produces both syntactic and semantic representations for input sentences (Gaizauskas, Hepple, Saggion, Greenwood, & Humphreys, 2005). The parser is implemented in Prolog, and a Java wrapper acts as a bridge between GATE and the parser, providing the input required by SUPPLE and reading the syntactic and semantic information back into the documents. Access to SUPPLE functionality in GATE is provided through a plug-in. The syntactic output of the parser consists of an analysis of the sentence according to a general purpose grammar of the English language distributed with the parser. The grammar is an attribute-valued context-free grammar which makes possible the treatment of long-distance syntactic phenomena that cannot be dealt with in a regular formalism such as JAPE. The semantic output of the parser is a quasi-logical form (QLF): a set of first order terms constructed from the interpretation of each syntactic rule applied in the final parse; each rule specifies how its semantics should be constructed from the semantics of the constituents mentioned in the syntactic rule. SUPPLE attempts to create a full analysis and representation of the input sentence. This is not always possible, in which case it does not fail; instead it provides a partial syntactic and semantic analysis of the input, which can be completed by other components such as a discourse interpreter (see Figure 2). As an example, consider the following sentence from a profile: Born in Shanghai, China, he was educated at Cambridge.

Figure 2: Parse tree obtained from SUPPLE.
This sentence is analysed as two chunks of information, "Born in Shanghai, China" and "he was educated at Cambridge". The analysis of the first fragment is as follows:

( nfvp ( vp ( vpcore ( nfvpcore ( av ( v "Born" ) ) ) ) ( pp ( in "in" ) ( np ( bnp ( bnp_core ( premods ( premod ( ne_np ( sem_cat "Shanghai" ) ) ) ) ( bnp_head ( ne_np ( sem_cat "China" ) ) ) ) ) ) ) ) )

The analysis of the sentence is correct: the phrase "Born..." is interpreted as a non-finite verb phrase, because of the verb "Born", and the parser correctly attaches the prepositional phrase "in Shanghai..." to the verb "Born". The names "China" and "Shanghai" have been interpreted as named entities and appear in the parse as "sem_cat", which is used to wrap named entities during parsing. The semantic interpretation of the fragment produces the following QLF:

bear(e1), time(e1,none), aspect(e1,simple), voice(e1,passive), in(e1,e2), name(e2,'China'), location(e2), country(e2), name(e3,'Shanghai'), location(e3), city(e3), realisation(e3,offsets(9,17)), qual(e2,e3), realisation(e2,offsets(9,24)), realisation(e1,offsets(1,24))

where the verb "Born" has been mapped into the unary predicate bear(e1), and the named entities "China" and "Shanghai" are represented as country(e2) and city(e3) respectively, linked together by the predicate qual(e2,e3), for qualifier. Finally, the predicate in(e1,e2) represents the attachment of the prepositional phrase (headed by the entity "China") to the main verb "Born". The constants en (e1, e2, etc.) represent entities and events in the text. Other predicates represent information such as aspectual information from the verb, the name of each named entity, and where in the text (offsets) the particular entities are realised. The analysis of the second fragment is:

( s ( np ( bnp ( pps "he" ) ) ) ( fvp ( vp ( vpcore ( fvpcore ( nonmodal_vpcore ( nonmodal_vpcore1 ( vpcore1 ( v "was" ) ( av ( v "educated" ) ) ) ) ) ) ) ( pp ( in "at" ) ( np ( bnp ( bnp_core ( bnp_head ( ne_np ( sem_cat "Cambridge" ) ) ) ) ) ) ) ) ) )

with semantic representation:

pronoun(e5,he), realisation(e5,offsets(26,28)), educate(e4), time(e4,past), aspect(e4,simple), voice(e4,passive), at(e4,e6), name(e6,'Cambridge'), location(e6), city(e6), realisation(e6,offsets(45,54)), realisation(e4,offsets(29,54)), lobj(e4,e5)

Syntactic and semantic information is particularly relevant for information extraction and question answering. Suppose we wanted to extract information about family relations from text: given a sentence such as "David is married to Victoria", we would like to extract the semantic relation is_spouse_of and state that "David" is_spouse_of "Victoria" and "Victoria" is_spouse_of "David". Because "David" and "Victoria" are respectively the logical subject and the logical object of the verb "to marry" in the sentence, they can be used to extract this particular type of family relation. Note that the same relation should be extracted from a sentence such as "David is Victoria's husband", where again syntactic information is very handy. The same applies to question answering: relations extracted from text can be used as evidence for preferring one answer over a set of possible answer candidates. Syntactic and semantic information has also been shown to play a significant role in machine learning approaches to relation extraction (Wang, Li, Bontcheva, Cunningham, & Wang, 2006).
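To make the idea concrete, here is a toy sketch showing how is_spouse_of pairs could be read off QLF-style predicates via logical subject (lsubj) and logical object (lobj) links. It is entirely illustrative: the predicate encoding and the helper types are our own simplification, not SUPPLE's actual output format or API.

import java.util.*;

public class SpouseExtractor {
  // A QLF-style term: a predicate name plus its arguments, e.g. lsubj(e1,e2)
  record Pred(String name, List<String> args) {}

  // Extract is_spouse_of pairs from the predicates of a parsed sentence
  static List<String[]> spouses(List<Pred> qlf) {
    Map<String, String> names = new HashMap<>(); // entity constant -> surface name
    Set<String> marryEvents = new HashSet<>();   // events introduced by "marry"
    Map<String, String> subj = new HashMap<>(), obj = new HashMap<>();
    for (Pred p : qlf) {
      switch (p.name()) {
        case "name"  -> names.put(p.args().get(0), p.args().get(1));
        case "marry" -> marryEvents.add(p.args().get(0));
        case "lsubj" -> subj.put(p.args().get(0), p.args().get(1));
        case "lobj"  -> obj.put(p.args().get(0), p.args().get(1));
      }
    }
    List<String[]> out = new ArrayList<>();
    for (String e : marryEvents) {
      if (subj.containsKey(e) && obj.containsKey(e)) {
        String a = names.get(subj.get(e)), b = names.get(obj.get(e));
        out.add(new String[] { a, b }); // a is_spouse_of b
        out.add(new String[] { b, a }); // the relation is symmetric
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // "David is married to Victoria": logical subject David, logical object Victoria
    List<Pred> qlf = List.of(
        new Pred("marry", List.of("e1")),
        new Pred("lsubj", List.of("e1", "e2")),
        new Pred("lobj", List.of("e1", "e3")),
        new Pred("name", List.of("e2", "David")),
        new Pred("name", List.of("e3", "Victoria")));
    for (String[] pair : spouses(qlf))
      System.out.println(pair[0] + " is_spouse_of " + pair[1]);
  }
}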
Summarization Toolkit

A number of domain-independent, general purpose summarization components are available in GATE through a plug-in. The components are designed to create sentence extracts. Following the GATE philosophy, the objective is not to provide the best possible summarization system, but an adaptable tool for the development, testing, and deployment of customisable summarization solutions (Saggion, 2002). The core of the toolkit is a set of summarization modules which compute numeric features for each sentence in the input document; the value of a feature indicates how relevant the information in the sentence is with respect to that feature. The computed values, normalised to yield numbers in the interval [0..1], are combined in a linear formula to obtain a score for each sentence which is used as the basis for sentence selection. Sentences are ranked by score, and the top ranked sentences are selected to produce an extract. Many features implemented in this tool have been suggested in past research as valuable for the task of identifying sentences for summaries (see, for example, Mani (2001)). An example summary obtained with the tool can be seen in Figure 3.

Figure 3: Single document summary.

A corpus statistics module computes token statistics, including term frequency: the number of times each term occurs in the document (tf). The vector space model has been implemented and is used to create vector representations of different text fragments, usually sentences but also the full document. For each term occurring in a text fragment, its vector contains the value tf*idf (term frequency * inverted document frequency). The inverted document frequency of a given term is derived from the number of documents in a collection that contain the term: the fewer the documents containing it, the higher its idf. These values can be loaded into the system from a table or computed on the fly by the summarization tool; in the latter case the values can then be saved for future use. The term frequency module computes the sum of the tf*idf values of all terms in each sentence; note that because frequent terms such as "the" have an idf value close to zero, their contribution to the term frequency feature is minimal. These values are normalised to yield numbers between 0 and 1. In a similar way, a named entity scorer module computes the frequency of each named entity in the sentence. This process is based not on the frequency of named entities in a corpus but on their frequency in the input document; a named entity occurring less frequently is more valuable than one observed across many sentences. A content analysis module computes the similarity between two text fragments of the document represented in the vector space, for example between a sentence and the title of the document, or between a sentence and the full document. The measure of similarity is the cosine of the angle between the two vectors. These values can be stored as sentence features and used in the scoring formula. The sentence position module computes two features for each sentence: the absolute position of the sentence in the document and the relative position of the sentence in the paragraph. The absolute position of sentence i receives the value 1/i, so that earlier sentences score higher, while the paragraph feature receives a value which depends on the sentence being in the beginning, middle, or end of the paragraph; these values are parameters of the system.
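As a rough illustration of this scoring scheme, consider the following minimal sketch. It is not the toolkit's code: the feature weights, the toy idf table, and the normalisation by the maximum value are our own choices.

import java.util.*;

public class SentenceScorer {
  // Sum of tf*idf over the terms of each sentence, normalised to [0..1]
  static double[] termFrequencyFeature(List<List<String>> sentences,
                                       Map<String, Double> idf) {
    Map<String, Integer> tf = new HashMap<>(); // term frequencies in the document
    for (List<String> s : sentences)
      for (String t : s) tf.merge(t, 1, Integer::sum);
    double[] raw = new double[sentences.size()];
    for (int i = 0; i < sentences.size(); i++)
      for (String t : sentences.get(i))
        raw[i] += tf.get(t) * idf.getOrDefault(t, 0.0);
    double max = Arrays.stream(raw).max().orElse(1.0);
    if (max > 0) for (int i = 0; i < raw.length; i++) raw[i] /= max;
    return raw;
  }

  // Linear combination of normalised features; the weights are system parameters
  static double score(double wTf, double tfIdf, double wPos, double position) {
    return wTf * tfIdf + wPos * position;
  }

  public static void main(String[] args) {
    List<List<String>> sents = List.of(
        List.of("copland", "composed", "appalachian", "spring"),
        List.of("the", "concert", "was", "in", "new", "york"));
    Map<String, Double> idf = Map.of("copland", 3.2, "appalachian", 4.0,
        "spring", 1.5, "concert", 2.0, "the", 0.01);
    double[] tfidf = termFrequencyFeature(sents, idf);
    for (int i = 0; i < sents.size(); i++)
      // position feature 1/(i+1): earlier sentences receive higher values
      System.out.printf("sentence %d scores %.3f%n", i,
          score(0.7, tfidf[i], 0.3, 1.0 / (i + 1)));
  }
}

Sentences would then be ranked by this score and the top ranked ones selected until the desired extract length is reached.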
For a cluster of related documents, the system computes the centroid of the set of document vectors in the cluster. The centroid is a vector of terms and values which lies at the centre of the cluster: the value of each term in the centroid is the average of the values of that term in the vectors created for each document. The similarity of each sentence in the cluster to the centroid is also computed using the cosine metric. This value is stored as a sentence feature and used during sentence scoring in multi-document summarization tasks (Saggion & Gaizauskas, 2004b). A multi-document summary is presented in Figure 4.

Figure 4: Centroid-based multi-document summary.

In order to support redundancy detection, resources for computing n-grams are also available: a redundancy detection metric has been implemented which computes how close two text fragments are according to the proportion of n-grams they share. Summarization is a very important topic in natural language processing, and it would be impossible to describe all existing approaches and tools here; probably closest to our approach is the MEAD toolkit (Radev et al., 2004), which provides methods to compute features such as position and centroid similarity and to combine them with appropriate weights.
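The n-gram redundancy metric mentioned above can be sketched as follows (an illustrative implementation assuming word n-grams and a symmetric overlap proportion; the exact formula used in the toolkit may differ):

import java.util.*;

public class NgramOverlap {
  // The set of word n-grams of a token sequence
  static Set<String> ngrams(List<String> tokens, int n) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + n <= tokens.size(); i++)
      grams.add(String.join(" ", tokens.subList(i, i + n)));
    return grams;
  }

  // Proportion of shared n-grams between two fragments: |A and B| / |A or B|
  static double overlap(List<String> a, List<String> b, int n) {
    Set<String> A = ngrams(a, n), B = ngrams(b, n);
    Set<String> inter = new HashSet<>(A);
    inter.retainAll(B);
    Set<String> union = new HashSet<>(A);
    union.addAll(B);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }
}

Two fragments whose overlap exceeds a threshold would then be treated as reporting the same information.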
CASE STUDIES

GATE and the summarization components have been used to address two problems identified by the natural language processing community: definitional question answering and profile-based text summarization. Here we give an overview of how the tools described above have been used to create practical and effective solutions.

Definitional Question Answering

The problem of finding definitions in vast text collections is related to the TREC QA definition subtask: given a huge text collection like AQUAINT (over 1 million texts from the New York Times, the AP newswire, and the English portion of the Xinhua newswire, totalling about 3.2 gigabytes of data) and a definition question like What is Goth? or Who is Aaron Copland?, an automatic system has to find text fragments that convey essential and non-essential characteristics of the main question term (e.g., Goth or Aaron Copland). This is a challenging problem, not only because of the many ways in which definitions can be conveyed in natural language texts, but also because the definiendum (i.e., the thing to be defined) does not, on its own, have enough discriminative power to allow the selection of definition-bearing passages from the collection.

In the TREC 2003 QA definition subtask evaluation, participants used various techniques similar to those we present here. Top ranked groups report the use of some form of lexical resource like WordNet, of the Web for answer redundancy, of patterns for definition identification, and of sophisticated linguistic tools (Kouylekov, Magnini, Negri, & Tanev, 2003; Harabagiu, Moldovan, Clark, Bowden, Williams, & Bensley, 2003). BBN's definitional system (Xu, Licuanan, & Weischedel, 2003), which obtained the best performance in TREC QA 2003, relies on the identification, extraction, and ranking of kernel facts about the question target (i.e., the definiendum), followed by a redundancy removal step. The system uses sophisticated linguistic analysis components such as parsing and coreference resolution. First, sentences containing the question target are identified in the top 1000 documents retrieved by an information retrieval system; then, kernel facts are identified in those sentences using criteria such as the presence of copula or appositive constructions involving the question target, the matching of a number of structural patterns (e.g., TERM is a NP), the presence of special predicate-argument structures (e.g., PERSON was born on DATE), or the presence of specific relations (e.g., spouse of, staff of); finally, kernel facts are ranked by a metric that takes into account their type and their similarity (using the tf*idf metric) to a question profile constructed from on-line sources or from the set of identified kernel facts.

QUALIFIER (Yang, Cui, Maslennikov, Qiu, Kan, & Chua, 2003) obtained the second best performance using a data-driven approach to definitional QA. The system uses linguistic tools such as fine-grained named entity recognition and coreference resolution. WordNet and the Web are used to expand the original definition question in order to bridge the semantic gap between query space and document space. Given a set of documents retrieved from AQUAINT after query expansion, extractive techniques similar to those used in text summarization are applied. The basic measure used to score sentences is a logarithmic sum of a variant of the tf*idf measure for each word. This metric scores a word proportionally to the number of times it appears in sentences containing the definiendum and inversely proportionally to the number of times it appears in sentences that do not contain the definiendum. Word scores are computed from two sources, AQUAINT sentences and Web sentences; sentence scores are first computed separately using the AQUAINT and Web word scores, and these scores are then combined linearly to obtain the final sentence value. Once all sentences have been evaluated and ranked, an iterative redundancy removal technique is applied to discard definitional sentences already present in the answer set.

The DefScriber system (Blair-Goldensohn, McKeown, & Schlaikjer, 2003) combines world knowledge in the form of definitional predicates (genus, species, and non-specific) with data-driven statistical techniques. World knowledge about the predicates is acquired by machine learning over annotated data. Data-driven techniques, including the vector space model and cosine distance, are used to exploit the redundancy of information about the definiendum in non-definitional Web texts. Fleischman, Hovy, and Echihabi (2003) create pairs of concept instances, such as "Bill Clinton-president", by mining newspaper articles and web documents. These pairs constitute pre-available knowledge that can be used to answer "Who is?" questions. They use lexico-syntactic patterns learnt from annotated data to identify such pairs. The knowledge acquired can be used to answer two different types of definition question, "Who" and "What".

Our Approach

In order to find good definitions, it is useful to have a collection of metalanguage statements (e.g., DEFINIENDUM is a, DEFINIENDUM consists of, etc.) which implement patterns for the identification and extraction of the "definiens" (the statement of the meaning of the definiendum). Unfortunately, there are so many ways in which definitions are conveyed in natural language that it is difficult to come up with a full set of linguistic patterns to solve the problem.
To make matters more complex, patterns are usually ambiguous, matching non-definitional as well as definitional contexts. For example, a pattern like "Goth is a", used to find definitions of Goth, will match "Becoming a goth is a process that demands lots of effort" as well as "Goth is a subculture". We describe a method that, before trying to define the definiendum using the given text collection, mines external sources for terms that co-occur with it; this knowledge is then used for definition identification and extraction. There are two sources of knowledge we rely on for finding definitions: linguistic patterns, which represent general knowledge about how definitions are expressed in language, and secondary terms, which represent specific knowledge about the definiendum outside the target collection.

Linguistic Patterns

Definition patterns or metalanguage statements containing lexical, syntactic, and sometimes semantic information have been used in past research in terminology (Pearson, 1998), ontology induction (Hearst, 1992), and text summarization (Saggion & Lapalme, 2002), among others. When a corpus for specific purposes is available, patterns can be combined with well formed terms or specific words to restrict their inherent ambiguity. One simple formal defining expositive proposed by Pearson (1998) is X = Y + distinguishing characteristics, where possible fillers for X are well formed terms (word sequences following specific patterns), fillers for Y are terms or specific words from a particular word list (e.g., method, technique, etc.), and fillers for = are connective verbs such as to be, consist, or know. The use of predefined word lists or term formation criteria is, however, not possible in our case, because we are dealing with a heterogeneous text collection where the notion of term is less precise than in a corpus of a particular domain.

Dictionaries are good sources for the extraction of definition knowledge. Research in the classification and automatic analysis of dictionary entries (Barnbrook, 2002) has shown that a limited number of strategies for expressing meaning exists in those sources, and that automatic analysis can be carried out on them to extract lexical knowledge for natural language processing tasks. Barnbrook (2002) identified 16 types of definition in the Cobuild student's dictionary, together with the extraction patterns used to parse them (e.g., A/an/The TERM is/are a/an/the...). The question remains as to whether this typology of definition sentences (and the associated extraction patterns) is sufficient to identify definition statements in less structured textual sources.

We have collected, through corpus analysis and linguistic intuition, a useful set of lexical patterns for locating definition-bearing passages. The purpose of these patterns is, on the one hand, to obtain definition contexts for the definiendum outside the target collection in order to mine knowledge from them and, on the other hand, to extract definiens from the target collection. 36 patterns for general terms and 33 patterns for person profiles have been identified; a sample is shown in Table 1. The patterns used in this work contain only lexical information.

General patterns | Person patterns
define TERM as | PERSON known for
TERM and others | PERSON who was
TERM consists of | PERSON a member of

Table 1: Definition/profile patterns.

In this case, we have implemented the pattern matching process with GATE gazetteers instead of with JAPE grammars. This has the advantage of speeding up the matching process. The gazetteer lists are generated on the fly: for each term or person of interest we instantiate all the patterns and write them to gazetteer lists. For example, if "Aaron Copland" is the person to be profiled (the name is extracted from the question by performing syntactic analysis with SUPPLE), then the strings "Aaron Copland known for", "Aaron Copland who was", and "Aaron Copland a member of" are all stored in the gazetteer list files for the entity Aaron Copland and loaded into the system to carry out the linguistic analysis.
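A minimal sketch of this on-the-fly list generation follows; the pattern strings come from Table 1, while the file naming convention is our own illustration:

import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class PatternLists {
  // Instantiate the person patterns for a target and write a gazetteer list
  public static void writeList(String person, Path dir) throws IOException {
    List<String> templates = List.of(
        "PERSON known for", "PERSON who was", "PERSON a member of");
    List<String> entries = templates.stream()
        .map(t -> t.replace("PERSON", person))
        .toList();
    // one entry per line, as GATE gazetteer lists expect
    Files.write(dir.resolve(person.replace(' ', '_') + ".lst"), entries);
  }

  public static void main(String[] args) throws IOException {
    writeList("Aaron Copland", Paths.get("."));
  }
}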
Secondary Terms

Terms that co-occur with the definiendum in definition-bearing passages outside the target collection play an important role in the identification of definitions inside the target collection. For example, in the AQUAINT collection there are 217 sentences referring to Goth, but only a few of them provide useful definitional contexts. We note that the term subculture usually occurs with Goth in definitional contexts on the Web, and that there are only 6 sentences in AQUAINT which contain both terms; these 6 sentences provide useful descriptions of the term Goth, such as "the Goth subculture" and "the gloomy subculture known as Goth". The automatic identification of specific knowledge about the definiendum therefore seems crucial in this task. Our method considers nouns, verbs, and adjectives as candidate secondary terms (so a part-of-speech tagging step is essential). The sources used to obtain definition passages outside AQUAINT for mining secondary terms are the WordNet lexical database (Miller, 1995), the site of Encyclopaedia Britannica, and general pages on the Web (in further experiments we have used Wikipedia instead of Britannica as a trusted source of definitional contexts). The passages are obtained automatically from the Web by using the Google API exact search facility with each definition pattern (e.g., "Aaron Copland who was").

DEFINIENDUM | SECONDARY TERMS
Aaron Copland | music, american, composer, classical, appalachian, spring, brooklyn, etc.
golden parachutes | plans, stock, executive, compensation, millions, generous, top, etc.

Table 2: Terms that co-occur with the definiendum in definition-bearing passages.

Terms that co-occur with the definiendum are obtained in three different ways: (i) words appearing in WordNet glosses and hypernyms of the definiendum are extracted; (ii) words from Britannica sentences are extracted only if the sentence contains an explicit reference to the definiendum; (iii) words from other Web sentences are extracted only if the sentence matches a definition pattern. Extracted terms are scored by their frequency of occurrence. Table 2 shows the top ranked terms mined from on-line sources for Aaron Copland (the famous American musician who composed the ballet Appalachian Spring) and golden parachutes (very generous compensation given to top executives).

Identifying Definitions in Texts

In order to select text passages from the target collection we rely on a document retrieval system which returns relevant text paragraphs. It is worth noting that GATE integrates the Lucene information retrieval system (http://lucene.apache.org), which is appropriate for this task (we have used it in further experiments).
Different strategies can be used to select good passages from the collection. One strategy we have used is an iterative process which first queries with the target term and, if too many passages are returned, uses secondary terms in conjunction with the target term to narrow the search. We perform a linguistic analysis of each returned passage consisting of the following steps, all provided by GATE: tokenisation; sentence splitting; matching of the definiendum and of any of the definiendum's secondary terms; and pattern matching using the definition patterns (a GATE gazetteer lookup process). We restrict our analysis of definitions to the sentence level. A sentence is considered a definition-bearing sentence if it matches a definition pattern, or if it contains the definiendum and at least three secondary terms. We perform sentence compression by extracting the sentence fragment that is a sentence suffix and contains the main term and all secondary terms appearing in the sentence; this is done to avoid including unnecessary information from the sentence. For example, the definition of Anthony Blunt extracted from the sentence

The narrator of this antic hall-of-mirrors novel, which explores the compulsive appeal of communism for Britain's upper classes in the 1930s, is based on the distinguished art historian Anthony Blunt, who was named as a Soviet spy during the Thatcher years.

is

art historian Anthony Blunt, who was named as a Soviet spy during the Thatcher years.

All candidate definitions are proposed as answers unless they are too similar to a previously extracted answer; we measure the similarity of a candidate definition to the previously extracted definitions using tf*idf and the cosine similarity measure from the summarization toolkit.

The method described here was used in the TREC QA 2003 competition in the definition task, which required finding answers for 50 definition questions: 30 "Who" definition questions and 20 "What" definition questions. TREC assessors created for each question a list of acceptable information nuggets (pieces of text) from all returned system answers and from information discovered during question development. Some nuggets are considered essential (i.e., pieces of information that must be part of the definition) while others are non-essential. During evaluation, the assessor takes each system response and marks all the essential and non-essential nuggets it contains. The score for each question combines nugget recall (NR) and nugget precision (NP) based on length. These scores are combined in an F-score measure with recall five times as important as precision.
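For reference, an F-score in which recall counts five times as much as precision is the standard F_beta measure with beta = 5; reconstructing the evaluation just described in LaTeX notation:

F_{\beta} = \frac{(\beta^{2} + 1) \cdot NP \cdot NR}{\beta^{2} \cdot NP + NR}, \qquad \beta = 5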
We obtained a combined F-score of 0.236; for comparison, the F-scores across the systems that participated in the competition were 0.555 (best), 0.192 (median), and 0.000 (worst), and our method was ranked among the top 10 of the 25 participants. The method was later improved by incorporating automatically induced definition patterns, used in addition to the manually collected ones, and by considerably increasing the number of documents retrieved for analysis (Gaizauskas, Greenwood, Hepple, Saggion, & Sargaison, 2004).

Profile-based Summarization

The National Institute of Standards and Technology (NIST), with support from the Defense Advanced Research Projects Agency (DARPA), is conducting a series of evaluations in the area of text summarization, the Document Understanding Conferences (DUC), which provide a framework for the system-independent evaluation of text summarization systems. In DUC 2004, one multi-document summarization task consisted of the following: given a document cluster and a question of the form Who is X?, where X is the name of a person, create a short multi-document summary of the cluster that responds to the question. This task is less complicated than the previously described one in that the process of finding the input passages for profiling is not necessary.

One of the first multi-document summarization systems to produce summaries from templates was SUMMONS (Radev & McKeown, 1998). One of its key components, relevant to research on profile creation, is a knowledge base of person profiles which supports the generation of person descriptions during summary generation. The database is populated from on-line sources by identifying descriptions using a set of linguistic patterns representing pre-modifiers or appositions. Schiffman et al. (2001) use corpus statistics together with linguistic knowledge to identify and weight descriptions of people to be included in a biography: syntactic information is used to identify appositives describing people, as well as sentences in which the target person is the subject, and the mutual information statistic computed between verbs and subjects in a corpus is used to score and rank descriptions of the sought entity. Zhou et al. (2004) use content reduction techniques as proposed by Marcu (1999), but the initial content of the summary is identified by a sentence classifier trained over a corpus of annotated biographies; the classifier identifies sentences referring to different aspects of a person's life. Many multi-document summarization algorithms take advantage of the redundancy of information to measure relevance when deciding which sentences of the cluster to select. However, one of the best performing systems in the profile-based summarization task at DUC 2004 used syntactic criteria alone to select sentences for a profile: only two types of construction, appositives and copula constructions, both of which rely on syntactic analysis, were used in that work (Lacatusu, Hickl, Harabagiu, & Nezda, 2004). We follow a pattern-based shallow approach to the identification of profile information, combined with a greedy search algorithm informed by corpus statistics, and we will show that our proposed solution ranks consistently high on the DUC 2004 data.

Sentence Extraction System

Here we focus on the problem of extracting from a document cluster the information necessary to create a person's profile; the problem of synthesis, the production of a coherent and cohesive biography, will not be discussed. In order to select the content for a profile we have developed a sentence selection mechanism which, given a target person and a cluster of documents referring to the target, extracts relevant content from the cluster and produces a summary containing information such as the person's age and profession (e.g., "Hawking, 56, is the Lucasian Professor of Mathematics...") and life events
(e.g., "Hawking, 56, suffers from Lou Gehrig's Disease, which affects his motor skills..."). The summary is created from sentence fragments of the documents, which have been analysed by a number of natural language processing components. The main steps in the process are:

* First, a pool of relevant sentences is identified as candidates in the input documents using a pattern-matching algorithm;
* Second, redundancy removal is carried out to eliminate repeated information from the pool of sentences;
* Finally, the set of candidate sentences is reduced to match the required compression rate by a greedy sentence rejection mechanism.

Two pieces of machinery are key in this process: the GATE coreference resolution algorithm, and a pattern matching mechanism, implemented in JAPE, which targets specific contexts in which a person is mentioned.

Pre-Processing

We carry out document analysis using the GATE tools discussed above. Modifications to the sentence identification module were made to deal with the format of the input documents, and the named entity recogniser was adapted in the following way: we create, on the fly and for each target, the gazetteer lists needed to identify the target entity in the text. These lists contain the full name of the target person and his or her last name. We also provide gender information to the named entity recogniser by identifying the gender of the target in the input set: using the distribution of male and female pronouns in the person cluster, the system guesses the gender of the target entity (the most frequent gender). This information is key for the coreference algorithm, which uses information provided by the named entity recogniser to decide upon pronouns coreferring with the target. GATE part-of-speech tagging is also used, as is a noun phrase chunker available in GATE as a plug-in; no syntactic or semantic analysis was necessary for the experiments reported here. After coreference resolution, a further step identifies the coreference chain of the target entity (the expressions in the text referring to it) and marks each member of the chain so that the pattern matching mechanism can be applied.

Content Selection

In order to select candidate sentences for the extract we rely on a number of patterns that have been proposed in the past for identifying descriptive phrases in text collections. The list of patterns used in the system, proposed by Joho and Sanderson (2000), is given in Table 3. In this specification, DP is a descriptive phrase, which is taken to be a noun phrase. Our implementation of the patterns makes use of coreference information, so that TARGET is any expression in the text which is coreferent with the sought person. We have implemented the patterns in JAPE, using the information provided by the noun phrase chunker to implement the DP element of the patterns.

Pattern | Example
TARGET (is | was) (a|the|an|...) DP | Gen. Clark is a supreme commander...
TARGET (who|whose|...) | Glass, who has written...
TARGET, (a|the|one|...) DP | Sonia Gandhi, the Italian-born widow...
TARGET, DP | Hawking, 56...
TARGET's | Darforth's law offices...
TARGET and other | Reno and other officials...

Table 3: Set of patterns for identifying profile information.
For example, a variation of the first pattern is implemented in JAPE as follows:

Rule: KeyPhrase1
(
 {KeyPerson}
 ({Token.string == "is"} | {Token.string == "was"})
 {NounChunk}
):annotate
-->
:annotate.KeyPhrase = {}

Note that because the NounChunk annotation produced by the noun chunker covers initial determiners, there is no need to make determiners explicit in the JAPE rule. The process of content selection is simple: a sentence is considered a candidate for the extract if it matches a pattern (i.e., it contains an annotation of type "KeyPhrase"). We perform sentence compression by removing from the candidate sentence the longest suffix which does not match a pattern. The selected sentences are sorted according to their similarity to the centroid of the cluster (using the tools from the summarization toolkit).

In order to filter out redundant information, we use our n-gram similarity detection metric in the following way: a pattern-matched sentence is included in the list of candidate sentences only if it is different from all the candidates already in the list. To implement such a procedure, a threshold for the n-gram similarity metric has to be established so that one can decide whether two sentences contain different information. Such a threshold could be estimated from a corpus annotated with sentences known to be different; as such a corpus is not available to us, we make the hypothesis that within a given document all sentences report different information, and that the n-gram similarity values between them can therefore be used to estimate a similarity threshold. We computed pairwise n-gram similarity values between the sentences of each document and estimated the dissimilarity threshold as the average of these pairwise values.

Greedy Sentence Removal

Most sentence extraction algorithms work constructively: given a document and a sentence scoring mechanism, the algorithm ranks sentences by score and then chooses sentences from the ranked list until a compression rate is reached. We take a different approach, which consists in removing sentences from a pool of candidate sentences until the desired compression is achieved. The question is, given that an exhaustive search is implausible, how to reduce the given candidate set so that the content is optimal. The approach is similar to Marcu's (1999) algorithm for the creation of extracts from (document, abstract) pairs, in which clauses are greedily deleted from the document in order to obtain an extract which is maximally similar to the abstract. In our case, as we do not have an oracle which gives us the ideal abstract we want to construct, we assume that the candidate list of sentences referring to the target person is the ideal content to include in the final summary. Given a set of sentences C which covers the essential information about a person, the algorithm creates an extract which is close in content to C but reduced in form. The measure of proximity between documents is the cosine between two vectors of term representations of the documents (as in an information retrieval context). At each step, the algorithm greedily rejects one sentence from the extract: the rejected sentence is the one whose removal produces a pseudo-document which is closest to C among all the possible pseudo-documents. The algorithm is first called with a vector of terms created from the candidate list of sentences (obtained by the sentence selection and redundancy removal steps described above), the candidate list of sentences itself, and a given compression rate. Note that summarisation by sentence rejection has a long history in text summarisation: it was used in the ADAM system (Pollock & Zamora, 1975) to reject sentences based on a cue-word list, and also in the British Library Automatic Abstracting Project (BLAB) to exclude sentences with dangling anaphora which cannot be resolved in context (Johnson, Paice, Black, & Neal, 1993).
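A compact sketch of the greedy rejection loop is given below. It is illustrative only: the term vectors and cosine similarity are simplified, and the compression criterion here counts sentences rather than summary bytes.

import java.util.*;

public class GreedyRejection {
  // Cosine similarity between two term-frequency vectors
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, na = 0, nb = 0;
    for (Map.Entry<String, Integer> e : a.entrySet())
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
    for (int v : a.values()) na += (double) v * v;
    for (int v : b.values()) nb += (double) v * v;
    return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Term-frequency vector of a list of sentences (each a list of tokens)
  static Map<String, Integer> vector(List<List<String>> sentences) {
    Map<String, Integer> v = new HashMap<>();
    for (List<String> s : sentences)
      for (String t : s) v.merge(t, 1, Integer::sum);
    return v;
  }

  // Greedily reject sentences until at most maxSentences remain, keeping
  // the extract maximally similar to the full candidate set C
  static List<List<String>> reduce(List<List<String>> candidates, int maxSentences) {
    Map<String, Integer> ideal = vector(candidates); // the content of C
    List<List<String>> extract = new ArrayList<>(candidates);
    while (extract.size() > maxSentences) {
      int bestIdx = -1;
      double bestSim = -1;
      for (int i = 0; i < extract.size(); i++) {
        List<List<String>> without = new ArrayList<>(extract);
        without.remove(i);
        double sim = cosine(vector(without), ideal);
        if (sim > bestSim) { bestSim = sim; bestIdx = i; }
      }
      extract.remove(bestIdx); // reject the sentence whose loss hurts least
    }
    return extract;
  }
}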
Evaluation

The data used in the experiments reported here is the DUC 2004 Task 5 data, which consists of 50 Who is X? questions and 50 document clusters (one per question); each cluster contains around 10 documents from news agencies. For each cluster in the data set, human analysts created ideal or reference summaries against which the peer (system) summaries are compared. In order to take advantage of the document markup, we transformed the original documents into XML so that the processing components could concentrate on the textual information of the documents alone. Given the question target and the document cluster, we created 665-byte-long summaries following the method described above. We followed the DUC 2004 procedure to evaluate the content of the automatic summaries and compared our system against other algorithms.

Since evaluating summary quality manually requires human judgements, which are expensive to obtain, automatic evaluation metrics have been a focus of research in recent years (Saggion, Radev, Teufel, & Lam, 2002). In particular, the Document Understanding Conferences have adopted ROUGE (Lin, 2004), a statistic for the automatic evaluation of summaries. ROUGE allows the computation of recall-based metrics using n-gram matching between a candidate summary and a set of reference summaries. The official DUC 2004 evaluation is based on six metrics: ROUGE-N (N = 1, 2, 3, 4), based on n-gram matches; ROUGE-L, a recall metric based on the longest common subsequence between peer and ideal summary; and ROUGE-W, a weighted longest common subsequence metric that takes distances into account when computing the longest common subsequence. When multiple references are available in an evaluation, the ROUGE statistic is defined as the best score obtained by the summary when compared with each reference.
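For a single reference, the ROUGE-N recall statistic can be written as follows (the standard formulation from Lin (2004); with multiple references, the best per-reference score is taken, as noted above):

\mathrm{ROUGE\text{-}N} = \frac{\sum_{gram_{n} \in \mathit{Reference}} \mathrm{Count}_{match}(gram_{n})}{\sum_{gram_{n} \in \mathit{Reference}} \mathrm{Count}(gram_{n})}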
Recent experiments have shown that some ROUGE scores correlate with rankings produced by humans. According to the ROUGE scores, our pattern-based summarizer consistently obtains the highest scores for all ROUGE metrics, while the other algorithms we compared our system against are generally less consistent in their rankings; a system from Columbia University is also rather consistent, obtaining the second best score for all metrics except ROUGE-4. While evaluation by humans is desirable because it measures how helpful summaries are in a given task, automatic evaluation based on multiple reference summaries makes comparison across sites possible. Task-based evaluation of summaries was the focus of SUMMAC, the first large-scale task-based evaluation of text summarization systems (Mani, Klein, House, Hirschman, Firmin, & Sundheim, 2002).

CONCLUSION

With the development of the Internet and the ever increasing availability of on-line textual information, automating the extraction of relevant information about real entities or common terms has become increasingly important for intelligence gathering and knowledge management activities, among others. In this chapter we have described two specific text mining tasks: creating profiles from a cluster of relevant documents, and finding definitions in massive open text collections. We have described the robust yet adaptable tools used as the analytical apparatus for the implementations. The methods proposed for both tasks have proved very effective, as demonstrated by the results obtained in international evaluations in natural language processing. In spite of this success, and considering how far automatic systems are from human performance on the same tasks, much remains to be done.

One issue that needs to be tackled when extracting and combining information from multiple documents is cross-document coreference. The systems described here rely on the identification of the name of the target entity, but they do not identify the referent of the entity or determine whether different mentions refer to the same entity in the real world. Some work has been carried out on this subject in the past few years, and it is a problem which forms part of our research agenda. We have made little use of syntactic and semantic information here; it should be noted, however, that a great deal can be learned from the semantic analysis of text: for example, syntactic and semantic analysis of a corpus of profiles can bring to light the predicates and arguments typically used to express core biographical facts, which in turn could be used to detect that information in new texts. Another issue we have not addressed here is the presentation of information: how a coherent, well formed profile or answer to a definition question is to be presented. Problems to be addressed here include sentence ordering, sentence reduction, and sentence combination. The approaches described here are largely unsupervised, the patterns used being inspired by corpus analysis; given the increasing interest in corpus-based, statistical techniques in natural language processing, it remains an open question whether better performance could be obtained by applying machine learning approaches to all or some parts of the applications described here.

ACKNOWLEDGMENTS

The work described in this chapter was produced while the author was working on the Cubreporter Project (EPSRC Grant R91465). The chapter was written while the author was working on the EU Musing Project (IST-2004-027097).

References

Barnbrook, G. (2002) Defining Language. A local grammar of definition sentences. John Benjamins Publishing Company.

Brill, E. (1995) Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4), 543-565.

Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002) GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of ACL 2002.

Dimitrov, M., Bontcheva, K., Cunningham, H., & Maynard, D. (2004) A Light-weight Approach to Coreference Resolution for Named Entities in Text. In Branco, A., McEnery, T., & Mitkov, R. (eds.), Anaphora Processing: Linguistic, Cognitive and Computational Modelling. John Benjamins Publishing Company.
Gaizauskas, R., Greenwood, M., Hepple, M., Roberts, T., Saggion, H., & Sargaison, M. (2004) The University of Sheffield's TREC 2004 Q&A. In Proceedings of TREC 2004.
Gaizauskas, R., Hepple, M., Saggion, H., Greenwood, M., & Humphreys, K. (2005) SUPPLE: A practical parser for natural language engineering applications. In Proceedings of the International Workshop on Parsing Technologies.
Harabagiu, S., Moldovan, D., Clark, M., Bowden, J., Williams, J., & Bensley, J. (2003) Answer Mining by Combining Extraction Techniques with Abductive Reasoning. In Proceedings of TREC 2003.
Hearst, M.A. (1992) Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of COLING-92, Nantes.
Johnson, F.C., Paice, C.D., Black, W.J., & Neal, A. (1993) The application of linguistic processing to automatic abstract generation. Journal of Document & Text Management, 1, 215-241.
Joho, H., & Sanderson, M. (2000) Retrieving Descriptive Phrases from Large Amounts of Free Text. In Proceedings of the Conference on Information and Knowledge Management (CIKM), ACM, 180-186.
Kouylekov, M., Magnini, B., Negri, M., & Tanev, H. (2003) ITC-irst at TREC 2003: the DIOGENE QA system. In Proceedings of TREC 2003.
Lacatusu, F., Hickl, A., Harabagiu, S., & Nezda, L. (2004) Lite-GISTexter at DUC 2004. In Proceedings of DUC 2004, NIST.
Lin, C.-Y. (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization, Barcelona, ACL.
Mani, I. (2001) Automatic Text Summarization. John Benjamins Publishing Company.
Mani, I., Klein, G., House, D., Hirschman, L., Firmin, T., & Sundheim, B. (2002) SUMMAC: A text summarization evaluation. Natural Language Engineering, 8(1), 43-68.
Marcu, D. (1999) The automatic construction of large-scale corpora for summarization research. In Hearst, M., Gey, F., & Tong, R. (Eds.), Proceedings of SIGIR'99, 22nd International Conference on Research and Development in Information Retrieval, University of California, Berkeley, 137-144.
Maynard, D., Bontcheva, K., & Cunningham, H. (2003) Towards a semantic extraction of named entities. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.
Miller, G.A. (1995) WordNet: A Lexical Database. Communications of the ACM, 38(11), 39-41.
Mitkov, R. (1999) Anaphora resolution: the state of the art. University of Wolverhampton, Wolverhampton.
Pearson, J. (1998) Terms in Context, volume 1 of Studies in Corpus Linguistics. John Benjamins Publishing Company.
Pollock, J., & Zamora, A. (1975) Automatic abstracting research at Chemical Abstracts Service. Journal of Chemical Information and Computer Sciences, 226-233.
Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J., Çelebi, A., Dimitrov, S., Drabek, E., Hakim, A., Lam, W., Liu, D., Otterbacher, J., Qi, H., Saggion, H., Teufel, S., Topper, M., Winkel, A., & Zhang, Z. (2004) MEAD - a platform for multidocument multilingual text summarization. In Proceedings of LREC 2004, Lisbon, Portugal.
Radev, D.R., & McKeown, K.R. (1998) Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3), 469-500.
Saggion, H., & Gaizauskas, R. (2004a) Mining on-line sources for definition knowledge. In Proceedings of the 17th FLAIRS Conference, Miami Beach, Florida, USA, AAAI.
Saggion, H., & Gaizauskas, R. (2004b) Multi-document summarization by cluster/profile relevance and redundancy removal.
In Proceedings of the Document Understanding Conference 2004, NIST.
Saggion, H., & Gaizauskas, R. (2005) Experiments on statistical and pattern-based biographical summarization. TEMA Workshop.
Saggion, H., & Lapalme, G. (2002) Generating Indicative-Informative Summaries with SumUM. Computational Linguistics, 28(4), 497-526.
Saggion, H. (2002) Shallow-based Robust Summarization. ATALA Workshop, Paris.
Saggion, H., Radev, D., Teufel, S., & Lam, W. (2002) Meta-evaluation of Summaries in a Cross-lingual Environment using Content-based Metrics. In Proceedings of COLING 2002, Taipei, Taiwan, 849-855.
Schiffman, B., Mani, I., & Concepcion, K. (2001) Producing Biographical Summaries: Combining Linguistic Knowledge with Corpus Statistics. In Proceedings of EACL/ACL.
Wang, T., Li, Y., Bontcheva, K., Cunningham, H., & Wang, J. (2006) Automatic Extraction of Hierarchical Relations from Text. In Proceedings of the Third European Semantic Web Conference (ESWC 2006), Lecture Notes in Computer Science 4011, Springer.
Xu, J., Licuanan, A., & Weischedel, R. (2003) TREC 2003 QA at BBN: Answering Definitional Questions. In Proceedings of TREC 2003.
Yang, H., Cui, H., Maslennikov, M., Qiu, L., Kan, M.-Y., & Chua, T.-S. (2003) QUALIFIER in TREC-12 QA Main Task. In Proceedings of TREC 2003.
Zhou, L., Ticrea, M., & Hovy, E. (2004) Multi-document Biography Summarization. In Proceedings of Empirical Methods in Natural Language Processing.

FUTURE RESEARCH DIRECTIONS

Recent years have witnessed an explosion of textual information on-line, making human language technology, and in particular text summarization and question answering, important for helping humans make informed decisions about the content of particular sources of information. On the text analysis side, future work in profile and definition mining has to address the problems of cross-document event and entity coreference, which arise when the same event or entity is described in multiple documents. In this context, important research issues have to be investigated: the identification of similar information in different sources, and the identification of complementary and contradictory information across sources. In order to develop solutions in this area, the creation of standard data sets for development, testing, and evaluation is essential. Approaches relying on superficial features are likely to produce rapid and robust solutions; however, attention should also be paid to knowledge-based approaches, which can be made portable from one domain to another through the application of machine learning techniques. On the information presentation side, techniques are required for producing good-quality summaries and complex answers. With the availability of massive text collections, progress is expected in the area of trainable language generation for profile generation. It is important to note that human language technology faces new challenges with the adoption of Web 2.0, because of its multi-source and multi-lingual nature. Intelligent access to information in collaborative environments through summarization and question answering technology will certainly be a target of current and future applications.

ADDITIONAL READING

Bagga, A., & Baldwin, B. (1998) Entity-Based Cross-Document Coreferencing Using the Vector Space Model.
In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL'98).
Blair-Goldensohn, S., McKeown, K., & Schlaikjer, A. (2003) A hybrid approach for answering definitional questions. In Proceedings of the 26th ACM SIGIR Conference, Toronto, Canada, ACM.
Chen, Y., Zhou, M., & Wang, S. (2006) Reranking Answers for Definitional QA Using Language Modeling. In Proceedings of COLING/ACL 2006.
Fleischman, M., Hovy, E., & Echihabi, A. (2003) Offline strategies for online question answering: Answering questions before they are asked. In Proceedings of ACL 2003, 1-7, ACL.
Hildebrandt, W., Katz, B., & Lin, J. (2004) Answering Definition Questions Using Multiple Knowledge Sources. In Proceedings of HLT-NAACL 2004, 49-56.
Phan, X.-H., Nguyen, L.-M., & Horiguchi, S. (2006) Personal Name Resolution Crossover Documents by a Semantics-Based Approach. IEICE Transactions on Information and Systems, E89-D(2), 825-836.
Ravichandran, D., & Hovy, E. (2002) Learning Surface Text Patterns for a Question Answering System. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 41-47.
Sierra, G., Medina, A., Alarcón, R., & Aguilar, C. (2003) Towards the Extraction of Conceptual Information From Corpora. In Archer, D., Rayson, P., Wilson, A., & McEnery, T. (Eds.), Proceedings of the Corpus Linguistics 2003 Conference, 691-697. University Centre for Computer Corpus Research on Language.
Document Understanding Conferences (DUC): http://duc.nist.gov/
Text Retrieval Conferences - Question Answering Track (TREC/QA): http://trec.nist.gov/data/qa.html

1 http://www.dcs.shef.ac.uk/~saggion/summa/default.htm