
Chapter 16
Domain Specific Resources

As soon as more-or-less faithful replication has evolved, then natural selection begins to work. To say this is not to invoke some magic principle, some deus ex machina; natural selection in this sense is a logical necessity, not a theory waiting to be proved. It is inevitable that those cells more efficient at capturing and using energy, and of replicating more faithfully, would survive and their progeny spread; those less efficient would tend to die out, their contents re-absorbed and used by others. Two great evolutionary processes occur simultaneously. The one, beloved by many popular science writers, is about competition, the struggle for existence between rivals. Darwin begins here, and orthodox Darwinians tend both to begin and end here. But the second process, less often discussed today, perhaps because less in accord with the spirit of the times, is about co-operation, the teaming up of cells with particular specialisms to work together. For example, one type of cell may evolve a set of enzymes enabling it to metabolise molecules produced as waste material by another. There are many such examples of symbiosis in today’s multitudinous world. Think, amongst the most obvious, of the complex relationships we have with the myriad bacteria – largely Escherichia coli – that inhabit our own guts, and without whose co-operation in our digestive processes we would be unable to survive. In extreme cases, cells with different specific specialisms may even merge to form a single organism combining both, a process called symbiogenesis.

Symbiogenesis is now believed to have been the origin of mitochondria, the energy-converting structures present in all of today’s cells, as well as the photosynthesising chloroplasts present in green plants.

Stephen Rose, The Future of the Brain: The Promise and Perils of Tomorrow’s Neuroscience, 2005, (p. 18).

The majority of GATE plugins work well on any English language document (see Chapter 15 for details on non-English language support). Some domains, however, produce documents that use unusual terms, phrases or syntax. In such cases domain specific processing resources are often required in order to extract useful or interesting information. This chapter documents GATE resources that have been developed for specific domains.

16.1 Biomedical Support

Documents from the biomedical domain pose a number of challenges, including a highly specialised vocabulary, words that mix case and digits and therefore require unusual tokenization, and common English words used in a domain specific sense. Many of these problems can only be solved through the use of domain specific resources.
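
To make the tokenization problem concrete, the following minimal sketch compares two regular-expression rules on a short biomedical phrase. It is an illustration only: the class name, both patterns and the sample text are hypothetical and are not taken from any GATE tokenizer. The first rule separates letters, digits and punctuation, as a general-purpose tokenizer might; the second keeps letters and digits joined by internal hyphens in one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of GATE or any of its plugins.
class BiomedicalTokenSketch {
    // General-purpose rule: a run of letters, a run of digits,
    // or any single other non-space character.
    static final Pattern GENERIC = Pattern.compile("\\p{L}+|\\p{N}+|\\S");

    // Domain-aware rule: letters and digits stay together, and internal
    // hyphens join the parts (so "IL-2" and "p53" are single tokens).
    static final Pattern DOMAIN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*|\\S");

    // Returns every match of the rule, in order of appearance.
    static List<String> tokenize(Pattern rule, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "IL-2 activates p53";
        System.out.println(tokenize(GENERIC, text)); // [IL, -, 2, activates, p, 53]
        System.out.println(tokenize(DOMAIN, text));  // [IL-2, activates, p53]
    }
}
```

With the general-purpose rule the gene and protein names are broken into fragments that a later gazetteer or tagger would have to reassemble, which is why several of the resources below ship their own tokenizer.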

Some of the processing resources documented elsewhere in this user guide can be adapted with little or no effort to help with processing biomedical documents. The Large Knowledge Base Gazetteer (Section 13.9) can be initialized against a biomedical ontology such as Linked Life Data in order to annotate many different domain specific concepts. The Language Identification PR (Section 15.1) can also be trained to differentiate between document domains instead of languages, which could help target specific resources to specific documents using a conditional corpus pipeline.

Many plugins can also be used “as is” to extract information from biomedical documents. For example, the Measurements Tagger (Section 21.6) can be used to extract information about the dose of a medication, or the weight of patients in a study.

The rest of this section, however, documents the resources included with GATE which are focused purely on processing biomedical documents.

16.1.1 ABNER

ABNER is A Biomedical Named Entity Recogniser [Settles 05]. It uses machine learning (linear-chain conditional random fields, CRFs) to find entities such as genes, cell types, and DNA in text. Full details of ABNER can be found at http://pages.cs.wisc.edu/~bsettles/abner/

To use ABNER within GATE, first load the Tagger_Abner plugin through the plugins console, and then create a new ABNER Tagger PR in the usual way. The ABNER Tagger PR has no initialization parameters and it does not require any other PRs to be run prior to execution. Configuration of the tagger is performed using the following runtime parameters:

The tagger finds and annotates entities of the following types:

If an annotationName is specified then these types will appear as features on the created annotations, otherwise they will be used as the names of the annotations themselves.

ABNER does support training of models on other data, but this functionality is not supported by the GATE wrapper.

For further details please refer to the ABNER documentation at http://pages.cs.wisc.edu/~bsettles/abner/

16.1.2 MetaMap

MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus [Aronson & Lang 10].

The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver10 and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features.

To use this plugin, you will need access to a remote MetaMap server, or install one locally by downloading and installing the complete distribution:

http://metamap.nlm.nih.gov/

and Java PrologBeans mmserver

http://metamap.nlm.nih.gov/README_javaapi.html

The default mmserver10 host and port are localhost and 8066. To use a different server location and/or port, see the above API documentation and specify the --metamap_server_host and --metamap_server_port options within the metaMapOptions run-time parameter.
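
As a rough sketch of how such an option string might be assembled before it is assigned to metaMapOptions, the hypothetical helper below appends the two server options to a base flag string such as the default -Xdt. The class and method names are invented for this example, the host name is a placeholder, and the assumption that each option is written with a double hyphen and followed by its value after a space should be checked against the MetaMap Java API documentation.

```java
// Hypothetical helper; not part of the Tagger_MetaMap plugin.
class MetaMapOptionsSketch {

    // Builds a value for the metaMapOptions run-time parameter.
    // ASSUMPTION: "--option value" (space-separated) is the expected syntax.
    static String buildOptions(String baseFlags, String host, int port) {
        if (host == null || host.trim().isEmpty()) {
            throw new IllegalArgumentException("host must not be empty");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        String base = baseFlags == null ? "" : baseFlags.trim();
        String options = base
                + " --metamap_server_host " + host.trim()
                + " --metamap_server_port " + port;
        return options.trim(); // drops the leading space when base is empty
    }

    public static void main(String[] args) {
        // "mm.example.org" is a placeholder host name.
        System.out.println(buildOptions("-Xdt", "mm.example.org", 8066));
    }
}
```

Validating the port up front keeps a malformed option string from reaching the server, where the resulting error would be harder to trace back to the parameter.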

Run-time parameters

  1. annotateNegEx: set this to true to add NegEx features to annotations (NegExType and NegExTrigger). See http://code.google.com/p/negex/ for more information on NegEx
  2. annotatePhrases: set to true to output MetaMap phrase-level annotations (generally noun-phrase chunks). Only phrases containing a MetaMap mapping will be annotated. Can be useful for post-coordination of phrase-level terms that do not exist in a pre-coordinated form in UMLS.
  3. inputASName: input Annotation Set name. Use in conjunction with inputASTypes (see below). Unless specified, the entire document content will be sent to MetaMap.
  4. inputASTypes: only send the content of these annotations within inputASName to MetaMap and add new MetaMap annotations inside each. Unless specified, the entire document content will be sent to MetaMap.
  5. inputASTypeFeature: send the content of this feature within inputASTypes to MetaMap and wrap a new MetaMap annotation around each annotation in inputASTypes. If the feature is empty or does not exist, then the annotation content is sent instead.
  6. metaMapOptions: set parameter-less MetaMap options here. Default is -Xdt (truncate Candidates mappings, disallow derivational variants and do not use full text parsing). See http://metamap.nlm.nih.gov/README_javaapi.html for more details. NB: only set the -y parameter (word-sense disambiguation) if wsdserverctl is running.
  7. outputASName: output Annotation Set name.
  8. outputASType: output annotation name to be used for all MetaMap annotations
  9. outputMode: determines which mappings are output as annotations in the GATE document, for each phrase:
    • AllCandidatesAndMappings: annotate both Candidate and final mappings. This will usually result in multiple, overlapping annotations for each term/phrase
    • AllMappings: annotate all the final MetaMap Mappings for each phrase. This will result in fewer annotations with higher precision (e.g. for ‘lung cancer’ only the complete phrase will be annotated as Neoplastic Process [neop])
    • HighestMappingOnly: annotate only the highest scoring MetaMap Mapping for each phrase. If two Mappings have the same score, the first returned by MetaMap is output.
    • HighestMappingLowestCUI: Where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the lowest CUI.
    • HighestMappingMostSources: Where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the highest number of source vocabulary occurrences.
    • AllCandidates: annotate all Candidate mappings and not the final Mappings. This will result in more annotations with less precision (e.g. for ‘lung cancer’ both ‘lung’ (bpoc) and ‘lung cancer’ (neop) will be annotated).
  10. taggerMode: determines whether all term instances are processed by MetaMap, the first instance only, or the first instance with coreference annotations added. Only used if the inputASTypes parameter has been set.
    • FirstOccurrenceOnly: only process and annotate the first instance of each term in the document
    • CoReference: process and annotate the first instance, and add coreference annotations for the following instances
    • AllOccurrences: process and annotate all term instances independently
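
The three HighestMapping… output modes described above differ only in how a tie between equally scored mappings is resolved. The sketch below models that tie-breaking under loudly stated assumptions: the Mapping class and the enum are local stand-ins rather than the plugin's own classes, each mapping is reduced to a single CUI and a single source count (the plugin looks at the head word/phrase map event), a larger score is treated as better regardless of MetaMap's own score convention, and "lowest CUI" is taken to mean lexicographic order. All CUIs and scores are made-up values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the tie-breaking rules; not the plugin's code.
class OutputModeSketch {
    enum OutputMode { HighestMappingOnly, HighestMappingLowestCUI, HighestMappingMostSources }

    // Simplified stand-in for one MetaMap mapping of a phrase.
    static final class Mapping {
        final String cui;
        final int score;        // ASSUMPTION: larger means better
        final int sourceCount;  // number of source vocabulary occurrences

        Mapping(String cui, int score, int sourceCount) {
            this.cui = cui;
            this.score = score;
            this.sourceCount = sourceCount;
        }
    }

    // Returns the mapping to annotate, or null if the list is empty.
    static Mapping select(List<Mapping> mappings, OutputMode mode) {
        Mapping best = null;
        for (Mapping m : mappings) {
            if (best == null || m.score > best.score) {
                best = m;               // strictly better score always wins
            } else if (m.score == best.score) {
                switch (mode) {
                    case HighestMappingLowestCUI:
                        if (m.cui.compareTo(best.cui) < 0) best = m;
                        break;
                    case HighestMappingMostSources:
                        if (m.sourceCount > best.sourceCount) best = m;
                        break;
                    default:
                        break;          // HighestMappingOnly keeps the first returned
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Mapping> mappings = Arrays.asList(
                new Mapping("C0000200", 900, 3),
                new Mapping("C0000100", 900, 5),
                new Mapping("C0000150", 900, 8),
                new Mapping("C0000050", 800, 9));
        for (OutputMode mode : OutputMode.values()) {
            System.out.println(mode + " -> " + select(mappings, mode).cui);
        }
    }
}
```

For the sample data the three modes pick C0000200, C0000100 and C0000150 respectively; the lower-scored C0000050 is never chosen, even though it has the lowest CUI and the most sources.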

16.1.3 AbGene

Support for using AbGene [Tanabe & Wilbur 02] (a modified version of the Brill tagger) to annotate gene names within GATE is provided by the Tagger Framework plugin (Section 21.3).

AbGene needs to be downloaded [1] and installed externally to GATE, and then the example AbGene GATE application, provided in the resources directory of the Tagger Framework plugin, needs to be modified accordingly.

16.1.4 GENIA

A number of different biomedical language processing tools have been developed under the auspices of the GENIA Project. Support is provided within GATE for using both the GENIA sentence splitter and the tagger, which provides tokenization, part-of-speech tagging, shallow parsing and named entity recognition.

To use either the GENIA sentence splitter [2] or tagger [3] within GATE you need to have downloaded and compiled the appropriate programs, which can then be called by the GATE PRs.

The GATE GENIA plugin provides the sentence splitter PR. The PR is configured through the following runtime parameters:

Support for the GENIA tagger within GATE is handled by the Tagger Framework which is documented in Section 21.3.

Together these two components in a GATE pipeline provide a biomedical equivalent of ANNIE (minus the orthographic coreference component). Such a pipeline is provided as an example within the GENIA plugin [4].

For more details on the GENIA tagger and its performance over biomedical text see [Tsuruoka et al. 05].

16.1.5 Penn BioTagger

The Penn BioTagger software suite [5] provides a biomedical tokenizer and three taggers for gene entities [McDonald & Pereira 05], genomic variation entities [McDonald et al. 04] and malignancy type entities [Jin et al. 06]. All four components are available within GATE via the Tagger_PennBio plugin.

The tokenizer PR is configured through two parameters, one init and one runtime, as follows:

All three taggers are configured in the same way, via one init parameter and two runtime parameters, as follows:

16.1.6 MutationFinder

MutationFinder is a high-performance IE tool designed to extract mentions of point mutations from free text [Caporaso et al. 07].
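
To give a feel for what a point mutation mention looks like in free text, the sketch below matches only the compact single-letter form (wild-type residue, position, mutant residue, as in A42G). It is a deliberately simplified, hypothetical pattern written for this illustration; it is not MutationFinder's own rule set, and the class name and sample sentence are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; far simpler than a real mutation extractor.
class PointMutationSketch {
    // The twenty standard amino acids in single-letter code.
    private static final String AA = "[ACDEFGHIKLMNPQRSTVWY]";

    // Wild-type residue, position without leading zero, mutant residue,
    // bounded on both sides so that longer identifiers are not matched.
    static final Pattern POINT_MUTATION =
            Pattern.compile("\\b(" + AA + ")([1-9][0-9]*)(" + AA + ")\\b");

    // Returns every matching mention in order of appearance.
    static List<String> findMentions(String text) {
        List<String> mentions = new ArrayList<>();
        Matcher m = POINT_MUTATION.matcher(text);
        while (m.find()) {
            mentions.add(m.group());
        }
        return mentions;
    }

    public static void main(String[] args) {
        String text = "The A42G and Q61L substitutions, unlike H1N1, were studied.";
        System.out.println(findMentions(text)); // [A42G, Q61L]
    }
}
```

A pattern this small misses three-letter and spelled-out forms and will also accept some strings that are not mutations, which is the kind of gap a dedicated tool is meant to close.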

The MutationFinder PR is configured via a single init parameter:

Once created the runtime behaviour of the PR can be controlled via the following runtime parameter:

16.1.7 NormaGene

NormaGene is a web service, provided by the BiTeM group in Geneva. The service provides tools for both gene tagging and normalization, although currently only tagging is supported by this GATE wrapper.

The NormaGene Tagger PR is configured via two runtime parameters as follows:

[1] ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/

[2] http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/

[3] http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

[4] The plugin contains a saved application, genia.xgapp, which includes both components. The runtime parameters of both components will need changing to point to your locally installed copies of the GENIA applications.

[5] http://www.seas.upenn.edu/~strctlrn/BioTagger/BioTagger.html