Chapter 16
Domain Specific Resources
As soon as more-or-less faithful replication has evolved, then natural selection begins to work. To say this is not to invoke some magic principle, some deus ex machina; natural selection in this sense is a logical necessity, not a theory waiting to be proved. It is inevitable that those cells more efficient at capturing and using energy, and of replicating more faithfully, would survive and their progeny spread; those less efficient would tend to die out, their contents re-absorbed and used by others. Two great evolutionary processes occur simultaneously. The one, beloved by many popular science writers, is about competition, the struggle for existence between rivals. Darwin begins here, and orthodox Darwinians tend both to begin and end here. But the second process, less often discussed today, perhaps because less in accord with the spirit of the times, is about co-operation, the teaming up of cells with particular specialisms to work together. For example, one type of cell may evolve a set of enzymes enabling it to metabolise molecules produced as waste material by another. There are many such examples of symbiosis in today’s multitudinous world. Think, amongst the most obvious, of the complex relationships we have with the myriad bacteria – largely Escherichia coli – that inhabit our own guts, and without whose co-operation in our digestive processes we would be unable to survive. In extreme cases, cells with different specific specialisms may even merge to form a single organism combining both, a process called symbiogenesis.
Symbiogenesis is now believed to have been the origin of mitochondria, the energy-converting structures present in all of today’s cells, as well as the photosynthesising chloroplasts present in green plants.
Steven Rose, The Future of the Brain: The Promise and Perils of Tomorrow’s Neuroscience, 2005 (p. 18).
The majority of GATE plugins work well on any English language document (see Chapter 15 for details on non-English language support). Some domains, however, produce documents that use unusual terms, phrases or syntax. In such cases domain-specific processing resources are often required in order to extract useful or interesting information. This chapter documents GATE resources that have been developed for specific domains.
16.1 Biomedical Support
Documents from the biomedical domain offer a number of challenges, including a highly specialised vocabulary, words that include mixed case and numbers requiring unusual tokenization, as well as common English words used with a domain-specific sense. Many of these problems can only be solved through the use of domain-specific resources.
Some of the processing resources documented elsewhere in this user guide can be adapted with little or no effort to help with processing biomedical documents. The Large Knowledge Base Gazetteer (Section 13.9) can be initialized against a biomedical ontology such as Linked Life Data in order to annotate many different domain-specific concepts. The Language Identification PR (Section 15.1) can also be trained to differentiate between document domains instead of languages, which could help target specific resources to specific documents using a conditional corpus pipeline.
Many plugins can also be used “as is” to extract information from biomedical documents. For example, the Measurements Tagger (Section 23.9) can be used to extract information about the dose of a medication, or the weight of patients participating in a study.
The rest of this section, however, documents the resources, either included with GATE or available for it, that focus purely on processing biomedical documents.
16.1.1 ABNER
ABNER is A Biomedical Named Entity Recogniser [Settles 05]. It uses machine learning (linear-chain conditional random fields, CRFs) to find entities such as genes, cell types, and DNA in text. Full details of ABNER can be found at http://pages.cs.wisc.edu/~bsettles/abner/
To use ABNER within GATE, first load the Tagger_Abner plugin through the plugins console, and then create a new ABNER Tagger PR in the usual way. The ABNER Tagger PR has no initialization parameters and it does not require any other PRs to be run prior to execution. Configuration of the tagger is performed using the following runtime parameters:
- abnerMode The ABNER model that will be used for tagging. The plugin can use one of two previously trained machine learning models for tagging text, as provided by ABNER:
  - BIOCREATIVE trained on the BioCreative corpus
  - NLPBA trained on the NLPBA corpus
- annotationName The name of the annotations the tagger should create (defaults to ‘Tagger’). If left blank (or null) the name of each annotation is determined by the type of entity discovered by ABNER (see below).
- outputASName The name of the annotation set in which new annotations will be created.
The tagger finds and annotates entities of the following types:
- Protein
- DNA
- RNA
- CellLine
- CellType
If an annotationName is specified then these types will appear as features on the created annotations, otherwise they will be used as the names of the annotations themselves.
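The effect of annotationName on the output can be illustrated with a small stand-alone sketch. It does not use the GATE API: the NamedAnnotation class and the "type" feature key are assumptions made purely for illustration, and the key used by the real plugin may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of how annotationName affects the naming of ABNER output (not plugin code). */
final class AbnerNamingSketch {

    /** Hypothetical feature key; the key used by the real plugin is an assumption. */
    static final String TYPE_FEATURE = "type";

    /** Minimal stand-in for an annotation: a name plus a feature map. */
    static final class NamedAnnotation {
        final String name;
        final Map<String, String> features;

        NamedAnnotation(String name, Map<String, String> features) {
            this.name = name;
            this.features = features;
        }
    }

    /** Decides the annotation name and features for one entity found by the tagger. */
    static NamedAnnotation describe(String annotationName, String entityType) {
        Map<String, String> features = new LinkedHashMap<>();
        if (annotationName == null || annotationName.trim().isEmpty()) {
            // No name configured: the entity type becomes the annotation name.
            return new NamedAnnotation(entityType, features);
        }
        // Name configured: the entity type is recorded as a feature instead.
        features.put(TYPE_FEATURE, entityType);
        return new NamedAnnotation(annotationName, features);
    }

    public static void main(String[] args) {
        NamedAnnotation named = describe("Tagger", "Protein");
        NamedAnnotation unnamed = describe(null, "CellLine");
        System.out.println(named.name + " " + named.features);
        System.out.println(unnamed.name + " " + unnamed.features);
    }
}
```

With the default annotationName every entity ends up in a single annotation type, which is convenient when later processing treats all entities alike. Leaving it blank gives one annotation type per entity type.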
ABNER does support training of models on other data, but this functionality is not supported by the GATE wrapper.
For further details please refer to the ABNER documentation at http://pages.cs.wisc.edu/~bsettles/abner/
16.1.2 MetaMap
MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus [Aronson & Lang 10].
The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features.
To use this plugin, you will need access to a remote MetaMap server, or you can install one locally by downloading and installing the complete distribution and the Java PrologBeans mmserver:
http://metamap.nlm.nih.gov/README_javaapi.html
The default mmserver host and port are localhost and 8066. To use a different host and/or port, see the above API documentation and specify the --metamap_server_host and --metamap_server_port options within the metaMapOptions run-time parameter.
Run-time parameters
- annotateNegEx: set this to true to add NegEx features to annotations (NegExType and NegExTrigger). See http://code.google.com/p/negex/ for more information on NegEx.
- annotatePhrases: set to true to output MetaMap phrase-level annotations (generally noun-phrase chunks). Only phrases containing a MetaMap mapping will be annotated. Can be useful for post-coordination of phrase-level terms that do not exist in a pre-coordinated form in UMLS.
- inputASName: input Annotation Set name. Use in conjunction with inputASTypes (see below). Unless specified, the entire document content will be sent to MetaMap.
- inputASTypes: only send the content of these annotations within inputASName to MetaMap and add new MetaMap annotations inside each. Unless specified, the entire document content will be sent to MetaMap.
- inputASTypeFeature: send the content of this feature within inputASTypes to MetaMap and wrap a new MetaMap annotation around each annotation in inputASTypes. If the feature is empty or does not exist, then the annotation content is sent instead.
- metaMapOptions: set parameter-less MetaMap options here. Default is -Xdt (truncate Candidates mappings, disallow derivational variants and do not use full text parsing). See http://metamap.nlm.nih.gov/README_javaapi.html for more details. NB: only set the -y parameter (word-sense disambiguation) if wsdserverctl is running.
- outputASName: output Annotation Set name.
- outputASType: output annotation name to be used for all MetaMap annotations.
- outputMode: determines which mappings are output as annotations in the GATE document, for each phrase:
  - AllCandidatesAndMappings: annotate both Candidate and final mappings. This will usually result in multiple, overlapping annotations for each term/phrase.
  - AllMappings: annotate all the final MetaMap Mappings for each phrase. This will result in fewer annotations with higher precision (e.g. for ‘lung cancer’ only the complete phrase will be annotated as Neoplastic Process [neop]).
  - HighestMappingOnly: annotate only the highest scoring MetaMap Mapping for each phrase. If two Mappings have the same score, the first returned by MetaMap is output.
  - HighestMappingLowestCUI: where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the lowest CUI.
  - HighestMappingMostSources: where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the highest number of source vocabulary occurrences.
  - AllCandidates: annotate all Candidate mappings and not the final Mappings. This will result in more annotations with less precision (e.g. for ‘lung cancer’ both ‘lung’ (bpoc) and ‘lung cancer’ (neop) will be annotated).
- taggerMode: determines whether all term instances are processed by MetaMap, the first instance only, or the first instance with coreference annotations added. Only used if the inputASTypes parameter has been set.
  - FirstOccurrenceOnly: only process and annotate the first instance of each term in the document.
  - CoReference: process and annotate the first instance and coreference following instances.
  - AllOccurrences: process and annotate all term instances independently.
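The three HighestMapping... values of outputMode differ only in how they break ties between equally scored mappings. The following stand-alone sketch illustrates that selection logic. It is not the plugin's code and does not use the MetaMap API: the Mapping class is a hypothetical simplification with one CUI per mapping, the CUI values are made up, and the sketch assumes that a larger number means a better score, which may not match how MetaMap itself represents scores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the tie-breaking behind three outputMode values (not the plugin's code). */
final class OutputModeSketch {

    /** Hypothetical, simplified stand-in for one MetaMap mapping of a phrase. */
    static final class Mapping {
        final String cui;      // concept identifier of the head word/phrase
        final int score;       // assumption: a larger value is a better score
        final int sourceCount; // number of source vocabulary occurrences

        Mapping(String cui, int score, int sourceCount) {
            this.cui = cui;
            this.score = score;
            this.sourceCount = sourceCount;
        }
    }

    /** Returns the mappings that share the best score, in the order received. */
    static List<Mapping> topScoring(List<Mapping> mappings) {
        int best = Integer.MIN_VALUE;
        for (Mapping m : mappings) {
            best = Math.max(best, m.score);
        }
        List<Mapping> top = new ArrayList<>();
        for (Mapping m : mappings) {
            if (m.score == best) {
                top.add(m);
            }
        }
        return top;
    }

    /** HighestMappingOnly: on a tie, keep the first mapping returned. */
    static Mapping highestMappingOnly(List<Mapping> mappings) {
        List<Mapping> top = topScoring(mappings);
        return top.isEmpty() ? null : top.get(0);
    }

    /** HighestMappingLowestCUI: on a tie, keep the mapping with the lowest CUI. */
    static Mapping highestMappingLowestCui(List<Mapping> mappings) {
        Mapping chosen = null;
        for (Mapping m : topScoring(mappings)) {
            // CUIs are compared as strings, assuming a fixed-width format.
            if (chosen == null || m.cui.compareTo(chosen.cui) < 0) {
                chosen = m;
            }
        }
        return chosen;
    }

    /** HighestMappingMostSources: on a tie, keep the mapping with the most sources. */
    static Mapping highestMappingMostSources(List<Mapping> mappings) {
        Mapping chosen = null;
        for (Mapping m : topScoring(mappings)) {
            if (chosen == null || m.sourceCount > chosen.sourceCount) {
                chosen = m;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Mapping> mappings = Arrays.asList(
                new Mapping("C0000003", 1000, 1),
                new Mapping("C0000001", 1000, 2),
                new Mapping("C0000002", 1000, 7),
                new Mapping("C0000000", 800, 9));
        System.out.println(highestMappingOnly(mappings).cui);
        System.out.println(highestMappingLowestCui(mappings).cui);
        System.out.println(highestMappingMostSources(mappings).cui);
    }
}
```

In all three cases a lower-scoring mapping is never selected, even if it has a lower CUI or more sources than the top-scoring ones.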
16.1.3 GSpell biomedical spelling suggestion and correction
This plugin wraps the GSpell API, from the National Library of Medicine Lexical Systems Group, to add spelling suggestions to features in the input/output annotations defined (default is Token). The GSpell plugin has a number of options to customise the behaviour and to reduce the number of false positives in the spelling suggestions. For example, it can ignore words and spelling suggestions shorter than a given threshold, and it can use regular expressions to filter the input to the spell checker. Two filters are provided by default: ignore capitalised abbreviations/words in all caps, and ignore words starting or ending with a digit.
There are two processing modes: WholePhrase, which will spell-check the content of defined annotations as a single phrase, and does not require any prior tokenization; and PhraseTokens, which requires a tokenizer to have been run as a prior phase.
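As a rough illustration of this kind of input filtering, the sketch below combines a minimum-length threshold with two regular expressions that approximate the default filters. The patterns, the method name and the threshold value are assumptions made for this example. They are not the expressions shipped with the plugin.

```java
import java.util.regex.Pattern;

/** Sketch of pre-filtering words before spell checking (not the plugin's code). */
final class SpellFilterSketch {

    // Hypothetical approximation of "words in all caps", e.g. DNA.
    static final Pattern ALL_CAPS = Pattern.compile("[A-Z]+");

    // Hypothetical approximation of "starts or ends with a digit", e.g. p53 or 5HT.
    static final Pattern DIGIT_EDGE = Pattern.compile("\\d.*|.*\\d");

    /** Returns true if the word should be passed on to the spell checker. */
    static boolean shouldCheck(String word, int minLength) {
        if (word.length() < minLength) {
            return false; // too short to give reliable suggestions
        }
        if (ALL_CAPS.matcher(word).matches()) {
            return false; // probably an abbreviation
        }
        if (DIGIT_EDGE.matcher(word).matches()) {
            return false; // probably an identifier or a measurement
        }
        return true;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hemoglobine", "DNA", "p53", "of"}) {
            System.out.println(word + " -> " + shouldCheck(word, 4));
        }
    }
}
```

Filtering before the spell checker is called, rather than discarding suggestions afterwards, avoids spending time on tokens that are unlikely to be misspellings.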
The GSpell plugin can be downloaded from here.
16.1.4 BADREX
BADREX (identifying Biomedical Abbreviations using Dynamic Regular Expressions) [Gooch 12] is a GATE plugin that annotates, expands and coreferences term-abbreviation pairs using parameterisable regular expressions that generalise and extend the Schwartz-Hearst algorithm [Schwartz & Hearst 03]. In addition it uses a subset of the inner–outer selection rules described in the ALICE algorithm [Ao & Takagi 05]. Rather than simply extracting terms and their abbreviations, it annotates them in situ and adds the corresponding long-form and short-form text as features on each.
In coreference mode BADREX expands all abbreviations in the text that match the short form of the most recently matched long-form–short-form pair. In addition, there is the option of annotating and classifying common medical abbreviations extracted from Wikipedia.
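For readers unfamiliar with the Schwartz-Hearst algorithm, the sketch below shows its core long-form matching step as described in the published paper: the characters of the short form are matched right to left against the candidate text, and the first character must fall at the start of a word. This is a plain Java illustration of the original algorithm only. BADREX itself works with parameterisable regular expressions, and none of the names below come from the plugin.

```java
/** Sketch of the Schwartz-Hearst long-form matching step (not BADREX's code). */
final class SchwartzHearstSketch {

    /**
     * Returns the shortest suffix of candidate that can be the long form of
     * shortForm, or null if the characters of shortForm cannot be matched.
     */
    static String findBestLongForm(String shortForm, String candidate) {
        int lIndex = candidate.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char current = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(current)) {
                continue; // punctuation in the short form is ignored
            }
            // Move left until the character matches. The first character of
            // the short form must also sit at the beginning of a word.
            while ((lIndex >= 0
                        && Character.toLowerCase(candidate.charAt(lIndex)) != current)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(candidate.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null; // ran out of text before matching everything
            }
            lIndex--;
        }
        // Extend back to the start of the word containing the first match.
        int start = candidate.lastIndexOf(' ', lIndex) + 1;
        return candidate.substring(start);
    }

    public static void main(String[] args) {
        System.out.println(findBestLongForm("HMM", "a hidden Markov model"));
        System.out.println(findBestLongForm("BP", "high blood pressure"));
    }
}
```

Here the candidate is the text immediately preceding a parenthesised short form, so "a hidden Markov model (HMM)" yields the pair "hidden Markov model" and "HMM".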
BADREX can be downloaded from GitHub.
16.1.5 MiniChem/Drug Tagger
The MiniChem Tagger is a GATE plugin that uses a small set (~500) of chemistry morphemes classified into 10 types (root, suffix, multiplier etc.), and some deterministic rules based on the Wikipedia IUPAC entries, to identify chemical names, drug names and chemical formulae in text.
The plugin can be downloaded from here.
16.1.6 AbGene
Support for using AbGene [Tanabe & Wilbur 02], a modified version of the Brill tagger, to annotate gene names within GATE is provided by the Tagger Framework plugin (Section 23.3).
AbGene needs to be downloaded¹ and installed externally to GATE and then the example AbGene GATE application, provided in the resources directory of the Tagger Framework plugin, needs to be modified accordingly.
16.1.7 GENIA
A number of different biomedical language processing tools have been developed under the auspices of the GENIA Project. Support is provided within GATE for using both the GENIA sentence splitter and the tagger, which provides tokenization, part-of-speech tagging, shallow parsing and named entity recognition.
To use either the GENIA sentence splitter² or tagger³ within GATE you need to have downloaded and compiled the appropriate programs which can then be called by the GATE PRs.
The GATE GENIA plugin provides the sentence splitter PR. The PR is configured through the following runtime parameters:
- annotationSetName the name of the annotation set in which the Sentence annotations should be created
- debug if true then details of calling the external process will be reported within the message pane
- splitterBinary the location of the GENIA sentence splitter binary
Support for the GENIA tagger within GATE is handled by the Tagger Framework which is documented in Section 23.3.
Together these two components in a GATE pipeline provide a biomedical equivalent of ANNIE (minus the orthographic coreference component). Such a pipeline is provided as an example within the GENIA plugin⁴.
For more details on the GENIA tagger and its performance over biomedical text see [Tsuruoka et al. 05].
16.1.8 Penn BioTagger
The Penn BioTagger software suite⁵ provides a biomedical tokenizer and three taggers for gene entities [McDonald & Pereira 05], genomic variation entities [McDonald et al. 04] and malignancy type entities [Jin et al. 06]. All four components are available within GATE via the Tagger_PennBio plugin.
The tokenizer PR is configured through two parameters, one init and one runtime, as follows:
- tokenizerURL this init parameter specifies the location of the tokenizer model to use (the default value points to the model distributed with the Penn BioTagger suite)
- annotationSetName this runtime parameter determines the annotation set in which Token annotations will be created
All three taggers are configured in the same way, via one init parameter and two runtime parameters, as follows:
- modelURL the location of the model used by the tagger
- inputASName the annotation set to use as input to the tagger (must contain Token annotations)
- outputASName the annotation set in which new annotations are created by the tagger
16.1.9 MutationFinder
MutationFinder is a high-performance IE tool designed to extract mentions of point mutations from free text [Caporaso et al. 07].
The MutationFinder PR is configured via a single init parameter:
- regexURL this init parameter specifies the location of the regular expression file used by MutationFinder. Note that the default value points to the file supplied with MutationFinder.
Once created the runtime behaviour of the PR can be controlled via the following runtime parameter:
- annotationSetName the name of the annotation set in which the Mutation annotations should be created
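To give a flavour of regular-expression-based mutation extraction, the sketch below matches the simplest common notation for a point mutation: a wild-type residue, a position and a mutant residue written with one-letter amino acid codes (for example A42G). The pattern is a deliberately simplified assumption for illustration. It is not taken from the regular expression file supplied with MutationFinder, which covers many more ways of writing a mutation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of regex-based point mutation spotting (not MutationFinder's patterns). */
final class PointMutationSketch {

    // The twenty standard one-letter amino acid codes.
    private static final String RESIDUE = "[ACDEFGHIKLMNPQRSTVWY]";

    // Hypothetical pattern: wild-type residue, position, mutant residue.
    static final Pattern POINT_MUTATION = Pattern.compile(
            "\\b(" + RESIDUE + ")([1-9]\\d*)(" + RESIDUE + ")\\b");

    /** Returns every mention that matches the simplified notation. */
    static List<String> findMentions(String text) {
        List<String> mentions = new ArrayList<>();
        Matcher matcher = POINT_MUTATION.matcher(text);
        while (matcher.find()) {
            mentions.add(matcher.group());
        }
        return mentions;
    }

    public static void main(String[] args) {
        System.out.println(findMentions(
                "The A42G substitution and W305R were both observed."));
    }
}
```

Each match also exposes the wild-type residue, the position and the mutant residue as capture groups, which is the kind of information that would typically be stored as features on a Mutation annotation.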
¹ ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/
² http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/
³ http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
⁴ The plugin contains a saved application, genia.xgapp, which includes both components. The runtime parameters of both components will need changing to point to your locally installed copies of the GENIA applications.
⁵ http://www.seas.upenn.edu/~strctlrn/BioTagger/BioTagger.html