Domain Specific Resources
As soon as more-or-less faithful replication has evolved, then natural selection begins to work. To say this is not to invoke some magic principle, some deus ex machina; natural selection in this sense is a logical necessity, not a theory waiting to be proved. It is inevitable that those cells more efficient at capturing and using energy, and of replicating more faithfully, would survive and their progeny spread; those less efficient would tend to die out, their contents re-absorbed and used by others. Two great evolutionary processes occur simultaneously. The one, beloved by many popular science writers, is about competition, the struggle for existence between rivals. Darwin begins here, and orthodox Darwinians tend both to begin and end here. But the second process, less often discussed today, perhaps because less in accord with the spirit of the times, is about co-operation, the teaming up of cells with particular specialisms to work together. For example, one type of cell may evolve a set of enzymes enabling it to metabolise molecules produced as waste material by another. There are many such examples of symbiosis in today’s multitudinous world. Think, amongst the most obvious, of the complex relationships we have with the myriad bacteria – largely Escherichia coli – that inhabit our own guts, and without whose co-operation in our digestive processes we would be unable to survive. In extreme cases, cells with different specific specialisms may even merge to form a single organism combining both, a process called symbiogenesis.
Symbiogenesis is now believed to have been the origin of mitochondria, the energy-converting structures present in all of today’s cells, as well as the photosynthesising chloroplasts present in green plants.
Steven Rose, The Future of the Brain: The Promise and Perils of Tomorrow’s Neuroscience, 2005, (p. 18).
The majority of GATE plugins work well on any English-language document (see Chapter 15 for details on non-English language support). Some domains, however, produce documents that use unusual terms, phrases or syntax. In such cases domain-specific processing resources are often required in order to extract useful or interesting information. This chapter documents GATE resources that have been developed for specific domains.
16.1 Biomedical Support
Documents from the biomedical domain offer a number of challenges, including a highly specialised vocabulary, words that include mixed case and numbers requiring unusual tokenization, as well as common English words used with a domain-specific sense. Many of these problems can only be solved through the use of domain-specific resources.
Some of the processing resources documented elsewhere in this user guide can be adapted with little or no effort to help with processing biomedical documents. The Large Knowledge Base Gazetteer (Section 13.9) can be initialized against a biomedical ontology such as Linked Life Data in order to annotate many different domain-specific concepts. The Language Identification PR (Section 15.1) can also be trained to differentiate between document domains instead of languages, which could help target specific resources to specific documents using a conditional corpus pipeline.
Many plugins can also be used “as is” to extract information from biomedical documents. For example, the Measurements Tagger (Section 23.9) can be used to extract information about the dose of a medication, or the weight of patients participating in a study.
The rest of this section, however, documents the resources, included with or available to GATE, that focus purely on processing biomedical documents.
ABNER is A Biomedical Named Entity Recogniser [Settles 05]. It uses machine learning (linear-chain conditional random fields, CRFs) to find entities such as genes, cell types, and DNA in text. Full details of ABNER can be found at http://pages.cs.wisc.edu/~bsettles/abner/
To use ABNER within GATE, first load the Tagger_Abner plugin through the plugins console, and then create a new ABNER Tagger PR in the usual way. The ABNER Tagger PR has no initialization parameters and it does not require any other PRs to be run prior to execution. Configuration of the tagger is performed using the following runtime parameters:
- abnerMode The ABNER model that will be used for tagging. The plugin can use one of two previously trained machine learning models for tagging text, as provided by ABNER:
  - BIOCREATIVE trained on the BioCreative corpus
  - NLPBA trained on the NLPBA corpus
- annotationName The name of the annotations the tagger should create (defaults to ‘Tagger’). If left blank (or null) the name of each annotation is determined by the type of entity discovered by ABNER (see below).
- outputASName The name of the annotation set in which new annotations will be created.
The tagger finds and annotates entities of the following types:
If an annotationName is specified then these types will appear as features on the created annotations, otherwise they will be used as the names of the annotations themselves.
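The naming rule above can be sketched in a few lines. This is a minimal illustration rather than the wrapper's actual code: the function name and the feature key `type` are assumptions made for the example, and `DNA` is used only because it is one of the entity kinds mentioned earlier.

```python
def name_annotation(entity_type, annotation_name="Tagger"):
    """Return (annotation name, features) for one entity found by the tagger.

    The feature key "type" is hypothetical; the real wrapper may use another.
    """
    if annotation_name:
        # A name was configured: the entity type is recorded as a feature.
        return annotation_name, {"type": entity_type}
    # Blank or None: the entity type itself becomes the annotation name.
    return entity_type, {}


print(name_annotation("DNA"))        # default annotationName
print(name_annotation("DNA", None))  # annotationName left blank
```

With the default setting every entity shares one annotation name and is distinguished by its feature; with a blank name each entity type gets its own annotation name.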
ABNER does support training of models on other data, but this functionality is not supported by the GATE wrapper.
For further details please refer to the ABNER documentation at http://pages.cs.wisc.edu/~bsettles/abner/
MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus [Aronson & Lang 10].
The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features.
To use this plugin, you will need access to a remote MetaMap server, or you can install one locally by downloading and installing the complete distribution and the Java PrologBeans mmserver.
The default mmserver host and port are localhost and 8066. To use a different server location and/or port, see the API documentation and specify the --metamap_server_host and --metamap_server_port options within the metaMapOptions run-time parameter.
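As a rough illustration of the paragraph above, the following sketch assembles a value for the metaMapOptions parameter. It assumes that the options are written with a double hyphen and given as space-separated option/value pairs; the helper name and the host `mm.example.org` are hypothetical, so check the MetaMap Java API documentation for the exact syntax before relying on it.

```python
def build_metamap_options(base="-Xdt", host="localhost", port=8066):
    """Build a metaMapOptions string, adding server options only when needed.

    Assumption: options are space separated and take their value as the
    next token. "-Xdt", localhost and 8066 are the defaults stated in
    this section.
    """
    parts = [base]
    if host != "localhost":
        parts += ["--metamap_server_host", host]
    if port != 8066:
        parts += ["--metamap_server_port", str(port)]
    return " ".join(parts)


print(build_metamap_options())
print(build_metamap_options(host="mm.example.org", port=8067))
```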
- annotateNegEx: set this to true to add NegEx features to annotations (NegExType and NegExTrigger). See http://code.google.com/p/negex/ for more information on NegEx
- annotatePhrases: set to true to output MetaMap phrase-level annotations (generally noun-phrase chunks). Only phrases containing a MetaMap mapping will be annotated. Can be useful for post-coordination of phrase-level terms that do not exist in a pre-coordinated form in UMLS.
- inputASName: input Annotation Set name. Use in conjunction with inputASTypes (see below). Unless specified, the entire document content will be sent to MetaMap.
- inputASTypes: only send the content of these annotations within inputASName to MetaMap and add new MetaMap annotations inside each. Unless specified, the entire document content will be sent to MetaMap.
- inputASTypeFeature: send the content of this feature within inputASTypes to MetaMap and wrap a new MetaMap annotation around each annotation in inputASTypes. If the feature is empty or does not exist, then the annotation content is sent instead.
- metaMapOptions: set parameter-less MetaMap options here. Default is -Xdt (truncate Candidates mappings, disallow derivational variants and do not use full text parsing). See http://metamap.nlm.nih.gov/README_javaapi.html for more details. NB: only set the -y parameter (word-sense disambiguation) if wsdserverctl is running.
- outputASName: output Annotation Set name.
- outputASType: output annotation name to be used for all MetaMap annotations
- outputMode: determines which mappings are output as annotations in the GATE document, for each phrase:
  - AllCandidatesAndMappings: annotate both Candidate and final mappings. This will usually result in multiple, overlapping annotations for each term/phrase
  - AllMappings: annotate all the final MetaMap Mappings for each phrase. This will result in fewer annotations with higher precision (e.g. for ‘lung cancer’ only the complete phrase will be annotated as Neoplastic Process [neop])
  - HighestMappingOnly: annotate only the highest scoring MetaMap Mapping for each phrase. If two Mappings have the same score, the first returned by MetaMap is output.
  - HighestMappingLowestCUI: Where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the lowest CUI.
  - HighestMappingMostSources: Where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the highest number of source vocabulary occurrences.
  - AllCandidates: annotate all Candidate mappings and not the final Mappings. This will result in more annotations with less precision (e.g. for ‘lung cancer’ both ‘lung’ (bpoc) and ‘lung cancer’ (neop) will be annotated).
- taggerMode: determines whether all term instances are processed by MetaMap, the first instance only, or the first instance with coreference annotations added. Only used if the inputASTypes parameter has been set.
  - FirstOccurrenceOnly: only process and annotate the first instance of each term in the document
  - CoReference: process and annotate the first instance and coreference following instances
  - AllOccurrences: process and annotate all term instances independently
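The tie-breaking behaviour of the HighestMapping output modes can be easier to follow as code. The sketch below is a simplification under stated assumptions, not the plugin's implementation: each mapping carries a single CUI and source count (the plugin looks at the head word/phrase map event), larger scores are treated as better, and the `Mapping` class, the `select` function and all CUI values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    cui: str      # concept identifier (made-up values in this sketch)
    score: float  # larger is treated as better here (an assumption)
    sources: int  # number of source vocabulary occurrences


def select(mappings, mode):
    """Pick the mappings to annotate for one phrase, in MetaMap's return order."""
    if not mappings:
        return []
    if mode == "AllMappings":
        return list(mappings)
    top = max(m.score for m in mappings)
    best = [m for m in mappings if m.score == top]  # keeps return order
    if mode == "HighestMappingOnly":
        return [best[0]]  # ties: first one returned wins
    if mode == "HighestMappingLowestCUI":
        return [min(best, key=lambda m: m.cui)]
    if mode == "HighestMappingMostSources":
        return [max(best, key=lambda m: m.sources)]
    raise ValueError(f"mode not covered by this sketch: {mode}")


phrase_mappings = [
    Mapping("C0000300", 900, 7),
    Mapping("C0000100", 900, 5),
    Mapping("C0000200", 800, 9),
]
for mode in ("HighestMappingOnly", "HighestMappingLowestCUI", "HighestMappingMostSources"):
    print(mode, select(phrase_mappings, mode)[0].cui)
```

The candidate-based modes (AllCandidates, AllCandidatesAndMappings) are left out because they operate on a different list of results.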
This plugin wraps the GSpell API, from the National Library of Medicine Lexical Systems Group, to add spelling suggestions to features in the defined input/output annotations (default is Token). The GSpell plugin has a number of options to customise its behaviour and to reduce the number of false positives in the spelling suggestions. For example, words and spelling suggestions shorter than a given threshold can be ignored, and regular expressions can be used to filter the input to the spell checker. Two filters are provided by default: one ignores capitalised abbreviations and words in all caps, the other ignores words starting or ending with a digit.
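The kind of filtering described above might look like the following sketch. The regular expressions and the minimum length of 4 are illustrative guesses, not the plugin's shipped defaults, and `should_spell_check` is a hypothetical helper name.

```python
import re

# Illustrative filters only; the plugin's own default expressions may differ.
ALL_CAPS = re.compile(r"^[A-Z]+$")    # capitalised abbreviations, e.g. COPD
EDGE_DIGIT = re.compile(r"^\d|\d$")   # words starting or ending with a digit


def should_spell_check(word, min_length=4):
    """Return True if the word should be passed on to the spell checker."""
    if len(word) < min_length:
        return False
    if ALL_CAPS.match(word):
        return False
    if EDGE_DIGIT.search(word):
        return False
    return True


for word in ["hypertenson", "COPD", "mg", "2mmol", "interleukin2"]:
    print(word, should_spell_check(word))
```

Filtering before the spell checker runs is what keeps abbreviations and dosage-like tokens from attracting spurious suggestions.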
There are two processing modes: WholePhrase, which will spell-check the content of defined annotations as a single phrase, and does not require any prior tokenization; and PhraseTokens, which requires a tokenizer to have been run as a prior phase.
The GSpell plugin can be downloaded from here.
BADREX (identifying Biomedical Abbreviations using Dynamic Regular Expressions) [Gooch 12] is a GATE plugin that annotates, expands and coreferences term-abbreviation pairs using parameterisable regular expressions that generalise and extend the Schwartz-Hearst algorithm [Schwartz & Hearst 03]. In addition it uses a subset of the inner–outer selection rules described in the ALICE algorithm [Ao & Takagi 05]. Rather than simply extracting terms and their abbreviations, it annotates them in situ and adds the corresponding long-form and short-form text as features on each.
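For readers unfamiliar with the Schwartz-Hearst algorithm that BADREX builds on, the sketch below ports its core long-form matching step to Python. It is not BADREX code: BADREX expresses the matching through regular expressions and adds further rules, and the function name is chosen for this example. The step walks the short form from right to left, finds each character in the candidate text, and requires the first character to start a word.

```python
def find_best_long_form(short_form, candidate):
    """Return the shortest tail of `candidate` that spells out `short_form`,
    or None if no such long form exists (simplified Schwartz-Hearst step)."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


print(find_best_long_form("HMM", "we use a hidden Markov model"))
```

The candidate text would normally be the words immediately preceding a parenthesised short form, as in "hidden Markov model (HMM)".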
In coreference mode BADREX expands all abbreviations in the text that match the short form of the most recently matched long-form–short-form pair. In addition, there is the option of annotating and classifying common medical abbreviations extracted from Wikipedia.
BADREX can be downloaded from GitHub.
16.1.5 MiniChem/Drug Tagger
The MiniChem Tagger is a GATE plugin that uses a small set (around 500) of chemistry morphemes classified into 10 types (root, suffix, multiplier etc.), together with some deterministic rules based on the Wikipedia IUPAC entries, to identify chemical names, drug names and chemical formulae in text.
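To give a feel for morpheme-based matching, here is a toy sketch. The morpheme table, the `substituent` type and the rule "a chemical-looking word needs at least a root and a suffix" are all invented for illustration; the plugin's real morpheme set and rules are larger and different.

```python
# Hypothetical miniature morpheme table: morpheme -> type.
MORPHEMES = {
    "meth": "root", "eth": "root", "prop": "root", "but": "root",
    "ane": "suffix", "ene": "suffix", "anol": "suffix", "yl": "suffix",
    "di": "multiplier", "tri": "multiplier",
    "chloro": "substituent",
}


def segment(word):
    """Split a word into known morphemes covering it entirely, or return None."""
    word = word.lower()
    best = [None] * (len(word) + 1)  # best[i]: a segmentation of word[:i]
    best[0] = []
    for i in range(len(word)):
        if best[i] is None:
            continue
        for morpheme, kind in MORPHEMES.items():
            end = i + len(morpheme)
            if word.startswith(morpheme, i) and best[end] is None:
                best[end] = best[i] + [(morpheme, kind)]
    return best[len(word)]


def looks_chemical(word):
    """Toy rule: the word must decompose fully and contain a root and a suffix."""
    parts = segment(word)
    if parts is None:
        return False
    kinds = {kind for _, kind in parts}
    return "root" in kinds and "suffix" in kinds


for word in ["dichloromethane", "propanol", "but", "table"]:
    print(word, looks_chemical(word))
```

The extra rule is what stops an ordinary word such as "but" from being tagged merely because it coincides with a root.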
The plugin can be downloaded from here.
AbGene needs to be downloaded and installed externally to GATE, and then the example AbGene GATE application, provided in the resources directory of the Tagger Framework plugin, needs to be modified accordingly.
A number of different biomedical language processing tools have been developed under the auspices of the GENIA Project. Support is provided within GATE for using both the GENIA sentence splitter and the tagger, which provides tokenization, part-of-speech tagging, shallow parsing and named entity recognition.
The GATE GENIA plugin provides the sentence splitter PR. The PR is configured through the following runtime parameters:
- annotationSetName the name of the annotation set in which the Sentence annotations should be created
- debug if true then details of calling the external process will be reported within the message pane
- splitterBinary the location of the GENIA sentence splitter binary
Support for the GENIA tagger within GATE is handled by the Tagger Framework which is documented in Section 23.3.
Together these two components in a GATE pipeline provide a biomedical equivalent of ANNIE (minus the orthographic coreference component). Such a pipeline is provided as an example within the GENIA plugin (see footnote 4).
For more details on the GENIA tagger and its performance over biomedical text see [Tsuruoka et al. 05].
16.1.8 Penn BioTagger
The Penn BioTagger software suite provides a biomedical tokenizer and three taggers: for gene entities [McDonald & Pereira 05], genomic variation entities [McDonald et al. 04] and malignancy type entities [Jin et al. 06]. All four components are available within GATE via the Tagger_PennBio plugin.
The tokenizer PR is configured through two parameters, one init and one runtime, as follows:
- tokenizerURL this init parameter specifies the location of the tokenizer model to use (the default value points to the model distributed with the Penn BioTagger suite)
- annotationSetName this runtime parameter determines the annotation set in which Token annotations will be created
All three taggers are configured in the same way, via one init parameter and two runtime parameters, as follows:
- modelURL the location of the model used by the tagger
- inputASName the annotation set to use as input to the tagger (must contain Token annotations)
- outputASName the annotation set in which new annotations are created via the tagger
16.1.9 MutationFinder
The MutationFinder PR is configured via a single init parameter:
- regexURL this init parameter specifies the location of the regular expression file used by MutationFinder. Note that the default value points to the file supplied with MutationFinder.
Once created the runtime behaviour of the PR can be controlled via the following runtime parameter:
- annotationSetName the name of the annotation set in which the Mutation annotations should be created
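MutationFinder works by matching text against the regular expressions in its regex file. The sketch below shows the general idea with a single, deliberately simple pattern for point mutations written as wild-type residue, position, mutant residue (for example A123T). The pattern and the function name are hypothetical stand-ins; the expressions supplied with MutationFinder are considerably more elaborate.

```python
import re

# One-letter codes for the 20 standard amino acids.
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern: residue, 1-based position, residue, e.g. "A123T".
POINT_MUTATION = re.compile(rf"\b([{AA1}])([1-9]\d*)([{AA1}])\b")


def find_mutations(text):
    """Return (wild type, position, mutant) tuples found in the text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


print(find_mutations("The A123T and G45D variants reduced binding."))
```

In the PR, each such match would become a Mutation annotation in the set named by annotationSetName.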
NormaGene is a web service, provided by the BiTeM group in Geneva. The service provides tools for both gene tagging and normalization, although currently only tagging is supported by this GATE wrapper.
The NormaGene Tagger PR is configured via two runtime parameters as follows:
- annotationSetName the name of the annotation set in which the Gene annotations should be created.
- threshold the threshold at which an entity will be considered a gene (defaults to 0.6). Lower the threshold for short text input to get better results. Lowering the threshold helps to find more complex gene names in the text, but it also increases the time taken to process the text.
4 The plugin contains a saved application, genia.xgapp, which includes both components. The runtime parameters of both components will need changing to point to your locally installed copies of the GENIA applications.