GATE.ac.uk - sale/tao/splitch13.html

Chapter 13
Gazetteers [#]

13.1 Introduction to Gazetteers [#]

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to ﬁnd occurrences of these names in text, e.g. for the task of named entity recognition. The word ‘gazetteer’ is often used interchangeably for both the set of entity lists and for the processing resource that makes use of those lists to ﬁnd occurrences of the names in text.

When a gazetteer processing resource is run on a document, annotations of type Lookup are created for each matching string in the text. Gazetteers usually do not depend on Tokens or on any other annotation and instead ﬁnd matches based on the textual content of the document. (the Flexible Gazetteer, described in section 13.6, being the exception to the rule). This means that an entry may span more than one word and may start or end within a word. If a gazetteer that directly works on text does respect word boundaries, the way how word boundaries are found might diﬀer from the way the GATE tokeniser ﬁnds word boundaries. A Lookup annotation will only be created if the entire gazetteer entry is matched in the text. The details of how gazetteer entries match text depend on the gazetteer processing resource and its parameters. In this chapter, we will cover several gazetteers.

13.2 ANNIE Gazetteer [#]

The rest of this introductory section describes the ANNIE Gazetteer which is part of ANNIE and also described in section 6.3. The ANNIE gazetteer is part of and provided by the ANNIE plugin.

Each individual gazetteer list is a plain text ﬁle, with one entry per line.

Below is a section of the list for units of currency:

 Ecu
 European Currency Units
 FFr
 Fr
 German mark
 German marks
 New Taiwan dollar
 New Taiwan dollars
 NT dollar
 NT dollars

An index ﬁle (usually called lists.def) is used to describe all such gazetteer list ﬁles that belong together. Each gazetteer list should reside in the same directory as the index ﬁle.

The gazetteer index ﬁles describes for each list the major type and optionally, a minor type, a language and an annotation type, separated by colons. In the example below, the ﬁrst column refers to the list name, the second column to the major type, the third to the minor type, the fourth column to the language and the ﬁfth column to the annotation type. These lists are compiled into ﬁnite state machines. Any text strings matched by these machines will be annotated with features specifying the major and minor types.

currency_prefix.lst:currency_unit:pre_amount
currency_unit.lst:currency_unit:post_amount
date.lst:date:specific_date::Date
day.lst:date:day
monthen.lst:date:month:en
monthde.lst:date:month:de
season.lst:date:season

The major and minor type as well as the language will be added as features to only Lookup annotation generated from a matching entry from the respective list. For example, if an entry from the currency_unit.lst gazetteer list matches some text in a document, the gazetteer processing resource will generate a Lookup annotation spanning the matching text and assign the features major="currency_unit" and minor="post_amount" to that annotation.

By default the ANNIE Gazetteer PR creates Lookup annotations. However, if a user has speciﬁed a speciﬁc annotation type for a list, the Gazetteer uses the speciﬁed annotation type to annotate entries that are part of the speciﬁed list and appear in the document being processed.

Grammar rules (JAPE rules) can specify the types to be identiﬁed in particular circumstances. The major and minor types enable this identiﬁcation to take place, by giving access to items stored in particular lists or combinations of lists.

For example, if a day needs to be identiﬁed, the minor type ‘day’ would be speciﬁed in the grammar, in order to match only information about speciﬁc days. If any kind of date needs to be identiﬁed, the major type ‘date’ would be speciﬁed. This might include weeks, months, years etc. as well as days of the week, and would give access to all the items stored in day.lst, month.lst, season.lst, and date.lst in the example shown.

13.2.1 Creating and Modifying Gazetteer Lists

Gazetteer lists can be modiﬁed using any text editor or an editor inside GATE when you double-click on the gazetteer in the resources tree. Use of an editor that can edit Unicode UTF-8 ﬁles (e.g. the GATE Unicode editor) is advised, however, in order to ensure that the lists are stored as UTF-8, which will minimise any language encoding problems, particularly if e.g. accents, umlauts or characters from non-Latin scripts are present.

To create a new list, simply add an entry for that list to the deﬁnitions ﬁle and add the new list in the same directory as the existing lists.

After any modiﬁcations have been made in an external editor, ensure that you reinitialise the gazetteer PR in GATE, if one is already loaded, before rerunning your application.

13.2.2 ANNIE Gazetteer Editor [#]

Figure 13.1: ANNIE Gazetteer Editor

To open this edior, double-click on the gazetteer in the resources tree.

It is composed of two tables:

a left table with 5 columns (List name, Major, Minor, Language, Annotation type) for the index, usually a .def ﬁle
a right table with 1+2*n columns (Value, Feature 1, Value 1...Feature n, Value n) for the lists, usually .lst ﬁles

When selecting a list in the left table you get its content displayed in the right table.

You can sort both tables by clicking on their column headers. A text ﬁeld ‘Filter’ at the bottom of the right table allows to display only the rows that contain the expression you typed.

To edit a value in a table, double click on a cell or press F2 then press Enter when ﬁnished editing the cell. To add a new row in both tables use the text ﬁeld at the top and press Enter or use the ‘New’ button next to it. When adding a new list you can select from the list of existing gazetteer lists in the current directory or type a new ﬁle name. To delete a row, press Shift+Delete or use the context menu. To delete more than one row select them before.

You can reload a modiﬁed list by selecting it and right-clicking for the context menu item ‘Reload List’ or by pressing Control+R. When a list is modiﬁed its name in the left table is coloured in red.

If you have set ‘gazetteerFeatureSeparator’ parameter then the right table will show a ‘Feature’ and ‘Value’ columns for each feature. To add a new couple of columns use the button ‘Add Cols’.

Note that in the left table, you can only select one row at a time.

The gazetteer like other language resource has a context menu in the resources tree to ‘Reinitialise’, ‘Save’ or ‘Save as...’ the resource.

The right table has a context menu for the current selection to help you creating new gazetteer. It is similar with the actions found in a spreadsheet application like ‘Fill Down Selection’, ‘Clear Selection’, ‘Copy Selection’, ‘Paste Selection’, etc.

13.3 OntoGazetteer [#]

The Ontogazetteer, or Hierarchical Gazetteer, is a processing resource which can associate the entities from a speciﬁc gazetteer list with a class in a GATE ontology language resource. The OntoGazetteer assigns classes rather than major or minor types, and is aware of mappings between lists and class IDs. The Gaze visual resource can display the lists, ontology mappings and the class hierarchy of the ontology for a OntoGazetteer processing resource and provides ways of editing these components.

13.4 Gaze Ontology Gazetteer Editor [#]

This section describes the Gaze gazetteer editor when it displays an OntoGazetteer processing resource. The editor consists of two parts: one for the editing of the lists and the mapping of lists and one for editing the ontology. These two parts are described in the following subsections.

13.4.1 The Gaze Gazetteer List and Mapping Editor

This is a VR for editing the gazetteer lists, and mapping them to classes in an ontology. It provides load/store/edit for the lists, load/store/edit for the mapping information, loading of ontologies, load/store/edit for the linear deﬁnition ﬁle, and mapping of the lists ﬁle to the major type, minor type and language.

Left pane: A single ontology is visualized in the left pane of the VR. The mapping between a list and a class is displayed by showing the list as a subclass with a diﬀerent icon. The mapping is speciﬁed by drag and drop from the linear deﬁnition pane (in the middle) and/or by right click menu.

Middle pane: The middle pane displays the nodes/lines in the linear deﬁnition ﬁle. By double clicking on a node the corresponding list is opened. Editing of the line/node is done by right clicking and choosing edit: a dialogue appears (lower part of the scheme) allowing the modiﬁcation of the members of the node.

Right pane: In the right pane a single gazetteer list is displayed. It can be edited and parts of it can be cut/copied/pasted.

13.4.2 The Gaze Ontology Editor

Note: to edit ontologies within gate, the more recent ontology viewer editor provided by the Ontology_Tools which provides many more features can be used, see section 14.4.

This is a VR for editing the class hierarchy of an ontology. it provides storing to and loading from RDF/RDFS, and provides load/edit/store of the class hierarchy of an ontology.

Left pane: The various ontologies loaded are listed here. On double click or right click and edit from the menu the ontology is visualized in the Right pane.

Right pane: Besides the visualization of the class hierarchy of the ontology the following operations are allowed:

expanding/collapsing parts of the ontology
adding a class in the hierarchy: by right clicking on the intended parent of the new class and choosing add sub class.
removing a class: via right clicking on the class and choosing remove.

As a result of this VR, the ontology deﬁnition ﬁle is aﬀected/altered.

13.5 Hash Gazetteer [#]

The Hash Gazetteer is a gazetteer implemented by the OntoText Lab ( http://www.ontotext.com/). Its implementation is based on simple lookup in several java.util.HashMap objects, and is inspired by the strange idea of Atanas Kiryakov, that searching in HashMaps may be faster than in a Finite State Machine (FSM). The Hash Gazetteer processing resource is part of the ANNIE plugin.

This gazetteer processing resource is implemented in the following way: Every phrase i.e. every list entry is separated into several parts. The parts are determined by the whitespaces lying among them; e.g., the phrase “form is emptiness” has three parts: “form”, “is”, and “emptiness”. There is also a list of HashMaps: mapsList which has as many elements as the longest (in terms of ‘count of parts’) phrase in the lists. So the ﬁrst part of a phrase is placed in the ﬁrst map. The ﬁrst part + space + second part is placed in the second map, etc. The full phrase is placed in the appropriate map, and a reference to a Lookup object is attached to it.

On ﬁrst sight it seems that this algorithm is certainly much more memory-consuming than a ﬁnite state machine (FSM) with the parts of the phrases as transitions, but this is actually not so important since the average length of the phrases (in parts) in the lists is 1.1. On the other hand, one advantage of the algorithm is that, although unconventional, it takes less memory and may be slightly faster, especially if you have a very large gazetteer (e.g., 100,000s of entries).

13.5.1 Prerequisites

The phrases to be recognised should be listed in a set of ﬁles, one for each type of occurrence (as for the standard gazetteer).

The gazetteer is built with the information from a ﬁle that contains the set of lists (which are ﬁles as well) and the associated type for each list. The ﬁle deﬁning the set of lists should have the following syntax: each list deﬁnition should be written on its own line and should contain:

the ﬁle name (required)
the major type (required)
the minor type (optional)
the language(s) (optional)

The elements of each deﬁnition are separated by ‘:’. The following is an example of a valid deﬁnition:

personmale.lst:person:male:english

Each ﬁle named in the lists deﬁnition ﬁle is just a list containing one entry per line.

When this gazetteer is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes speciﬁed in the deﬁnition ﬁle.

13.5.2 Parameters

The Hash Gazetteer processing resource allows the speciﬁcation of the following parameters when it is created:

caseSensitive:: this can be switched between true and false to indicate if matches should be done in a case-sensitive way.
encoding:: the encoding of the gazetteer lists
listsURL:: the URL of the list deﬁnitions (index) ﬁle, i.e. the ﬁle that contains the ﬁlenames, major types and optionally minor types and languages of all the list ﬁles.

There is one run-time parameter, annotationSetName that allows the speciﬁcation of the annotation set in which the Lookup annotations will be created. If nothing is speciﬁed the default annotation set will be used.

Note that the Hash Gazetteer does not have the longestMatchOnly and wholeWordsOnly parameters; if you need to conﬁgure these options, you should use the another gazetteer that supports them, such as the standard ANNIE Gazetteer (see section 13.2).

13.6 Flexible Gazetteer [#]

The Flexible Gazetteer provides users with the ﬂexibility to choose their own customized input and an external Gazetteer. For example, the user might want to replace words in the text with their base forms (which is an output of the Morphological Analyser) before running the Gazetteer.

The Flexible Gazetteer performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. It is important to use an external gazetteer as this allows the use of any type of gazetteer (e.g. an Ontological gazetteer).

Input to the Flexible Gazetteer:

Runtime parameters:

Document – the document to be processed
inputASName The annotationSet where the Flexible Gazetteer should search for the AnnotationType.feature speciﬁed in the inputFeatureNames.
outputASName The AnnotationSet where Lookup annotations should be placed.
Creation time parameters:
inputFeatureNames – when selected, these feature values are used to replace the corresponding original text. For each feature, a temporary document is created from the values of the speciﬁed features on the speciﬁed annotation types. For example: for Token.root the temporary document will have content of every Token replaced with its root value. In case of overlapping annotations of the same type in the input, only the value of the ﬁrst annotation is considered. Here, please note that the order of annotations is decided by using the gate.util.OﬀsetComparator class.
gazetteerInst – the actual gazetteer instance, which should run over a temporary document. This generates the Lookup annotations with features. This must be an instance of gate.creole.gazetteer.Gazetteer which has already been created. All such instances will be shown in the dropdown menu for this parameter in GATE Developer.

Once the external gazetteer has annotated text with Lookup annotations, Lookup annotations on the temporary document are converted to Lookup annotations on the original document. Finally the temporary document is deleted.

13.7 Gazetteer List Collector [#]

The gazetteer list collector, found in the Tools plugin, collects occurrences of entities directly from a set of annotated training documents and populates gazetteer lists with the entities. The entity types and structure of the gazetteer lists are deﬁned as necessary by the user. Once the lists have been collected, a semantic grammar can be used to ﬁnd the same entities in new texts.

The target gazetteer must contain a list corresponding exactly to each annotation type to be collection (for example, Person.lst for the Person annotations, Organization.lst for the Organization annotations, etc.). You can use the gazetteer editor to create new empty lists for types that are not already in your gazetteer. Note that if you do this, you will need to “Save and Reinitialise” the gazetteer later (the collector updates the *.lst ﬁles on disk, but not the lists.def ﬁle).

If a list in the gazetteer already contains entries, the collector will add new entries, but it will only collect one occurrence of each new entry; it checks that the entry is not present already before adding it.

There are 4 runtime parameters:

annotationTypes: a list of the annotation types that should be collected
gazetteer: the gazetteer where the results will be stored (this must be already loaded in GATE)
markupASname: the annotation set from which the annotation types should be collected
theLanguage: sets the language feature of the gazetteer lists to be created to the appropriate language (in the case where lists are collected for diﬀerent languages)

Figure 13.2 shows a screenshot of a set of lists collected automatically for the Hindi language. It contains 4 lists: Person, Organisation, Location and a list of stopwords. Each list has a majorType whose value is the type of list, a minorType ‘inferred’ (since the lists have been inferred from the text), and the language ‘Hindi’.

Figure 13.2: Lists collected automatically for Hindi

The list collector also has a facility to split the Person names that it collects into their individual tokens, so that it adds both the entire name to the list, and adds each of the tokens to the list (i.e. each of the ﬁrst names, and the surname) as a separate entry. When the grammar annotates Persons, it can require them to be at least 2 tokens or 2 consecutive Person Lookups. In this way, new Person names can be recognised by combining a known ﬁrst name with a known surname, even if they were not in the training corpus. Where only a single token is found that matches, an Unknown entity is generated, which can later be matched with an existing longer name via the orthomatcher component which performs orthographic coreference between named entities. This same procedure can also be used for other entity types. For example, parts of Organisation names can be combined together in diﬀerent ways. The facility for splitting Person names is hardcoded in the ﬁle gate/src/gate/creole/GazetteerListsCollector.java and is commented.

13.8 OntoRoot Gazetteer [#]

OntoRoot Gazetteer is a type of a dynamically created gazetteer that is, in combination with few other generic GATE resources, capable of producing ontology-based annotations over the given content with regards to the given ontology. This gazetteer is a part of ‘Gazetteer_Ontology_Based’ plugin that has been developed as a part of the TAO project.

13.8.1 How Does it Work? [#]

To produce ontology-based annotations i.e. annotations that link to the speciﬁc concepts or relations from the ontology, it is essential to pre-process the Ontology Resources (e.g., Classes, Instances, Properties) and extract their human-understandable lexicalisations.

As a precondition for extracting human-understandable content from the ontology, ﬁrst a list of the following is being created:

names of all ontology resources i.e. fragment identiﬁers ¹ and
assigned property values for all ontology resources (e.g., label and datatype property values)

Each item from the list is further processed so that:

any name containing dash ("-") or underline ("_") character(s) is processed so that each of these characters is replaced by a blank space. For example, Project_Name or Project-Name would become a Project Name.
any name that is written in camelCase style is actually split into its constituent words, so that ProjectName becomes a Project Name (optional).
any name that is a compound name such as ‘POS Tagger for Spanish’ is split so that both ‘POS Tagger’ and ‘Tagger’ are added to the list for processing. In this example, ‘for’ is a stop word, and any words after it are ignored (optional).

Each item from this list is analysed separately by the Onto Root Application (ORA) on execution (see ﬁgure 13.3). The Onto Root Application ﬁrst tokenises each linguistic term, then assigns part-of-speech and lemma information to each token.

Figure 13.3: Building Ontology Resource Root (OntoRoot) Gazetteer from the Ontology

As a result of that pre-processing, each token in the terms will have additional feature named ‘root’, which contains the lemma as created by the morphological analyser. It is this lemma or a set of lemmas which are then added to the dynamic gazetteer list, created from the ontology.

For instance, if there is a resource with a short name (i.e., fragment identiﬁer) ProjectName, without any assigned properties the created list before executing the OntoRoot gazetteer collection will contain the following strings:

‘ProjectName’,
‘Project Name’ after separating camelCased word and
‘Name’ after applying heuristic rules.

Each of the item from the list is then analysed separately and the results would be the same as the input strings, as all of entries are nouns given in singular form.

13.8.2 Initialisation of OntoRoot Gazetteer [#]

To initialise the gazetteer there are few mandatory parameters:

Ontology to be processed;
CorpusController to process the ontology terms². This application will be run on a document that contains a single Sentence annotation spanning the whole document, and is expected to produce annotations of type Token in the default annotation set, with features category (the POS tag) and root (the morphological root). Typically this pipeline would contain a tokeniser appropriate to the source language, a POS tagger, and a GATE Morphological Analyser PR, but any application that will produce the right annotation types and features will work. For example, when processing non-English text you may need to use an alternative POS tagger such as the Stanford tagger or TreeTagger.

and few optional ones:

useResourceUri, default is set to true - should this gazetteer analyse resource URIs or not;
considerProperties, default is set to true - should this gazetteer consider properties or not;
propertiesToInclude - checked only if considerProperties is set to true - this parameter contains the list of property names (URIs) to be included, comma separated;
propertiesToExclude - checked only if considerProperties is set to true - this parameter contains the list of property names to be excluded, comma separated;
caseSensitive, default set to be false -should this gazetteer diﬀerentiate on case;
separateCamelCasedWords, default set to true - should this gazetteer separate emphcamelCased words, e.g. ‘ProjectName’ into ‘Project Name’;
considerHeuristicRules, default set to false - should this gazetteer consider several heuristic rules or not. Rules include splitting the words containing spaces, and using prepositions as stop words; for example, if ’pos tagger for Spanish’ would be analysed, ‘for’ would be considered as a stop word; heuristically derived would be ‘pos tagger’ and this would be further used to add ‘pos tagger’ to the gazetteer list, with a feature emphheuristical level set to be 0, and ‘tagger’ with emphheuristical level 1; at runtime lower heuristical level should be preferred. NOTE: setting considerHeuristicRules to true can cause a lot of noise for some ontologies and is likely to require implementing an additional ﬁltering resource that will prefer the annotations with the lower heuristic level;

The OntoRoot Gazetteer’s initialization preprocesses strings from the ontology and runs the root ﬁnder application over them. It is possible to re-use the same tokeniser, POS tagger and morphological analyser PR instances in both the root ﬁnder application and the main pipeline that will contain the ﬁnished OntoRoot Gazetteer, but in this case the PRs must use the default annotation set for output. If you need to use a diﬀerent annotation set for your main pipeline’s output then you will need to create separate PRs speciﬁcally for the root ﬁnder and conﬁgure those to use the default set.

13.8.3 Simple steps to run OntoRoot Gazetteer [#]

OntoRoot Gazetteer is a part of the Gazetteer_Ontology_Based plugin.

Easy way [#]

For a quick start with the OntoRoot Gazetteer, consider running it from the GATE Developer (GATE GUI):

Start GATE
Load a sample application from resources folder (exampleApp.xgapp). This will load CAT App application.
Run CAT App application and open query-doc to see a set of Lookup annotations generated as a result (see Figure 13.4).

Figure 13.4: Sample ontology-based annotation as a result of running OntoRoot Gazetteer. Feature URI refers to the URI of the ontology resource, while type identiﬁes the type of the resource such as class, instance, property, or datatypePropertyValue

Hard way [#]

OntoRoot Gazetteer can easily be set up to be used with any ontology. To generate a GATE application which demonstrates the use of the OntoRoot Gazetteer, follow these steps:

Start GATE
Load necessary plugins: Click on Manage CREOLE plugins and check the following:
- Tools
- Ontology
- Ontology_Based_Gazetteer
- Ontology_Tools (optional); this parameter is required in order to view ontology using the GATE Ontology Editor.
- ANNIE.
Make sure that these plugins are loaded from GATE/plugins/[plugin_name] folder.
Load an ontology. Right click on Language Resource, and select the last option to create an OWLIM Ontology LR. Specify the format of the ontology, for example rdfXmlURL, and give the correct path to the ontology: either the absolute path on your local machine such as c:/myOntology.owl or the URL such as http://gate.ac.uk/ns/gate-ontology. Specify the name such as myOntology (this is optional).
Create Processing Resources: Right click on the Processing Resource and create the following PRs (with default parameters):
- Document Reset PR
- ANNIE English Tokeniser
- ANNIE POS Tagger
- GATE Morphological Analyser
- RegEx Sentence Splitter (or ANNIE Sentence Splitter)
Place the tokeniser, POS tagger and morphological analyser PRs into a new “corpus pipeline” application, named “Root ﬁnder”.
Create an Onto Root Gazetteer and set the init parameters. Mandatory ones are:
- ontology: select previously created myOntology;
- rootFinderApplication: select the “Root ﬁnder” pipeline you created above.
OntoRoot gazetteer is quite ﬂexible in that it can be conﬁgured using the optional parameters. List of all parameters is detailed in Section 13.8.2.
When all parameters are set click OK. It can take some time to initialise OntoRoot Gazetteer. For example, loading GATE knowledge base from http://gate.ac.uk/ns/gate-kb takes around 6-15 seconds. Larger ontologies can take much longer.
Create another PR which is a Flexible Gazetteer. As init parameters it is mandatory to select previously created OntoRoot Gazetteer for gazetteerInst. For another parameter, inputFeatureNames, click on the button on the right and when prompt with a window, add ’Token.root’ in the provided textbox, then click Add button. Click OK, give name to the new PR (optional) and then click OK.
Create an application. Right click on Application, then create a new Corpus Pipeline (or Conditional Corpus Pipeline). Add the following PRs to the application in this particular order:
- Document Reset PR
- RegEx Sentence Splitter (or ANNIE Sentence Splitter)
- ANNIE English Tokeniser
- ANNIE POS Tagger
- GATE Morphological Analyser
- Flexible Gazetteer
The tokeniser, POS tagger and morphological analyser may be the same ones used in the root ﬁnder application, or they may be diﬀerent (and must be diﬀerent if you want to use an annotation set other than the default one for this pipeline’s PRs).
Create a document to process with the new application; for example, if the ontology was http://gate.ac.uk/ns/gate-kb, then the document could be the GATE home page: http://gate.ac.uk. Run application and then investigate the results further. All annotations are of type Lookup, with additional features that give details about the resources they are referring to in the given ontology.

13.9 Large KB Gazetteer [#]

Note that from GATE 8.5 onwards the ontology plugin must have been loaded before you can load this plugin. The plugin also does not appear in the default list. It can be added by providing it’s Maven coordinates through the plugin manager. See https://github.com/GateNLP/gateplugin-Gazetteer_LKB for more information about the plugin and how to load it.

The large KB gazetteer provides support for ontology-aware NLP. You can load any ontology from RDF and then use the gazetteer to obtain lookup annotations that have both instance and class URI. Alternately, the PR can read in LST and DEF ﬁles in the same format as the Default Gazetteer. For both data sources, the Large KB Gazetteer PR allows to use large gazetteer lists and speeds up subsequent loading of the data by caching.

The large KB gazetteer is available as the plugin Gazetteer_LKB.

The current version of the large KB gazetteer does not use GATE ontology language resources. Instead, it uses its own mechanism to load and process ontologies.

The Large KB gazetteer grew from a component in the semantic search platform Ontotext KIM. The gazetteer was developed by people from the KIM team. You may ﬁnd the name kim left in several places in the source code, documentation or source ﬁles.

13.9.1 Quick usage overview

To use the Large KB gazetteer, set up your dictionary ﬁrst. The dictionary is a folder with some conﬁguration ﬁles. Use the samples at GATE_HOME/plugins/Gazetteer_LKB/samples as a guide. There are samples for loading the gazetteer data from a local repository, a remote repository or from a set of list ﬁles as conﬁgured in a .def ﬁle.
Load GATE_HOME/plugins/Gazetteer_LKB as a CREOLE plugin. See Section 3.5 for details.
Create a new ‘Large KB Gazetteer’ processing resource (PR). Put the folder of the dictionary you created in the ‘dictionaryPath’ parameter. You can leave the rest of the parameters as defaults.
Add the PR to your GATE application. The gazetteer doesn’t require a tokenizer or the output of any other processing resources.
The gazetteer will create annotations with type ‘Lookup’ and two features; ‘inst’, which contains the URI of the ontology instance, and ‘class’ which contains the URI of the ontology class that instance belongs to. If the gazetteer was loaded from gazetteer list ﬁles, the ‘inst’ and ‘class’ features are set from the ‘minorType’ and ‘majorType’ settings for the list ﬁle or from the ‘inst’ and ‘class’ features stored for each entry in the list ﬁle (see below).

13.9.2 Dictionary setup

The dictionary is a folder with some conﬁguration ﬁles. You can ﬁnd samples at GATE_HOME/plugins/Gazetteer_LKB/samples.

Setting up your own dictionary is easy. You need to deﬁne your RDF ontology and then specify a SPARQL or SERQL query that will retrieve a subset of that ontology as a dictionary.

conﬁg.ttl is a Turtle RDF ﬁle which conﬁgures a local RDF ontology or connection to a remote Sesame RDF database.

If you want to see examples of how to use local RDF ﬁles, please check samples/dictionary_from_local_ontology/conﬁg.ttl. The Sesame repository conﬁguration section conﬁgures a local Ontotext SwiftOWLIM database that loads a list of RDF ﬁles. Simply create a list of your RDF ﬁles and reuse the rest of the conﬁguration. The sample conﬁguration support datasets with 10,000,000 triples with acceptable performance. For working with larger datasets, advanced users can substitute SwiftOWLIM with another Sesame RDF engine. In that case, make sure you add the necessary JARs to the list in GATE_HOME/plugins/Gazetteer_LKB/creole.xml. For example, Ontotext BigOWL is a Sesame RDF engine that can load billions of triples on desktop hardware.

Since any Sesame repository can be conﬁgured in conﬁg.ttl, the Large KB Gazetteer can extract dictionaries from all signiﬁcant RDF databases. See the page on database compatibility for more information.

query.txt contains a SPARQL query. You can write any query you like, as long as its projection contains at least two columns in the following order: label and instance. As an option, you can also add a third column for the ontology class of the RDF entity. Below you can see a sample query, which creates a dictionary from the names and the unique identiﬁers of 10,000 entertainers in DbPedia.

PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?Name ?Person WHERE {
        ?Person a opencyc:Entertainer ; rdfs:label ?Name .
    FILTER (lang(?Name) = "en")
} LIMIT 10000

Try this query at the Linked Data Semantic Repository.

When you load the dictionary conﬁguration in GATE for the ﬁrst time, it creates a binary snapshot of the dictionary. Thereafter it will load only this binary snapshot. If the dictionary conﬁguration is changed, the snapshot will be reinitialized automatically. For more information, please see the dictionary lifecycle speciﬁcation.

13.9.3 Additional dictionary conﬁguration

The conﬁg.ttl may contain additional dictionary conﬁguration. Such conﬁguration concerns only the initial loading of the dictionary from the RDF database. The options are still being determined and more will appear in future versions. They must be placed below the repository conﬁguration section as attributes of a dictionary conﬁguration. Here is a sample conﬁg.ttl ﬁle with additional conﬁguration.

# Sesame configuration template for a (proxy for a) remote repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix hr: <http://www.openrdf.org/config/repository/http#>.
@prefix lkbg: <http://www.ontotext.com/lkb_gazetteer#>.

[] a rep:Repository ;
   rep:repositoryImpl [
      rep:repositoryType "openrdf:HTTPRepository" ;
      hr:repositoryURL <http://ldsr.ontotext.com/openrdf-sesame/repositories/owlim>
   ];
   rep:repositoryID "owlim" ;
   rdfs:label "LDSR" .

[] a lkbg:DictionaryConfiguration ;
  lkbg:caseSensitivity "CASE_INSENSITIVE" .

13.9.4 Dictionary for Gazetteer List Files

In order to load the gazetteer from gazetteer list ﬁles, place a ﬁle with extension .def in the dictionary directory. This ﬁle should have the same format as for the Default Gazetteer, however the ﬁelds for language and annotation type are ignored. The values for the ﬁelds ‘majorType’ and ‘minorType’ are converted to URIs by prepending ‘urn:’ and the URI from ‘majorType’ is used for assigning the ‘class’ feature, the URI from ‘minorType’ for assigning the ‘inst’ feature of all annotations created from the list. These valus can be overwritted by features from the individual entries in a list ﬁle, if the ‘gazetteerFeatureSeparator’ conﬁguaration option is set to a non-blank value in the conﬁguration ﬁle. In that case, each line in a list ﬁle is split by the separator and the ﬁrst part is used as the gazetteer entry. The remaining parts are interpreted as feature/value pairs of the form feature=value. If a feature ‘inst’ is found the value is used as the inst URI for that entry, overwriting the URI taken from the ‘minorType’ of the list ﬁle. If a feature ‘class’ is found, the value is used as the class URI for that entry, overwriting the URI taken from the ‘majorType’ of the list ﬁle.

The config.ttl ﬁle for a loading the gazetteer from list ﬁles should have the following content (the example shows a tab character to be used as the feature separator for list ﬁles):

@prefix lkbg: <http://www.ontotext.com/lkb_gazetteer#>.
lkbg:DictionaryConfiguration
   lkbg:caseSensitivity "caseinsensitive" ;
   lkbg:caching "enabled" ;
   lkbg:ignoreList "ignoreList.txt" ;
   lkbg:gazetteerFeatureSeparator "\t" .

13.9.5 Processing Resource Conﬁguration

The following options can be set when the gazetteer PR is initialized:

dictionaryPath; the dictionary folder described above.
forceCaseSensitive; whether the gazetteer should return case-sensitive matches regardless of the loaded dictionary.

13.9.6 Runtime conﬁguration

annotationSetName - The annotation set, which will receive the generated lookup annotations.
annotationLimit - The maximum number of the generated annotations. NULL or 0 for no limit. Setting limit of the number of the created annotations will reduce the memory consumption of GATE on large documents. Note that GATE documents consume gigabytes of memory if there are tens of thousands of annotations in the document. All PRs that create large number of annotations like the gazetteers and tokenizers may cause an Out Of Memory error on large texts. Setting that option limits the amount of memory that the gazetteer will use.

13.9.7 Semantic Enrichment PR [#]

The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories. It is a companion to the large KB gazetteer that showcases the usefulness of using Linked Data URI as identiﬁers.

Here a semantic annotation is an annotation that is linked to an RDF entity by having the URI of the entity in the ‘inst’ feature of the annotation. For all such annotation of a given type, this PR runs a SPARQL query against the deﬁned repository and puts a comma-separated list of the values mentioned in the query output in the ‘connections’ feature of the same annotation.

There is a sample pipeline that features the Semantic Enrichment PR.

Parameters

inputASName; the annotation set, which annotation will be processed.
server; the URL of the Sesame 2 HTTP repository. Support for generic SPARQL endpoints can be implemented if required.
repositoryId; the ID of the Sesame repository.
annotationTypes; a list of types of annotation that will be processed.
query; a SPARQL query pattern. The query will be processed like this - String.format(query, uriFromAnnotation), so you can use parameters like %s or %1$s.
deleteOnNoRelations; whether we want to delete the annotation that weren’t enriched. Helps to clean up the input annotations.

13.10 The Shared Gazetteer for multithreaded processing [#]

The DefaultGazetteer (and its subclasses such as the OntoRootGazetteer) compiles its gazetteer data into a ﬁnite state matcher at initialization time. For large gazetteers this FSM requires a considerable amount of memory. However, once the FSM has been built then (as long as you do not modify it dynamically using Gaze) it is accessed in a read-only manner at runtime. For a multi-threaded application that requires several identical copies of its processing resources (see section 7.14), GATE provides a mechanism whereby a single compiled FSM can be shared between several gazetteer PRs that can then be executed concurrently in diﬀerent threads, saving the memory that would otherwise be required to load the lists several times.

This feature is not available in the GATE Developer GUI, as it is only intended for use in embedded code. To make use of it, ﬁrst create a single instance of the regular DefaultGazetteer or OntoRootGazetteer:

FeatureMap params = Factory.newFeatureMap();
params.put("listsUrl", listsDefLocation);
LanguageAnalyser mainGazetteer = (LanguageAnalyser)Factory.createResource(
    "gate.creole.gazetteer.DefaultGazetteer", params);

Then create any number of SharedDefaultGazetteer instances, passing this regular gazetteer as a parameter:

FeatureMap params = Factory.newFeatureMap();
params.put("bootstrapGazetteer", mainGazetteer);
LanguageAnalyser sharedGazetteer = (LanguageAnalyser)Factory.createResource(
    "gate.creole.gazetteer.SharedDefaultGazetteer", params);

The SharedDefaultGazetteer instance will re-use the FSM that was built by the mainGazetteer instead of loading its own.

13.11 Extended Gazetteer [#]

The ExtendedGazetteer is part of the StringAnnotation plugin and works similar to the default ANNIE Gazetteer PR, but with the following changes and additions:

it scales to very large gazetters by implementing an optimized compressed trie data structure which allows to ﬁt huge gazetteer lists into memory. Unlike the LKB Gazetteer it still allows arbitrary features per entry.
it automatically properly supports duplication, the loaded gazetteer gets automatically shared between all duplicates.
the compressed data structre is cashed to disk, so loading the gazetteer is faster once the cashe ﬁle has been created
it can be used to match either the original document text (as the default ANNIE Gazetteer) or the "virtual document view" created from the values of some feature for some annotation type (similar to how the Flexible Gazetteer works)
it requires word annotations and non-word/space annotations (e.g. the Token/SpaceToken annotations created by ANNIE) and uses those to determine word boundaries in the document (instead of the whitespace patterns used by the default ANNIE gazetteer). In addition it is possible to deﬁne a "Split" annotation type so that a gazetteer match will never cross or include a split annotation (e.g. to prevent matches across sentence boundaries).
All ﬁles have to use UTF-8 encoding by default
The default feature separator is a tab character, not a colon
For case-insensitive matches, the upper-case version of strings from the gazetteer list and the document are compared, and the language for converting to upper case can be speciﬁed (e.g. in German, a ß character is converted to "SS").

More detailed documentation of the ExtendedGazetteer is available online at https://gatenlp.github.io/gateplugin-StringAnnotation/ExtendedGazetteer.

13.12 Feature Gazetteer [#]

The FeatureGazetteer is part of the StringAnnotation plugin and can be used to match annotation features against a gazetteer list and if there is a match perform one of the following actions:

add new features from the gazetteer entry to the annotation
remove that annotation
add a new annotation

More detailed documentation of the FeatureGazetteer is available online at https://gatenlp.github.io/gateplugin-StringAnnotation/FeatureGazetteer.

¹An ontology resource is usually identiﬁed by an URI concatenated with a set of characters starting with ‘#’. This set of characters is called fragment identiﬁer. For example, if the URI of a class representing GATE POS Tagger is: ’http://gate.ac.uk/ns/gate-ontology#POSTagger’, the fragment identiﬁer will be ’POSTagger’.

²In previous versions of GATE the gazetteer took three separate parameters for the tokeniser, POS tagger and morphological analyser. Existing saved applications that use these parameters will still work in GATE 8.0.

[next] [front] [up]

Chapter 13Gazetteers [#]