Chapter 13
Gazetteers [#]
13.1 Introduction to Gazetteers [#]
A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition. The word ‘gazetteer’ is often used interchangeably for both the set of entity lists and for the processing resource that makes use of those lists to find occurrences of the names in text.
When a gazetteer processing resource is run on a document, annotations of type Lookup are created for each matching string in the text. Gazetteers usually do not depend on Tokens or on any other annotation and instead find matches based on the textual content of the document. (the Flexible Gazetteer, described in section 13.6, being the exception to the rule). This means that an entry may span more than one word and may start or end within a word. If a gazetteer that directly works on text does respect word boundaries, the way how word boundaries are found might differ from the way the GATE tokeniser finds word boundaries. A Lookup annotation will only be created if the entire gazetteer entry is matched in the text. The details of how gazetteer entries match text depend on the gazetteer processing resource and its parameters. In this chapter, we will cover several gazetteers.
13.2 ANNIE Gazetteer [#]
The rest of this introductory section describes the ANNIE Gazetteer which is part of ANNIE and also described in section 6.3. The ANNIE gazetteer is part of and provided by the ANNIE plugin.
Each individual gazetteer list is a plain text file, with one entry per line.
Below is a section of the list for units of currency:
Ecu European Currency Units FFr Fr German mark German marks New Taiwan dollar New Taiwan dollars NT dollar NT dollars
An index file (usually called lists.def) is used to describe all such gazetteer list files that belong together. Each gazetteer list should reside in the same directory as the index file.
The gazetteer index files describes for each list the major type and optionally, a minor type, a language and an annotation type, separated by colons. In the example below, the first column refers to the list name, the second column to the major type, the third to the minor type, the fourth column to the language and the fifth column to the annotation type. These lists are compiled into finite state machines. Any text strings matched by these machines will be annotated with features specifying the major and minor types.
currency_prefix.lst:currency_unit:pre_amount currency_unit.lst:currency_unit:post_amount date.lst:date:specific_date::Date day.lst:date:day monthen.lst:date:month:en monthde.lst:date:month:de season.lst:date:season
The major and minor type as well as the language will be added as features to only Lookup annotation generated from a matching entry from the respective list. For example, if an entry from the currency_unit.lst gazetteer list matches some text in a document, the gazetteer processing resource will generate a Lookup annotation spanning the matching text and assign the features major="currency_unit" and minor="post_amount" to that annotation.
By default the ANNIE Gazetteer PR creates Lookup annotations. However, if a user has specified a specific annotation type for a list, the Gazetteer uses the specified annotation type to annotate entries that are part of the specified list and appear in the document being processed.
Grammar rules (JAPE rules) can specify the types to be identified in particular circumstances. The major and minor types enable this identification to take place, by giving access to items stored in particular lists or combinations of lists.
For example, if a day needs to be identified, the minor type ‘day’ would be specified in the grammar, in order to match only information about specific days. If any kind of date needs to be identified, the major type ‘date’ would be specified. This might include weeks, months, years etc. as well as days of the week, and would give access to all the items stored in day.lst, month.lst, season.lst, and date.lst in the example shown.
13.2.1 Creating and Modifying Gazetteer Lists
Gazetteer lists can be modified using any text editor or an editor inside GATE when you double-click on the gazetteer in the resources tree. Use of an editor that can edit Unicode UTF-8 files (e.g. the GATE Unicode editor) is advised, however, in order to ensure that the lists are stored as UTF-8, which will minimise any language encoding problems, particularly if e.g. accents, umlauts or characters from non-Latin scripts are present.
To create a new list, simply add an entry for that list to the definitions file and add the new list in the same directory as the existing lists.
After any modifications have been made in an external editor, ensure that you reinitialise the gazetteer PR in GATE, if one is already loaded, before rerunning your application.
13.2.2 ANNIE Gazetteer Editor [#]
To open this edior, double-click on the gazetteer in the resources tree.
It is composed of two tables:
-
a left table with 5 columns (List name, Major, Minor, Language, Annotation type) for the index, usually a .def file
-
a right table with 1+2*n columns (Value, Feature 1, Value 1...Feature n, Value n) for the lists, usually .lst files
When selecting a list in the left table you get its content displayed in the right table.
You can sort both tables by clicking on their column headers. A text field ‘Filter’ at the bottom of the right table allows to display only the rows that contain the expression you typed.
To edit a value in a table, double click on a cell or press F2 then press Enter when finished editing the cell. To add a new row in both tables use the text field at the top and press Enter or use the ‘New’ button next to it. When adding a new list you can select from the list of existing gazetteer lists in the current directory or type a new file name. To delete a row, press Shift+Delete or use the context menu. To delete more than one row select them before.
You can reload a modified list by selecting it and right-clicking for the context menu item ‘Reload List’ or by pressing Control+R. When a list is modified its name in the left table is coloured in red.
If you have set ‘gazetteerFeatureSeparator’ parameter then the right table will show a ‘Feature’ and ‘Value’ columns for each feature. To add a new couple of columns use the button ‘Add Cols’.
Note that in the left table, you can only select one row at a time.
The gazetteer like other language resource has a context menu in the resources tree to ‘Reinitialise’, ‘Save’ or ‘Save as...’ the resource.
The right table has a context menu for the current selection to help you creating new gazetteer. It is similar with the actions found in a spreadsheet application like ‘Fill Down Selection’, ‘Clear Selection’, ‘Copy Selection’, ‘Paste Selection’, etc.
13.3 OntoGazetteer [#]
The Ontogazetteer, or Hierarchical Gazetteer, is a processing resource which can associate the entities from a specific gazetteer list with a class in a GATE ontology language resource. The OntoGazetteer assigns classes rather than major or minor types, and is aware of mappings between lists and class IDs. The Gaze visual resource can display the lists, ontology mappings and the class hierarchy of the ontology for a OntoGazetteer processing resource and provides ways of editing these components.
13.4 Gaze Ontology Gazetteer Editor [#]
This section describes the Gaze gazetteer editor when it displays an OntoGazetteer processing resource. The editor consists of two parts: one for the editing of the lists and the mapping of lists and one for editing the ontology. These two parts are described in the following subsections.
13.4.1 The Gaze Gazetteer List and Mapping Editor
This is a VR for editing the gazetteer lists, and mapping them to classes in an ontology. It provides load/store/edit for the lists, load/store/edit for the mapping information, loading of ontologies, load/store/edit for the linear definition file, and mapping of the lists file to the major type, minor type and language.
Left pane: A single ontology is visualized in the left pane of the VR. The mapping between a list and a class is displayed by showing the list as a subclass with a different icon. The mapping is specified by drag and drop from the linear definition pane (in the middle) and/or by right click menu.
Middle pane: The middle pane displays the nodes/lines in the linear definition file. By double clicking on a node the corresponding list is opened. Editing of the line/node is done by right clicking and choosing edit: a dialogue appears (lower part of the scheme) allowing the modification of the members of the node.
Right pane: In the right pane a single gazetteer list is displayed. It can be edited and parts of it can be cut/copied/pasted.
13.4.2 The Gaze Ontology Editor
Note: to edit ontologies within gate, the more recent ontology viewer editor provided by the Ontology_Tools which provides many more features can be used, see section 14.4.
This is a VR for editing the class hierarchy of an ontology. it provides storing to and loading from RDF/RDFS, and provides load/edit/store of the class hierarchy of an ontology.
Left pane: The various ontologies loaded are listed here. On double click or right click and edit from the menu the ontology is visualized in the Right pane.
Right pane: Besides the visualization of the class hierarchy of the ontology the following operations are allowed:
-
expanding/collapsing parts of the ontology
-
adding a class in the hierarchy: by right clicking on the intended parent of the new class and choosing add sub class.
-
removing a class: via right clicking on the class and choosing remove.
As a result of this VR, the ontology definition file is affected/altered.
13.5 Hash Gazetteer [#]
The Hash Gazetteer is a gazetteer implemented by the OntoText Lab ( http://www.ontotext.com/). Its implementation is based on simple lookup in several java.util.HashMap objects, and is inspired by the strange idea of Atanas Kiryakov, that searching in HashMaps may be faster than in a Finite State Machine (FSM). The Hash Gazetteer processing resource is part of the ANNIE plugin.
This gazetteer processing resource is implemented in the following way: Every phrase i.e. every list entry is separated into several parts. The parts are determined by the whitespaces lying among them; e.g., the phrase “form is emptiness” has three parts: “form”, “is”, and “emptiness”. There is also a list of HashMaps: mapsList which has as many elements as the longest (in terms of ‘count of parts’) phrase in the lists. So the first part of a phrase is placed in the first map. The first part + space + second part is placed in the second map, etc. The full phrase is placed in the appropriate map, and a reference to a Lookup object is attached to it.
On first sight it seems that this algorithm is certainly much more memory-consuming than a finite state machine (FSM) with the parts of the phrases as transitions, but this is actually not so important since the average length of the phrases (in parts) in the lists is 1.1. On the other hand, one advantage of the algorithm is that, although unconventional, it takes less memory and may be slightly faster, especially if you have a very large gazetteer (e.g., 100,000s of entries).
13.5.1 Prerequisites
The phrases to be recognised should be listed in a set of files, one for each type of occurrence (as for the standard gazetteer).
The gazetteer is built with the information from a file that contains the set of lists (which are files as well) and the associated type for each list. The file defining the set of lists should have the following syntax: each list definition should be written on its own line and should contain:
-
the file name (required)
-
the major type (required)
-
the minor type (optional)
-
the language(s) (optional)
The elements of each definition are separated by ‘:’. The following is an example of a valid definition:
personmale.lst:person:male:english
Each file named in the lists definition file is just a list containing one entry per line.
When this gazetteer is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes specified in the definition file.
13.5.2 Parameters
The Hash Gazetteer processing resource allows the specification of the following parameters when it is created:
-
caseSensitive:
-
this can be switched between true and false to indicate if matches should be done in a case-sensitive way.
-
encoding:
-
the encoding of the gazetteer lists
-
listsURL:
-
the URL of the list definitions (index) file, i.e. the file that contains the filenames, major types and optionally minor types and languages of all the list files.
There is one run-time parameter, annotationSetName that allows the specification of the annotation set in which the Lookup annotations will be created. If nothing is specified the default annotation set will be used.
Note that the Hash Gazetteer does not have the longestMatchOnly and wholeWordsOnly parameters; if you need to configure these options, you should use the another gazetteer that supports them, such as the standard ANNIE Gazetteer (see section 13.2).
13.6 Flexible Gazetteer [#]
The Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer. For example, the user might want to replace words in the text with their base forms (which is an output of the Morphological Analyser) before running the Gazetteer.
The Flexible Gazetteer performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. It is important to use an external gazetteer as this allows the use of any type of gazetteer (e.g. an Ontological gazetteer).
Input to the Flexible Gazetteer:
Runtime parameters:
-
Document – the document to be processed
-
inputASName The annotationSet where the Flexible Gazetteer should search for the AnnotationType.feature specified in the inputFeatureNames.
-
outputASName The AnnotationSet where Lookup annotations should be placed.
Creation time parameters:
-
inputFeatureNames – when selected, these feature values are used to replace the corresponding original text. For each feature, a temporary document is created from the values of the specified features on the specified annotation types. For example: for Token.root the temporary document will have content of every Token replaced with its root value. In case of overlapping annotations of the same type in the input, only the value of the first annotation is considered. Here, please note that the order of annotations is decided by using the gate.util.OffsetComparator class.
-
gazetteerInst – the actual gazetteer instance, which should run over a temporary document. This generates the Lookup annotations with features. This must be an instance of gate.creole.gazetteer.Gazetteer which has already been created. All such instances will be shown in the dropdown menu for this parameter in GATE Developer.
Once the external gazetteer has annotated text with Lookup annotations, Lookup annotations on the temporary document are converted to Lookup annotations on the original document. Finally the temporary document is deleted.
13.7 Gazetteer List Collector [#]
The gazetteer list collector, found in the Tools plugin, collects occurrences of entities directly from a set of annotated training documents and populates gazetteer lists with the entities. The entity types and structure of the gazetteer lists are defined as necessary by the user. Once the lists have been collected, a semantic grammar can be used to find the same entities in new texts.
The target gazetteer must contain a list corresponding exactly to each annotation type to be collection (for example, Person.lst for the Person annotations, Organization.lst for the Organization annotations, etc.). You can use the gazetteer editor to create new empty lists for types that are not already in your gazetteer. Note that if you do this, you will need to “Save and Reinitialise” the gazetteer later (the collector updates the *.lst files on disk, but not the lists.def file).
If a list in the gazetteer already contains entries, the collector will add new entries, but it will only collect one occurrence of each new entry; it checks that the entry is not present already before adding it.
There are 4 runtime parameters:
-
annotationTypes: a list of the annotation types that should be collected
-
gazetteer: the gazetteer where the results will be stored (this must be already loaded in GATE)
-
markupASname: the annotation set from which the annotation types should be collected
-
theLanguage: sets the language feature of the gazetteer lists to be created to the appropriate language (in the case where lists are collected for different languages)
Figure 13.2 shows a screenshot of a set of lists collected automatically for the Hindi language. It contains 4 lists: Person, Organisation, Location and a list of stopwords. Each list has a majorType whose value is the type of list, a minorType ‘inferred’ (since the lists have been inferred from the text), and the language ‘Hindi’.
The list collector also has a facility to split the Person names that it collects into their individual tokens, so that it adds both the entire name to the list, and adds each of the tokens to the list (i.e. each of the first names, and the surname) as a separate entry. When the grammar annotates Persons, it can require them to be at least 2 tokens or 2 consecutive Person Lookups. In this way, new Person names can be recognised by combining a known first name with a known surname, even if they were not in the training corpus. Where only a single token is found that matches, an Unknown entity is generated, which can later be matched with an existing longer name via the orthomatcher component which performs orthographic coreference between named entities. This same procedure can also be used for other entity types. For example, parts of Organisation names can be combined together in different ways. The facility for splitting Person names is hardcoded in the file gate/src/gate/creole/GazetteerListsCollector.java and is commented.
13.8 OntoRoot Gazetteer [#]
OntoRoot Gazetteer is a type of a dynamically created gazetteer that is, in combination with few other generic GATE resources, capable of producing ontology-based annotations over the given content with regards to the given ontology. This gazetteer is a part of ‘Gazetteer_Ontology_Based’ plugin that has been developed as a part of the TAO project.
13.8.1 How Does it Work? [#]
To produce ontology-based annotations i.e. annotations that link to the specific concepts or relations from the ontology, it is essential to pre-process the Ontology Resources (e.g., Classes, Instances, Properties) and extract their human-understandable lexicalisations.
As a precondition for extracting human-understandable content from the ontology, first a list of the following is being created:
-
names of all ontology resources i.e. fragment identifiers 1 and
-
assigned property values for all ontology resources (e.g., label and datatype property values)
Each item from the list is further processed so that:
-
any name containing dash ("-") or underline ("_") character(s) is processed so that each of these characters is replaced by a blank space. For example, Project_Name or Project-Name would become a Project Name.
-
any name that is written in camelCase style is actually split into its constituent words, so that ProjectName becomes a Project Name (optional).
-
any name that is a compound name such as ‘POS Tagger for Spanish’ is split so that both ‘POS Tagger’ and ‘Tagger’ are added to the list for processing. In this example, ‘for’ is a stop word, and any words after it are ignored (optional).
Each item from this list is analysed separately by the Onto Root Application (ORA) on execution (see figure 13.3). The Onto Root Application first tokenises each linguistic term, then assigns part-of-speech and lemma information to each token.
As a result of that pre-processing, each token in the terms will have additional feature named ‘root’, which contains the lemma as created by the morphological analyser. It is this lemma or a set of lemmas which are then added to the dynamic gazetteer list, created from the ontology.
For instance, if there is a resource with a short name (i.e., fragment identifier) ProjectName, without any assigned properties the created list before executing the OntoRoot gazetteer collection will contain the following strings:
-
‘ProjectName’,
-
‘Project Name’ after separating camelCased word and
-
‘Name’ after applying heuristic rules.
Each of the item from the list is then analysed separately and the results would be the same as the input strings, as all of entries are nouns given in singular form.
13.8.2 Initialisation of OntoRoot Gazetteer [#]
To initialise the gazetteer there are few mandatory parameters:
-
Ontology to be processed;
-
CorpusController to process the ontology terms2. This application will be run on a document that contains a single Sentence annotation spanning the whole document, and is expected to produce annotations of type Token in the default annotation set, with features category (the POS tag) and root (the morphological root). Typically this pipeline would contain a tokeniser appropriate to the source language, a POS tagger, and a GATE Morphological Analyser PR, but any application that will produce the right annotation types and features will work. For example, when processing non-English text you may need to use an alternative POS tagger such as the Stanford tagger or TreeTagger.
and few optional ones:
-
useResourceUri, default is set to true - should this gazetteer analyse resource URIs or not;
-
considerProperties, default is set to true - should this gazetteer consider properties or not;
-
propertiesToInclude - checked only if considerProperties is set to true - this parameter contains the list of property names (URIs) to be included, comma separated;
-
propertiesToExclude - checked only if considerProperties is set to true - this parameter contains the list of property names to be excluded, comma separated;
-
caseSensitive, default set to be false -should this gazetteer differentiate on case;
-
separateCamelCasedWords, default set to true - should this gazetteer separate emphcamelCased words, e.g. ‘ProjectName’ into ‘Project Name’;
-
considerHeuristicRules, default set to false - should this gazetteer consider several heuristic rules or not. Rules include splitting the words containing spaces, and using prepositions as stop words; for example, if ’pos tagger for Spanish’ would be analysed, ‘for’ would be considered as a stop word; heuristically derived would be ‘pos tagger’ and this would be further used to add ‘pos tagger’ to the gazetteer list, with a feature emphheuristical level set to be 0, and ‘tagger’ with emphheuristical level 1; at runtime lower heuristical level should be preferred. NOTE: setting considerHeuristicRules to true can cause a lot of noise for some ontologies and is likely to require implementing an additional filtering resource that will prefer the annotations with the lower heuristic level;
The OntoRoot Gazetteer’s initialization preprocesses strings from the ontology and runs the root finder application over them. It is possible to re-use the same tokeniser, POS tagger and morphological analyser PR instances in both the root finder application and the main pipeline that will contain the finished OntoRoot Gazetteer, but in this case the PRs must use the default annotation set for output. If you need to use a different annotation set for your main pipeline’s output then you will need to create separate PRs specifically for the root finder and configure those to use the default set.
13.8.3 Simple steps to run OntoRoot Gazetteer [#]
OntoRoot Gazetteer is a part of the Gazetteer_Ontology_Based plugin.
Easy way [#]
For a quick start with the OntoRoot Gazetteer, consider running it from the GATE Developer (GATE GUI):
-
Start GATE
-
Load a sample application from resources folder (exampleApp.xgapp). This will load CAT App application.
-
Run CAT App application and open query-doc to see a set of Lookup annotations generated as a result (see Figure 13.4).
Hard way [#]
OntoRoot Gazetteer can easily be set up to be used with any ontology. To generate a GATE application which demonstrates the use of the OntoRoot Gazetteer, follow these steps:
-
Start GATE
-
Load necessary plugins: Click on Manage CREOLE plugins and check the following:
-
Tools
-
Ontology
-
Ontology_Based_Gazetteer
-
Ontology_Tools (optional); this parameter is required in order to view ontology using the GATE Ontology Editor.
-
ANNIE.
Make sure that these plugins are loaded from GATE/plugins/[plugin_name] folder.
-
-
Load an ontology. Right click on Language Resource, and select the last option to create an OWLIM Ontology LR. Specify the format of the ontology, for example rdfXmlURL, and give the correct path to the ontology: either the absolute path on your local machine such as c:/myOntology.owl or the URL such as http://gate.ac.uk/ns/gate-ontology. Specify the name such as myOntology (this is optional).
-
Create Processing Resources: Right click on the Processing Resource and create the following PRs (with default parameters):
-
Document Reset PR
-
ANNIE English Tokeniser
-
ANNIE POS Tagger
-
GATE Morphological Analyser
-
RegEx Sentence Splitter (or ANNIE Sentence Splitter)
Place the tokeniser, POS tagger and morphological analyser PRs into a new “corpus pipeline” application, named “Root finder”.
-
-
Create an Onto Root Gazetteer and set the init parameters. Mandatory ones are:
-
ontology: select previously created myOntology;
-
rootFinderApplication: select the “Root finder” pipeline you created above.
OntoRoot gazetteer is quite flexible in that it can be configured using the optional parameters. List of all parameters is detailed in Section 13.8.2.
When all parameters are set click OK. It can take some time to initialise OntoRoot Gazetteer. For example, loading GATE knowledge base from http://gate.ac.uk/ns/gate-kb takes around 6-15 seconds. Larger ontologies can take much longer.
-
-
Create another PR which is a Flexible Gazetteer. As init parameters it is mandatory to select previously created OntoRoot Gazetteer for gazetteerInst. For another parameter, inputFeatureNames, click on the button on the right and when prompt with a window, add ’Token.root’ in the provided textbox, then click Add button. Click OK, give name to the new PR (optional) and then click OK.
-
Create an application. Right click on Application, then create a new Corpus Pipeline (or Conditional Corpus Pipeline). Add the following PRs to the application in this particular order:
-
Document Reset PR
-
RegEx Sentence Splitter (or ANNIE Sentence Splitter)
-
ANNIE English Tokeniser
-
ANNIE POS Tagger
-
GATE Morphological Analyser
-
Flexible Gazetteer
The tokeniser, POS tagger and morphological analyser may be the same ones used in the root finder application, or they may be different (and must be different if you want to use an annotation set other than the default one for this pipeline’s PRs).
-
-
Create a document to process with the new application; for example, if the ontology was http://gate.ac.uk/ns/gate-kb, then the document could be the GATE home page: http://gate.ac.uk. Run application and then investigate the results further. All annotations are of type Lookup, with additional features that give details about the resources they are referring to in the given ontology.
13.9 Large KB Gazetteer [#]
Note that from GATE 8.5 onwards the ontology plugin must have been loaded before you can load this plugin. The plugin also does not appear in the default list. It can be added by providing it’s Maven coordinates through the plugin manager. See https://github.com/GateNLP/gateplugin-Gazetteer_LKB for more information about the plugin and how to load it.
The large KB gazetteer provides support for ontology-aware NLP. You can load any ontology from RDF and then use the gazetteer to obtain lookup annotations that have both instance and class URI. Alternately, the PR can read in LST and DEF files in the same format as the Default Gazetteer. For both data sources, the Large KB Gazetteer PR allows to use large gazetteer lists and speeds up subsequent loading of the data by caching.
The large KB gazetteer is available as the plugin Gazetteer_LKB.
The current version of the large KB gazetteer does not use GATE ontology language resources. Instead, it uses its own mechanism to load and process ontologies.
The Large KB gazetteer grew from a component in the semantic search platform Ontotext KIM. The gazetteer was developed by people from the KIM team. You may find the name kim left in several places in the source code, documentation or source files.
13.9.1 Quick usage overview
-
To use the Large KB gazetteer, set up your dictionary first. The dictionary is a folder with some configuration files. Use the samples at GATE_HOME/plugins/Gazetteer_LKB/samples as a guide. There are samples for loading the gazetteer data from a local repository, a remote repository or from a set of list files as configured in a .def file.
-
Load GATE_HOME/plugins/Gazetteer_LKB as a CREOLE plugin. See Section 3.5 for details.
-
Create a new ‘Large KB Gazetteer’ processing resource (PR). Put the folder of the dictionary you created in the ‘dictionaryPath’ parameter. You can leave the rest of the parameters as defaults.
-
Add the PR to your GATE application. The gazetteer doesn’t require a tokenizer or the output of any other processing resources.
-
The gazetteer will create annotations with type ‘Lookup’ and two features; ‘inst’, which contains the URI of the ontology instance, and ‘class’ which contains the URI of the ontology class that instance belongs to. If the gazetteer was loaded from gazetteer list files, the ‘inst’ and ‘class’ features are set from the ‘minorType’ and ‘majorType’ settings for the list file or from the ‘inst’ and ‘class’ features stored for each entry in the list file (see below).
13.9.2 Dictionary setup
The dictionary is a folder with some configuration files. You can find samples at GATE_HOME/plugins/Gazetteer_LKB/samples.
Setting up your own dictionary is easy. You need to define your RDF ontology and then specify a SPARQL or SERQL query that will retrieve a subset of that ontology as a dictionary.
config.ttl is a Turtle RDF file which configures a local RDF ontology or connection to a remote Sesame RDF database.
If you want to see examples of how to use local RDF files, please check samples/dictionary_from_local_ontology/config.ttl. The Sesame repository configuration section configures a local Ontotext SwiftOWLIM database that loads a list of RDF files. Simply create a list of your RDF files and reuse the rest of the configuration. The sample configuration support datasets with 10,000,000 triples with acceptable performance. For working with larger datasets, advanced users can substitute SwiftOWLIM with another Sesame RDF engine. In that case, make sure you add the necessary JARs to the list in GATE_HOME/plugins/Gazetteer_LKB/creole.xml. For example, Ontotext BigOWL is a Sesame RDF engine that can load billions of triples on desktop hardware.
Since any Sesame repository can be configured in config.ttl, the Large KB Gazetteer can extract dictionaries from all significant RDF databases. See the page on database compatibility for more information.
query.txt contains a SPARQL query. You can write any query you like, as long as its projection contains at least two columns in the following order: label and instance. As an option, you can also add a third column for the ontology class of the RDF entity. Below you can see a sample query, which creates a dictionary from the names and the unique identifiers of 10,000 entertainers in DbPedia.
PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?Name ?Person WHERE { ?Person a opencyc:Entertainer ; rdfs:label ?Name . FILTER (lang(?Name) = "en") } LIMIT 10000
Try this query at the Linked Data Semantic Repository.
When you load the dictionary configuration in GATE for the first time, it creates a binary snapshot of the dictionary. Thereafter it will load only this binary snapshot. If the dictionary configuration is changed, the snapshot will be reinitialized automatically. For more information, please see the dictionary lifecycle specification.
13.9.3 Additional dictionary configuration
The config.ttl may contain additional dictionary configuration. Such configuration concerns only the initial loading of the dictionary from the RDF database. The options are still being determined and more will appear in future versions. They must be placed below the repository configuration section as attributes of a dictionary configuration. Here is a sample config.ttl file with additional configuration.
# Sesame configuration template for a (proxy for a) remote repository # @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>. @prefix rep: <http://www.openrdf.org/config/repository#>. @prefix hr: <http://www.openrdf.org/config/repository/http#>. @prefix lkbg: <http://www.ontotext.com/lkb_gazetteer#>. [] a rep:Repository ; rep:repositoryImpl [ rep:repositoryType "openrdf:HTTPRepository" ; hr:repositoryURL <http://ldsr.ontotext.com/openrdf-sesame/repositories/owlim> ]; rep:repositoryID "owlim" ; rdfs:label "LDSR" . [] a lkbg:DictionaryConfiguration ; lkbg:caseSensitivity "CASE_INSENSITIVE" .
13.9.4 Dictionary for Gazetteer List Files
In order to load the gazetteer from gazetteer list files, place a file with extension .def in the dictionary directory. This file should have the same format as for the Default Gazetteer, however the fields for language and annotation type are ignored. The values for the fields ‘majorType’ and ‘minorType’ are converted to URIs by prepending ‘urn:’ and the URI from ‘majorType’ is used for assigning the ‘class’ feature, the URI from ‘minorType’ for assigning the ‘inst’ feature of all annotations created from the list. These valus can be overwritted by features from the individual entries in a list file, if the ‘gazetteerFeatureSeparator’ configuaration option is set to a non-blank value in the configuration file. In that case, each line in a list file is split by the separator and the first part is used as the gazetteer entry. The remaining parts are interpreted as feature/value pairs of the form feature=value. If a feature ‘inst’ is found the value is used as the inst URI for that entry, overwriting the URI taken from the ‘minorType’ of the list file. If a feature ‘class’ is found, the value is used as the class URI for that entry, overwriting the URI taken from the ‘majorType’ of the list file.
The config.ttl file for a loading the gazetteer from list files should have the following content (the example shows a tab character to be used as the feature separator for list files):
@prefix lkbg: <http://www.ontotext.com/lkb_gazetteer#>. lkbg:DictionaryConfiguration lkbg:caseSensitivity "caseinsensitive" ; lkbg:caching "enabled" ; lkbg:ignoreList "ignoreList.txt" ; lkbg:gazetteerFeatureSeparator "\t" .
13.9.5 Processing Resource Configuration
The following options can be set when the gazetteer PR is initialized:
-
dictionaryPath; the dictionary folder described above.
-
forceCaseSensitive; whether the gazetteer should return case-sensitive matches regardless of the loaded dictionary.
13.9.6 Runtime configuration
-
annotationSetName - The annotation set, which will receive the generated lookup annotations.
-
annotationLimit - The maximum number of the generated annotations. NULL or 0 for no limit. Setting limit of the number of the created annotations will reduce the memory consumption of GATE on large documents. Note that GATE documents consume gigabytes of memory if there are tens of thousands of annotations in the document. All PRs that create large number of annotations like the gazetteers and tokenizers may cause an Out Of Memory error on large texts. Setting that option limits the amount of memory that the gazetteer will use.
13.9.7 Semantic Enrichment PR [#]
The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories. It is a companion to the large KB gazetteer that showcases the usefulness of using Linked Data URI as identifiers.
Here a semantic annotation is an annotation that is linked to an RDF entity by having the URI of the entity in the ‘inst’ feature of the annotation. For all such annotation of a given type, this PR runs a SPARQL query against the defined repository and puts a comma-separated list of the values mentioned in the query output in the ‘connections’ feature of the same annotation.
There is a sample pipeline that features the Semantic Enrichment PR.
Parameters
-
inputASName; the annotation set, which annotation will be processed.
-
server; the URL of the Sesame 2 HTTP repository. Support for generic SPARQL endpoints can be implemented if required.
-
repositoryId; the ID of the Sesame repository.
-
annotationTypes; a list of types of annotation that will be processed.
-
query; a SPARQL query pattern. The query will be processed like this - String.format(query, uriFromAnnotation), so you can use parameters like %s or %1$s.
-
deleteOnNoRelations; whether we want to delete the annotation that weren’t enriched. Helps to clean up the input annotations.
13.10 The Shared Gazetteer for multithreaded processing [#]
The DefaultGazetteer (and its subclasses such as the OntoRootGazetteer) compiles its gazetteer data into a finite state matcher at initialization time. For large gazetteers this FSM requires a considerable amount of memory. However, once the FSM has been built then (as long as you do not modify it dynamically using Gaze) it is accessed in a read-only manner at runtime. For a multi-threaded application that requires several identical copies of its processing resources (see section 7.14), GATE provides a mechanism whereby a single compiled FSM can be shared between several gazetteer PRs that can then be executed concurrently in different threads, saving the memory that would otherwise be required to load the lists several times.
This feature is not available in the GATE Developer GUI, as it is only intended for use in embedded code. To make use of it, first create a single instance of the regular DefaultGazetteer or OntoRootGazetteer:
FeatureMap params = Factory.newFeatureMap(); params.put("listsUrl", listsDefLocation); LanguageAnalyser mainGazetteer = (LanguageAnalyser)Factory.createResource( "gate.creole.gazetteer.DefaultGazetteer", params);
Then create any number of SharedDefaultGazetteer instances, passing this regular gazetteer as a parameter:
FeatureMap params = Factory.newFeatureMap(); params.put("bootstrapGazetteer", mainGazetteer); LanguageAnalyser sharedGazetteer = (LanguageAnalyser)Factory.createResource( "gate.creole.gazetteer.SharedDefaultGazetteer", params);
The SharedDefaultGazetteer instance will re-use the FSM that was built by the mainGazetteer instead of loading its own.
13.11 Extended Gazetteer [#]
The ExtendedGazetteer is part of the StringAnnotation plugin and works similar to the default ANNIE Gazetteer PR, but with the following changes and additions:
-
it scales to very large gazetters by implementing an optimized compressed trie data structure which allows to fit huge gazetteer lists into memory. Unlike the LKB Gazetteer it still allows arbitrary features per entry.
-
it automatically properly supports duplication, the loaded gazetteer gets automatically shared between all duplicates.
-
the compressed data structre is cashed to disk, so loading the gazetteer is faster once the cashe file has been created
-
it can be used to match either the original document text (as the default ANNIE Gazetteer) or the "virtual document view" created from the values of some feature for some annotation type (similar to how the Flexible Gazetteer works)
-
it requires word annotations and non-word/space annotations (e.g. the Token/SpaceToken annotations created by ANNIE) and uses those to determine word boundaries in the document (instead of the whitespace patterns used by the default ANNIE gazetteer). In addition it is possible to define a "Split" annotation type so that a gazetteer match will never cross or include a split annotation (e.g. to prevent matches across sentence boundaries).
-
All files have to use UTF-8 encoding by default
-
The default feature separator is a tab character, not a colon
-
For case-insensitive matches, the upper-case version of strings from the gazetteer list and the document are compared, and the language for converting to upper case can be specified (e.g. in German, a ß character is converted to "SS").
More detailed documentation of the ExtendedGazetteer is available online at https://gatenlp.github.io/gateplugin-StringAnnotation/ExtendedGazetteer.
13.12 Feature Gazetteer [#]
The FeatureGazetteer is part of the StringAnnotation plugin and can be used to match annotation features against a gazetteer list and if there is a match perform one of the following actions:
-
add new features from the gazetteer entry to the annotation
-
remove that annotation
-
add a new annotation
More detailed documentation of the FeatureGazetteer is available online at https://gatenlp.github.io/gateplugin-StringAnnotation/FeatureGazetteer.
1An ontology resource is usually identified by an URI concatenated with a set of characters starting with ‘#’. This set of characters is called fragment identifier. For example, if the URI of a class representing GATE POS Tagger is: ’http://gate.ac.uk/ns/gate-ontology#POSTagger’, the fragment identifier will be ’POSTagger’.
2In previous versions of GATE the gazetteer took three separate parameters for the tokeniser, POS tagger and morphological analyser. Existing saved applications that use these parameters will still work in GATE 8.0.