GATE.ac.uk - releases/gate-5.1-beta1-build3397-ALL/doc/tao/splitch9.html

Chapter 9
ANNIC: ANNotations-In-Context [#]

ANNIC (ANNotations-In-Context) is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Data-stores, called Searchable Serial Data-store (SSD).

ANNIC can index documents in any format supported by the GATE system (i.e., XML, HTML, RTF, e-mail, text, etc). Compared with other such query systems, it has additional features addressing issues such as extensive indexing of linguistic information associated with document content, independent of document format. It also allows indexing and extraction of information from overlapping annotations and features. Its advanced graphical user interface provides a graphical view of annotation markups over the text, along with an ability to build new queries interactively. In addition, ANNIC can be used as a first step in rule development for NLP systems as it enables the discovery and testing of patterns in corpora.

ANNIC is built on top of the Apache Lucene¹ – a high performance full-featured search engine implemented in Java, which supports indexing and search of large document collections. Our choice of IR engine is due to the customisability of Lucene. For more details on how Lucene was modified to meet the requirements of indexing and querying annotations, please refer to [Aswani et al. 05].

As explained earlier, SSD is an extension of the serial data-store. In addition to the persist location, SSD asks user to provide some more information (explained later) that it uses to index the documents. Once the SSD has been initiated, user can add/remove documents/corpora to the SSD in a similar way it is done with other data-stores. When documents are added to the SSD, it automatically tries to index them. It updates the index whenever there is a change in any of the documents stored in the SSD and removes the document from the index if it is deleted from the SSD. Be warned that only the annotation sets, types and features initially provided during the SSD creation time, will be updated when adding/removing documents to the datastore.

SSD has an advanced graphical interface that allows users to issue queries over the SSD. Below we explain the parameters required by SSD and how to instantiate it, how to use its graphical interface and how to use SSD programmatically.

9.1 Instantiating SSD [#]

Steps:

In GATE Developer, right click on “Data Stores” and select “Create datastore”.
From a drop-down list select “Lucene Based Searchable DataStore”.
Here, you will see an input window. Please provide these parameters:
1. DataStore URL: Select an empty folder where the DS is created.
2. Index Location: Select an empty folder. This is where the index will be created.
3. Annotation Sets: Here, you can provide one or more annotation sets that you wish to index or exclude from being indexed. In order to be able to index the default annotation set, you must click on the edit list icon and add an empty field to the list. If there are no annotation sets provided, all the annotation sets in all documents are indexed.
4. Base-Token Type: (e.g. Token or Key.Token) These are the basic tokens of any document. Your documents must have the annotations of Base-Token-Type in order to get indexed. These basic tokens are used for displaying contextual information while searching patterns in the corpus. In case of indexing more than one annotation set, user can specify the annotation set from which the tokens should be taken (e.g. Key.Token- annotations of type Token from the annotation set called Key). In case user does not provide any annotation set name (e.g. Token), the system searches in all the annotation sets to be indexed and the base-tokens from the first annotation set with the base token annotations are taken. Please note that the documents with no base-tokens are not indexed. However, if the ”create tokens automatically” option is selected, the SSD creates base-tokens automatically. Here, each string delimited with white space is considered as a token.
5. Index Unit Type: (e.g. Sentence, Key.Sentence) This specifies the unit of Index. In other words, annotations lying within the boundaries of these annotations are indexed (e.g. in the case of “Sentences”, no annotations that are spanned across the boundaries of two sentences are considered for indexing). User can specify from which annotation set the index unit annotations should be considered. If user does not provide any annotation set, the SSD searches among all annotation sets for index units. If this field is left empty or SSD fails to locate index units, the entire document is considered as a single unit.
6. Features: Finally, users can specify the annotation types and features that should be indexed or excluded from being indexed. (e.g. SpaceToken and Split). If user wants to exclude only a specific feature of a specific annotation type, he/she can specify it using a ’.’ separator between the annotation type and its feature (e.g. Person.matches).
Click OK. If all parameters are OK, a new empty DS will be created.
Create an empty corpus and save it to the SSD.
Populate it with some documents. Each document added to the corpus and eventually to the SSD is indexed automatically. If the document does not have the required annotations, that document is skipped and not indexed.

9.2 Search GUI [#]

Figure 9.1:

Searchable Serial Datastore Viewer.

9.2.1 Overview

Figure 9.1 gives a snapshot of the GUI. The top section contains a text area to write a query, options to select the input data and the output format and two icons to execute and delete a query. The central section shows a graphical visualisation of annotations and values of the result selected in the bottom results table. You can also see the annotation rows manager window where you define which annotation type and feature to display in the central section. The bottom section contains the results table of the query, i.e. the text that matches the query with their left and right contexts, the annotation set and the document. The bottom section contains also a tabbed panes of statistics.

9.2.2 Syntax of Queries [#]

SSD enables you to formulate versatile queries using JAPE patterns. JAPE patterns support various query formats. Below, we give examples of JAPE pattern clauses which can be used as SSD queries. Actual queries can also be a combination of one or more of the following pattern clauses:

String
{AnnotationType}
{AnnotationType == String}
{AnnotationType.feature == feature value}
{AnnotationType1, AnnotationType2.feature == featureValue}
{AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}

JAPE patterns also support the | (OR) operator. For instance, {A} ({B}|{C}) is a pattern of two annotations where the first is an annotation of type A followed by the annotation of type either B or C. ANNIC supports two operators, + and *, to specify the number of times a particular annotation or a sub pattern should appear in the main query pattern. Here, ({A})+n means one and up to n occurrences of annotation {A} and ({A})*n means zero or up to n occurrences of annotation {A}.

Below we explain the steps to search in SSD.

Double click on SSD. You will see an extra tab “Lucene DataStore Searcher”. Click on it to activate the searcher GUI.
Here you can specify a query to search in your SSD. The query here is a L.H.S. part of the JAPE grammar. Please refer to the following example queries:
1. {Person} – This will return annotations of type Person from the SSD
2. {Token.string == “Microsoft”} – This will return all occurrences of “Microsoft” from the SSD.
3. {Person}({Token})*2{Organization} – Person followed by zero or upto two tokens followed by Organization.
4. {Token.orth==“upperInitial”, Organization} – Token with feature orth with value set to “upperInitial” and which is also annotated as Organization.

Figure 9.2:

Searchable Serial Datastore Viewer - Auto-completion.

9.2.3 Top Section [#]

A text-area located in the top left part of the GUI is used to input a query. You can copy/cut/paste with Control+C/X/V, undo/redo your changes with Control+Z/Y as usual. To add a new line, use Control+Enter combination keys.

Auto-completion shown on the figure 9.2 for annotation type is triggered when typing ’{’ and for feature when typing ’.’ after a valid annotation type. It shows only the annotation types and features related to the selected corpus and annotation set. If you right-click on an expression it will automatically select the shortest valid enclosing brace and if you click on a selection it will propose you to add quantifiers for allowing the expression to appear zero, one or more times.

To execute the query, click on the magnifying glass icon, use Enter key or Alt+Enter combination keys. To delete the query, click on the trash icon or use Alt+Backspace combination keys.

It is possible to have more than one corpus, each containing a different set of documents, stored in a single data-store. ANNIC, by providing a drop down box with a list of stored corpora, also allows searching within a specific (selected) corpus. Similarly a document can have more than one annotation set indexed and therefore ANNIC also provides a drop down box with a list of indexed annotation sets for the selected corpus.

A large corpus can have many hits for a given query. This may take a long time to refresh the GUI and may create inconvenience while browsing through patterns. ANNIC therefore allows you to specify a number of patterns that you wish to retrieve at once and provides a way to iterate through next pages with the Next Page of Results button. Due to technical complexities, it is not possible to visit a previous page. It is however possible to tick a check-box for retrieving all the results at the same time.

9.2.4 Central Section [#]

Annotation types and features to show can be selected from the annotation rows manager by clicking on the Modify Rows button in the central section. When you choose to show a feature of an annotation (e.g. feature category for annotation type Token), the central section shows colored rectangles exactly below the spans of text where these annotations occur in the selected pattern. If user only selects an annotation type, the rantangle remains empty. When the user hovers his/her mouse over the rectangle, it shows all their features in a popup window. If the user selects both, annotation type and a feature, the value of that feature is shown in the rectangle.

Shortcuts are expression that stand for an ”AnnotationType.Feature” expression. For example, on the figure 9.1, the shortcut ”POS” stands for the expression ”Token.category”. The purpose is to make the query more readable.

When you left-clicks on any of the rectangles of the annotations rows, the respective query expression is placed at the caret position in the query text area. If user has selected anything in the query text area, it gets replaced. You can also click on a word on the first line to add it to the query.

9.2.5 Bottom Section [#]

In the table of results, ANNIC shows patterns retrieved from the SSD and shows the query that the selected pattern refers to.

Along with its left and right context texts, it also lists the names of the document and the annotation set that the patterns come from. When the focus changes from one row to another, the central section is updated accordingly. You can sort a table column by clicking on its header.

You can remove a result from the results table or open the document containing it by right-clicking on a result in the results table.

ANNIC provides an Export button to export results into an HTML file. User can either export all the results or the selected ones by selecting the relevant rows in the table of results.

A statistics tabbed pane can be displayed on the bottom-right by clicking on the Statistics button. There is always a global statistics pane that list the count of the occurrences of all annotation types for the selected corpus and annotation set.

Statistics can be obtained for matched spans of the query in the results, with or without contexts, just by annotation type, an annotation type + feature or an annotation type + feature + value. A second pane contains the one item statistics that you can add by right-clicking on a non empty rectangle or on the header of a row in the central section. You can sort a table column by clicking on its header.

9.3 Using SSD from GATE Embedded [#]

//how to instantiate a searchabledatastore
===============================

// create an instance of datastore
LuceneDataStoreImpl ds = (LuceneDataStoreImpl)
Factory.createDataStore(‘‘gate.persist.LuceneDataStoreImpl’’, dsLocation);

// we need to set Indexer
Indexer indexer = new LuceneIndexer(new URL(indexLocation));

// set the parameters
Map parameters = new HashMap();

// specify the index url
parameters.put(Constants.INDEX_LOCATION_URL, new URL(indexLocation));

// specify the base token type
// and specify that the tokens should be created automatically
// if not found in the document
parameters.put(Constants.BASE_TOKEN_ANNOTATION_TYPE, ‘‘Token’’);
parameters.put(Constants.CREATE_TOKENS_AUTOMATICALLY, new Boolean(true));

// specify the index unit type
parameters.put(Constants.INDEX_UNIT_ANNOTATION_TYPE, ‘‘Sentence’’);

// specifying the annotation sets "Key" and "Default Annotation Set"
// to be indexed
List<String> setsToInclude = new ArrayList<String>();
setsToInclude.add("Key");
setsToInclude.add("<null>");
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_INCLUDE, setsToInclude);
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_EXCLUDE, new ArrayList<String>());

// all features should be indexed
parameters.put(Constants.FEATURES_TO_INCLUDE, new ArrayList<String>());
parameters.put(Constants.FEATURES_TO_EXCLUDE, new ArrayList<String>());

// set the indexer
ds.setIndexer(indexer, parameters);

// set the searcher
ds.setSearcher(new LuceneSearcher());

//how to search in this datastore
//======================
// obtain the searcher instance
Searcher searcher = ds.getSearcher();
Map parameters = new HashMap();

// obtain the url of index
String indexLocation =
new File(((URL) ds.getIndexer().getParameters().get(Constants.INDEX_LOCATION_URL))
.getFile()).getAbsolutePath();
ArrayList indexLocations = new ArrayList();
indexLocations.add(indexLocation);

// corpus2SearchIn = mention corpus name that was indexed here.

// the annotation set to search in
String annotationSet2SearchIn = "Key";

// set the parameter
parameters.put(Constants.INDEX_LOCATIONS,indexLocations);
parameters.put(Constants.CORPUS_ID, corpus2SearchIn);
parameters.put(Constants.ANNOTATION_SET_ID, annotationSet);
parameters.put(Constants.CONTEXT_WINDOW, contextWindow);
parameters.put(Constants.NO_OF_PATTERNS, noOfPatterns);

// search
String query = ‘‘{Person}’’;
Hit[] hits = searcher.search(query, parameters);

¹http://lucene.apache.org

[next] [prev] [prev-tail] [front] [up]