
Chapter 5
Language Resources: Corpora, Documents and Annotations

This chapter documents GATE’s model of corpora, documents and annotations on documents. Section 5.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 5.2, Section 5.3 and Section 5.4 describe corpora, documents and annotations on documents respectively. Section 5.5 describes GATE’s support for diverse document formats, and Section 5.5.2 describes facilities for XML input/output.

5.1 Features: Simple Attribute/Value Data

GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java’s Map interface (part of the Collections API).
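
Because feature data is accessed through the Map interface, ordinary Map operations apply. The following is a minimal stand-alone sketch that uses a plain java.util.HashMap in place of a GATE feature map; the class name FeatureMapSketch and the feature names are illustrative assumptions, not part of GATE.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch: a plain java.util.Map playing the role of a feature map.
// It does not use any GATE classes; all names here are hypothetical.
class FeatureMapSketch {

    // Build a map with String attribute names and arbitrary Object values.
    static Map<String, Object> exampleFeatures() {
        Map<String, Object> features = new HashMap<>();
        features.put("kind", "date");                 // a String value
        features.put("confidence", 0.87);             // a boxed Double
        features.put("matches", List.of(23, 25, 30)); // a nested collection
        return features;
    }

    public static void main(String[] args) {
        Map<String, Object> features = exampleFeatures();
        // Values come back as Object, so callers check the type they expect.
        Object kind = features.get("kind");
        if (kind instanceof String) {
            System.out.println("kind = " + kind);
        }
        System.out.println("has confidence? " + features.containsKey("confidence"));
    }
}
```

Since values are typed as Object, code that reads a feature normally has to test or cast the value before using it.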

5.2 Corpora: Sets of Documents plus Features

A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).

Documents have a DocumentContent, which at present is text (future versions may add support for audiovisual content), and one or more AnnotationSets, which are Java Sets.

5.3 Documents: Content plus Annotations plus Features

Documents are modelled as content plus annotations (see Section 5.4) plus features (see Section 5.1). The content of a document can be any subclass of DocumentContent.

5.4 Annotations: Directed Acyclic Graphs

Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g. character offsets.
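
The arc-and-node view can be sketched with a few small classes. This is only an illustration of the data model: the classes Node and Annotation below are hypothetical stand-ins written for this example, not GATE's own interfaces, and the sample data is taken from the single-sentence example in Table 5.1.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the concepts above; these are not GATE's classes.
class AnnotationGraphSketch {

    // A node points into the source document via a character offset.
    static final class Node {
        final int offset;
        Node(int offset) { this.offset = offset; }
    }

    // An annotation is an arc between two nodes, with an ID, a type and features.
    static final class Annotation {
        final int id;
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> features = new HashMap<>();

        Annotation(int id, String type, Node start, Node end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // The text covered by this arc in the given document content.
        String coveredText(String content) {
            return content.substring(start.offset, end.offset);
        }
    }

    // Three annotations over "Cyndi savored the soup." that share nodes.
    static Set<Annotation> exampleAnnotations() {
        Node n0 = new Node(0), n5 = new Node(5), n23 = new Node(23);
        Annotation token = new Annotation(1, "token", n0, n5);
        token.features.put("pos", "NP");
        Annotation name = new Annotation(6, "name", n0, n5);
        name.features.put("name_type", "person");
        Annotation sentence = new Annotation(7, "sentence", n0, n23);
        Set<Annotation> graph = new HashSet<>();
        graph.add(token);
        graph.add(name);
        graph.add(sentence);
        return graph;
    }

    public static void main(String[] args) {
        String content = "Cyndi savored the soup.";
        for (Annotation a : exampleAnnotations()) {
            System.out.println(a.id + " " + a.type + " '" + a.coveredText(content) + "' " + a.features);
        }
    }
}
```

Note that the token and name annotations reuse the same two nodes, which is what makes the structure a graph rather than a list of independent spans.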

5.4.1 Annotation Schemas

Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using GATE Developer to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)

Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.4.6 describes how to create new schemas. Note that each schema file defines a single annotation type; however, it is possible to use include definitions in a schema to refer to other schemas in order to load a whole set of schemas as a group. The default schemas for ANNIE annotation types (defined in resources/schema in the ANNIE plugin) give an example of this technique.

Date Schema

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Date -->
  <element name="Date">
    <complexType>
      <attribute name="kind" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="date"/>
            <enumeration value="time"/>
            <enumeration value="dateTime"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>

Person Schema

<?xml version="1.0"?>
<schema
xmlns="http://www.w3.org/2000/10/XMLSchema">
    <!-- XSchema definition for Person-->
    <element name="Person" />
</schema>

Address Schema

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Address -->
  <element name="Address">
    <complexType>
      <attribute name="kind" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="email"/>
            <enumeration value="url"/>
            <enumeration value="phone"/>
            <enumeration value="ip"/>
            <enumeration value="street"/>
            <enumeration value="postcode"/>
            <enumeration value="country"/>
            <enumeration value="complete"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>

5.4.2 Examples of Annotated Documents

This section shows some simple examples of annotated documents.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with "stand-oﬀ markup" as latterly adopted by the SGML/XML community.

Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte oﬀset) of each character (see TIPSTER Architecture Design Document).

Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end oﬀsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.

The ﬁrst example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.


Text

Cyndi savored the soup.

^0...^5...^10..^15..^20

Annotations

Id	Type	Span Start	Span End	Features
1	token	0	5	pos=NP
2	token	6	13	pos=VBD
3	token	14	17	pos=DT
4	token	18	22	pos=NN
5	token	22	23
6	name	0	5	name_type=person
7	sentence	0	23

Table 5.1: Result of annotation on a single sentence

Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our ﬁrst example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.


Text

Cyndi savored the soup.

^0...^5...^10..^15..^20

Annotations

Id	Type	Span Start	Span End	Features
1	token	0	5	pos=NP
2	token	6	13	pos=VBD
3	token	14	17	pos=DT
4	token	18	22	pos=NN
5	token	22	23
6	name	0	5	name_type=person
7	sentence	0	23	constituents=[1],[2],[3],[4],[5]

Table 5.2: Result of annotations including parse information

In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[ Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of a feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents.

At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 5.3).


Text

To: All Barnyard Animals

^0...^5...^10..^15..^20.

From: Chicken Little

^25..^30..^35..^40..

Date: November 10,1194

...^50..^55..^60..^65.

Subject: Descending Firmament

.^70..^75..^80..^85..^90..^95

Priority: Urgent

.^100.^105.^110.

The sky is falling. The sky is falling.

....^120.^125.^130.^135.^140.^145.^150.

Annotations

Id	Type	Span Start	Span End	Features
1	Addressee	4	24
2	Source	31	45
3	Date	53	69	ddmmyy=101194
4	Subject	78	98
5	Priority	109	115
6	Body	116	155
7	Sentence	116	135
8	Sentence	136	155

Table 5.3: Annotation showing overall document structure

If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields.

Our final example (Table 5.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:


Text

Topster tackles 2 terrorbytes.

^0...^5...^10..^15..^20..^25..

Annotations

Id	Type	Span Start	Span End	Features
1	token	0	7	pos=NP correction=TIPSTER
2	token	8	15	pos=VBZ
3	token	16	17	pos=CD
4	token	18	29	pos=NNS correction=terabytes
5	token	29	30

Table 5.4: Annotation modifying the document

5.4.3 Creating, Viewing and Editing Diverse Annotation Types

Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved inline (‘Save Preserving Format’ in the context menu).

To view and edit annotation types, see Section 3.4. To add annotations of a new type, see Section 3.4.5. To add a new annotation schema, see Section 3.4.6.

5.5 Document Formats

The following document formats are supported by GATE by default:

Plain Text
HTML
SGML
XML
RTF
Email
PDF (some documents)
Microsoft Office (some formats)
OpenOffice (some formats)
UIMA CAS XML format
CoNLL/IOB

Additional formats are provided by plugins; you must load the relevant plugin before attempting to parse these document types:

Twitter JSON (in the Twitter plugin, see section 17.2)
GATE JSON (in the Format_JSON plugin, see section 23.30)
DataSift JSON, a common format for social media data from http://datasift.com (in the Format_DataSift plugin, see section 23.32)
FastInfoset, a compressed binary encoding of GATE XML (in the Format_FastInfoset plugin, see section 23.29)
MediaWiki markup, as used by Wikipedia and many other public wiki sites (in the Format_MediaWiki plugin, see section 23.28)
The formats used by PubMed and the Cochrane collaboration for biomedical literature (in the Format_PubMed plugin, see section 23.27)
CSV ﬁles containing one column of text data and optionally additional columns of metadata (in the Format_CSV plugin, see section 23.33)

By default GATE will try to identify the type of the document, then strip any markup and convert it into GATE’s annotation format. To disable this process, set the markupAware parameter on the document to false.

When reading a document of one of these types, GATE extracts the text between tags (where such exist) and creates a GATE annotation for each tag, filled in as follows:

The name of the tag becomes the annotation’s type, all of the tag’s attributes become the annotation’s features, and the annotation spans the text covered by the tag. A few exceptions to this rule apply for the RTF, Email and Plain Text formats; these are described later in the input section of each of these formats.

The text between tags is extracted and appended to the GATE document’s content, and all annotations created from tags are placed into a GATE annotation set named ‘Original markups’.

Example:

If the markup is like this:

<aTagName attrib1="value1" attrib2="value2" attrib3="value3"> A
piece of text</aTagName>

then the annotation created by GATE will look like:

annotation.type = "aTagName";
annotation.fm = {attrib1=value1;attrib2=value2;attrib3=value3};
annotation.start = startNode;
annotation.end = endNode;

The startNode and endNode are created from offsets referring to the beginning and the end of ‘A piece of text’ in the document’s content.
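
The general idea of turning tags into typed, offset-based spans can be sketched with the JDK's SAX parser. This is an illustration under stated assumptions, not GATE's reader: the classes MarkupUnpackSketch, Span and Result are hypothetical, and the sketch ignores details such as namespaces and whitespace handling.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of markup unpacking; not GATE's implementation.
class MarkupUnpackSketch {

    // One span per element: tag name as type, attributes as features.
    static final class Span {
        final String type;
        final int start;
        int end;
        final Map<String, String> features;
        Span(String type, int start, Map<String, String> features) {
            this.type = type;
            this.start = start;
            this.features = features;
        }
    }

    static final class Result {
        final String text;
        final List<Span> spans;
        Result(String text, List<Span> spans) { this.text = text; this.spans = spans; }
    }

    static Result unpack(String xml) {
        final StringBuilder text = new StringBuilder();
        final List<Span> spans = new ArrayList<>();
        final Deque<Span> open = new ArrayDeque<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                Map<String, String> features = new LinkedHashMap<>();
                for (int i = 0; i < atts.getLength(); i++) {
                    features.put(atts.getQName(i), atts.getValue(i));
                }
                // The span starts at the current end of the extracted text.
                Span span = new Span(qName, text.length(), features);
                open.push(span);
                spans.add(span);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The span ends where the extracted text currently ends.
                open.pop().end = text.length();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = unpack("<aTagName attrib1=\"value1\">A piece of text</aTagName>");
        for (Span s : r.spans) {
            System.out.println(s.type + " [" + s.start + "," + s.end + ") " + s.features);
        }
    }
}
```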

The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is ‘UTF-8’, which is also a storage-efficient encoding of Unicode for text that is mostly ASCII. If, when loading a document in GATE, the encoding parameter is set to ‘’ (the empty string), then the default encoding of the platform will be used.

5.5.1 Detecting the Right Reader

In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:

Document’s extension
The web server’s content type
Magic numbers detection

The first is the extension of the file (xml, htm, html, txt, sgm, rtf, etc.), the second is the HTTP information sent by a web server about the content type of the document it is sending (text/html, text/xml, etc.), and the third is certain sequences of chars which are ultimately number sequences. GATE is capable of supporting multimedia documents if the right reader is added to the framework. Sometimes multimedia documents are identified by a signature consisting of a sequence of numbers; inside GATE these are called magic numbers. For textual documents, certain char sequences form such magic numbers. Examples of magic number sequences are provided in the Input section of each format supported by GATE.

All these tests are applied to each document read, and after that a voting mechanism decides which reader is best to associate with the document. The tests have different priorities. The document’s extension test has the highest priority: if the system is in doubt about which reader to choose, the one associated with the document’s extension will be selected. The next highest priority is given to the web server’s content type, and the third to magic number detection. However, any two tests that identify the same MIME type will have the highest priority in deciding the reader that will be used. The web server test is not always successful, as documents may be loaded from a local file system, and the magic number detection test is not always applicable. The next paragraphs show how these tests are performed and describe the general mechanism behind reader detection.
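
One way to read the voting rule described above is sketched below. It is an interpretation written for illustration only: the class ReaderVoteSketch and its method are hypothetical, and GATE's actual logic may differ in detail. A null argument stands for a test that produced no answer.

```java
// Hypothetical sketch of the voting rule; not GATE's implementation.
class ReaderVoteSketch {

    // Each argument is the MIME type suggested by one test, or null if that
    // test could not be applied or produced no result.
    static String chooseMimeType(String byExtension, String byServer, String byMagic) {
        // Any two tests that agree outrank a single dissenting test.
        if (byExtension != null
                && (byExtension.equals(byServer) || byExtension.equals(byMagic))) {
            return byExtension;
        }
        if (byServer != null && byServer.equals(byMagic)) {
            return byServer;
        }
        // Otherwise fall back on the priority order:
        // extension, then web server content type, then magic numbers.
        if (byExtension != null) {
            return byExtension;
        }
        if (byServer != null) {
            return byServer;
        }
        return byMagic; // may be null: no reader detected
    }

    public static void main(String[] args) {
        System.out.println(chooseMimeType("text/plain", "text/html", "text/html"));
        System.out.println(chooseMimeType("text/xml", null, "text/html"));
    }
}
```

In the first call the server and magic number tests agree and override the extension; in the second no two tests agree, so the extension wins.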

The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps ﬁlled by the init() method of each reader. This method comes with three signatures:

public static DocumentFormat getDocumentFormat(gate.Document aGateDocument, URL url)

public static DocumentFormat getDocumentFormat(gate.Document aGateDocument, String fileSuffix)

public static DocumentFormat getDocumentFormat(gate.Document aGateDocument, MimeType mimeType)

The first two methods try to detect the right MimeType for the GATE document, and then call the third one to return the reader associated with that MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from ‘http://jigsaw.w3.org’ for MIME types.

The magic numbers test is performed using the information from the magic2mimeTypeMap map. Each key from this map is searched for in the first bufferSize (the default value is 2048) chars of text. The method that does this is called runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class. More details about it can be found in the GATE API documentation.
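
The search over the first 2048 chars can be sketched as follows. This is a simplified stand-alone illustration, not the body of runMagicNumbers: the class MagicNumberSketch, its method names and the sample map contents are assumptions, and the match here is a plain case-sensitive substring test.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a magic number search; not GATE's implementation.
class MagicNumberSketch {
    static final int BUFFER_SIZE = 2048;

    // Read at most BUFFER_SIZE chars and return the MIME type of the first
    // magic sequence found in them, or null if none matches.
    static String detect(Reader reader, Map<String, String> magicToMime) throws IOException {
        char[] buffer = new char[BUFFER_SIZE];
        int filled = 0;
        int n;
        while (filled < buffer.length
                && (n = reader.read(buffer, filled, buffer.length - filled)) != -1) {
            filled += n;
        }
        String head = new String(buffer, 0, filled);
        for (Map.Entry<String, String> entry : magicToMime.entrySet()) {
            if (head.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Convenience overload for in-memory content.
    static String detect(String content, Map<String, String> magicToMime) {
        try {
            return detect(new StringReader(content), magicToMime);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sample map using the signatures mentioned in this chapter.
    static Map<String, String> sampleMap() {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("<?xml version=\"1.0\"", "text/xml");
        map.put("<html", "text/html");
        return map;
    }

    public static void main(String[] args) {
        System.out.println(detect("<html><body>Hello</body></html>", sampleMap()));
    }
}
```

A signature that first appears beyond the buffer is not seen, which is one reason the magic numbers test is not always conclusive.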

In order to activate a reader to perform the unpacking, the creole definition of a GATE document defines a parameter called ‘markupAware’, initialized with a default value of true. This parameter forces GATE to detect a proper reader for the document being read. If no reader is found, the document’s content is loaded and presented to the user as it would be in any text editor (this applies to textual documents).

You can also use Tika format auto-detection by setting the mimeType of a document to "application/tika". Then the document will be parsed only by Tika.

The next subsections investigate the particularities of each format and describe the file extensions registered with each document format.

5.5.2 XML

Input

GATE permits the processing of any XML document and offers support for XML namespaces. It benefits from the power of Apache’s Xerces parser and also makes use of Sun’s JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property (‘javax.xml.parsers.SAXParserFactory’).

GATE will accept any well-formed XML document as input. Although it is able to validate XML documents against DTDs, it does not do so, because validation is time-consuming and in many cases produces messages that are not useful to the user.

There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we previously said, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases such as ‘end.</P><P>Start’ it might happen that the ending word of the previous annotation is concatenated with the beginning phrase of the annotation currently being created, resulting in a garbage input for GATE processing resources that operate at the text surface.

Let’s take another example in order to better understand the problem:

<title>This is a title</title><p>This is a paragraph</p><a
href="#link">Here is an useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document’s content will be as follows:

This is a titleThis is a paragraphHere is an useful link

The annotations created will refer to the right parts of the text, but for GATE processing resources that work on this text (tokenizer, gazetteer, etc.) the result is a serious problem. Therefore, in order to prevent this, GATE checks whether unpacking is likely to join words and, if so, inserts a space between them. The text will then look like this after being loaded in GATE Developer:

This is a title This is a paragraph Here is an useful link
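
A simple version of such a check is sketched below. The heuristic used here (insert a space when neither side of the boundary is whitespace) is an assumption made for illustration; the text above does not specify the exact test GATE applies, and the class SpaceOnUnpackSketch is hypothetical.

```java
// Hypothetical sketch of adding a space when unpacking markup; the
// boundary test is an assumed heuristic, not GATE's actual check.
class SpaceOnUnpackSketch {

    // Append a chunk of element text to the document content, optionally
    // separating it from the previous chunk with a single space.
    static void appendChunk(StringBuilder content, String chunk, boolean addSpace) {
        boolean wouldJoin = content.length() > 0
                && !chunk.isEmpty()
                && !Character.isWhitespace(content.charAt(content.length() - 1))
                && !Character.isWhitespace(chunk.charAt(0));
        if (addSpace && wouldJoin) {
            content.append(' ');
        }
        content.append(chunk);
    }

    // Concatenate the text of several consecutive elements.
    static String join(boolean addSpace, String... chunks) {
        StringBuilder content = new StringBuilder();
        for (String chunk : chunks) {
            appendChunk(content, chunk, addSpace);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(true,
                "This is a title", "This is a paragraph", "Here is an useful link"));
    }
}
```

In a real reader the space would have to be appended before the start offset of the next annotation is recorded, so that the annotation still covers exactly the element's own text.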

There are cases when these words are meant to be joined, but they are rare. This is why it’s an open problem. If you need to disable these spaces in GATE Developer, select Options, Conﬁguration, and then the Advanced tab in the conﬁguration dialog; untick the box beside Add space on markup unpack if needed. You can re-enable the spaces later if you wish. This option will persist between sessions if Save options on exit (in the same dialog) is turned on.

Programmatically, this can be controlled with the following code:

Gate.getUserConfig().put(GateConstants.DOCUMENT_ADD_SPACE_ON_UNPACK_FEATURE_NAME, enabled);

where enabled is a boolean or Boolean.

The extensions associated with the XML reader are:

xml
xhtm
xhtml

The web server content type associated with XML documents is: text/xml.

The magic numbers test searches inside the document for the XML (<?xml version="1.0") signature. It is also able to detect whether the XML document uses the semantics described in the GATE document format DTD (see 1 below) or uses other semantics.

Namespace handling

By default, GATE will retain the namespace preﬁx and namespace URIs of XML elements when creating annotations and features within the Original markups annotation set. For example, the element

<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>

will create the following annotation

dc:title(xmlns:dc=http://purl.org/dc/elements/1.1/)

However, as the colon character ’:’ is a reserved meta-character in JAPE, it is not possible to write a JAPE rule that will match the dc:title element or its namespace URI.

If you need to match namespace-preﬁxed elements in the Original markups AS, you can alter the default namespace deserialization behaviour to remove the namespace preﬁx and add it as a feature (along with the namespace URI), by specifying the following attributes in the <GATECONFIG> element of gate.xml or local conﬁguration ﬁle:

addNamespaceFeatures - set to "true" to deserialize namespace prefix and URI information as features.
namespaceURI - the feature name to use that will hold the namespace URI of the element, e.g. "namespace"
namespacePrefix - the feature name to use that will hold the namespace prefix of the element, e.g. "prefix"

i.e.

<GATECONFIG
addNamespaceFeatures="true"
namespaceURI="namespace"
namespacePrefix="prefix" />

For example

<dc:title>Document title</dc:title>

would create in the Original markups AS (assuming the xmlns:dc URI has been defined in the document root or parent element)

title(prefix=dc, namespace=http://purl.org/dc/elements/1.1/)

If a JAPE rule is written to create a new annotation, e.g.

description(prefix=foo, namespace=http://www.example.org/)

then these would be serialized to

<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>
<foo:description xmlns:foo="http://www.example.org/">...</foo:description>

when using the ’Save preserving document format’ XML output option (see 1 below).

Output

GATE is capable of ensuring persistence for its resources. The types of persistent storage used for Language Resources are:

Java serialization;
XML serialization.

We describe the latter case here.

XML persistence doesn’t necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be objects of any kind, with various layers of nesting: for example, lists containing lists containing maps. Serializing these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost (the types will be converted into Strings). GATE provides full serialization of certain types of features, namely strings, numbers and collections, but only those collections that contain strings or numbers. All other features are serialized using their string representation, and when read back they will all be strings instead of the original objects. Consequences of this might be observed when performing evaluations (see Chapter 10).

When GATE outputs an XML document it may do so in one of two ways:

When the original document that was imported into GATE was an XML document, GATE can dump that document back into XML (possibly with additional markup added);
For all document formats, GATE can dump its internal representation of the document into XML.

In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-speciﬁc one which can be read back by the system to recreate all the information that GATE held internally for the document.

In order to understand why there are two types of XML serialization, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, sometimes it is impossible to save a document as XML using tags that surround the text referred to by the annotation, because tags crossover situations could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.

The problem of crossover tags appears with the preserve format option, which is implemented at the cost of losing certain annotations. With this option GATE tries to restore the original markup and, where possible, to add the annotations produced by GATE in the same manner.
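
Two spans cross when they overlap without one containing the other, which is exactly the situation that cannot be expressed with properly nested tags. A minimal sketch of such a test, using a hypothetical CrossoverSketch class and plain start/end offsets, might look like this:

```java
// Hypothetical sketch of a crossover test on character offsets.
class CrossoverSketch {

    // True if span a and span b overlap partially, so that neither contains
    // the other. Nested, identical, adjacent and disjoint spans do not cross.
    static boolean crosses(long aStart, long aEnd, long bStart, long bEnd) {
        return (aStart < bStart && bStart < aEnd && aEnd < bEnd)
            || (bStart < aStart && aStart < bEnd && bEnd < aEnd);
    }

    public static void main(String[] args) {
        // <a>....<b>....</a>....</b> cannot be written as well-formed XML.
        System.out.println(crosses(0, 10, 5, 15));
        // <a>..<b>..</b>..</a> nests correctly.
        System.out.println(crosses(0, 10, 2, 5));
    }
}
```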

How to Access and Use the Two Forms of XML Serialization

Save as XML Option

This option is available in GATE Developer in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling ‘Save as XML’ on each document of the corpus. This option saves all the annotations of a document together with their features (applying the restrictions previously discussed), using the GateDocument.dtd:

 <!ELEMENT GateDocument (GateDocumentFeatures,
           TextWithNodes, (AnnotationSet+))>
 <!ELEMENT GateDocumentFeatures (Feature+)>
 <!ELEMENT Feature (Name, Value)>
 <!ELEMENT Name (#PCDATA)>
 <!ELEMENT Value (#PCDATA)>
 <!ELEMENT TextWithNodes (#PCDATA | Node)*>
 <!ELEMENT AnnotationSet (Annotation*)>
 <!ATTLIST AnnotationSet  Name CDATA #IMPLIED>
 <!ELEMENT Annotation (Feature*)>
 <!ATTLIST Annotation  Type      CDATA #REQUIRED
                       StartNode CDATA #REQUIRED
                       EndNode   CDATA #REQUIRED>
 <!ELEMENT Node EMPTY>
 <!ATTLIST Node id CDATA #REQUIRED>

The document is saved under a name chosen by the user and it may have any extension. However, the recommended extension would be ‘xml’.

Using GATE Embedded, this option is available by calling gate.Document’s toXml() method. This method returns a string which is the XML representation of the document on which the method was called.

Note: It is recommended that the string representation be saved to the file system using the UTF-8 encoding, as the first line of the string is: <?xml version="1.0" encoding="UTF-8"?>
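
Writing a string with an explicit UTF-8 encoding can be done with the standard java.nio.file API, as in the sketch below. The XML string here is a placeholder standing in for whatever toXml() returns; the class SaveXmlStringSketch and its methods are hypothetical helpers, not part of GATE.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper showing an explicit UTF-8 write and read.
class SaveXmlStringSketch {

    // Write the string using UTF-8 regardless of the platform default.
    static void save(String xml, Path target) throws IOException {
        Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back using the same encoding.
    static String load(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    // Save to a temporary file, read it back, and clean up.
    static String roundTrip(String xml) {
        try {
            Path tmp = Files.createTempFile("gate-sketch", ".xml");
            try {
                save(xml, tmp);
                return load(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content; in practice this would be the serialized document.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><GateDocument/>";
        System.out.println(roundTrip(xml).equals(xml));
    }
}
```

Passing the charset explicitly avoids a mismatch between the bytes on disk and the encoding declared in the XML header when the platform default is not UTF-8.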

Example of such a GATE format document:

<?xml version="1.0" encoding="UTF-8" ?>
<GateDocument>

<!-- The document’s features-->

<GateDocumentFeatures>
<Feature>
  <Name className="java.lang.String">MimeType</Name>
  <Value className="java.lang.String">text/plain</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">gate.SourceURL</Name>
  <Value className="java.lang.String">file:/G:/tmp/example.txt</Value>
</Feature>
</GateDocumentFeatures>

<!-- The document content area with serialized nodes -->

<TextWithNodes>
<Node id="0"/>A TEENAGER <Node
id="11"/>yesterday<Node id="20"/> accused his parents of cruelty
by feeding him a daily diet of chips which sent his weight
ballooning to 22st at the age of l2<Node id="146"/>.<Node
id="147"/>
</TextWithNodes>

<!-- The default annotation set -->

<AnnotationSet>
<Annotation Type="Date" StartNode="11" EndNode="20">
<Feature>
  <Name className="java.lang.String">rule2</Name>
  <Value className="java.lang.String">DateOnlyFinal</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">rule1</Name>
  <Value className="java.lang.String">GazDateWords</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">date</Value>
</Feature>
</Annotation>
<Annotation Type="Sentence" StartNode="0" EndNode="147">
</Annotation>
<Annotation Type="Split" StartNode="146" EndNode="147">
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">internal</Value>
</Feature>
</Annotation>
<Annotation Type="Lookup" StartNode="11" EndNode="20">
<Feature>
  <Name className="java.lang.String">majorType</Name>
  <Value className="java.lang.String">date_key</Value>
</Feature>
</Annotation>
</AnnotationSet>

<!-- Named annotation set -->

<AnnotationSet Name="Original markups" >
 <Annotation
Type="paragraph" StartNode="0" EndNode="147"> </Annotation>
</AnnotationSet>
</GateDocument>

Note: Features are discarded unless they are numbers, strings, or collections containing numbers or strings. With this option, GATE does not preserve features that it cannot restore when the document is read back.

The Preserve Format Option

This option is available in GATE Developer from the popup menu of the annotations table. If no annotation in this table is selected, then the option will restore the document’s original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossover condition, that annotation is discarded and a message is issued.

This option makes it possible to generate an XML document with tags surrounding the annotation’s referenced text and features saved as attributes. All features which are collections, strings or numbers are saved, and the others are discarded. However, when the document is read back, only the attributes under the GATE namespace (see below) are reconstructed differently from the others. That is because GATE does not store in the XML document the information about each feature’s class or, for collections, the class of the items. So, when read back, all features will become strings, except those under the GATE namespace.

One will notice that all generated tags have an attribute called ‘gateId’ under the namespace ‘http://www.gate.ac.uk’. The attribute is used when the document is read back in GATE, in order to restore the annotation’s old ID. This feature is needed because it works in close cooperation with another attribute under the same namespace, called ‘matches’. This attribute indicates annotations/tags that refer to the same entity¹. They are under this namespace because GATE is sensitive to them and treats them differently from all other elements and attributes, which fall under the general reading algorithm described at the beginning of this section.

The ‘gateId’ under GATE namespace is used to create an annotation which has as ID the value indicated by this attribute. The ‘matches’ attribute is used to create an ArrayList in which the items will be Integers, representing the ID of annotations that the current one matches.

Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and <Person
gate:gateId="25" gate:matches="23;25;30">John Major</Person> are
the same person.

What GATE does when it parses this text is it creates two annotations:

a1.type = "Person"
a1.ID = Integer(23)
a1.start = <the start offset of John>
a1.end = <the end offset of John>
a1.featureMap = {}

a2.type = "Person"
a2.ID = Integer(25)
a2.start = <the start offset of John Major>
a2.end = <the end offset of John Major>
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}
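
Turning the value of a ‘matches’ attribute such as "23;25;30" into a list of Integer IDs is straightforward. The sketch below shows one way to do it; the class MatchesAttributeSketch is a hypothetical helper written for this example and is not GATE's own parsing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that parses a semicolon-separated list of IDs.
class MatchesAttributeSketch {

    // "23;25;30" becomes an ArrayList holding the Integers 23, 25 and 30.
    // Empty items and surrounding whitespace are ignored.
    static List<Integer> parseMatches(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Integer.valueOf(trimmed));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parseMatches("23;25;30"));
    }
}
```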

Under GATE Embedded, this option is available by calling gate.Document’s toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction, and for those found to violate it, a warning will be issued and they will be discarded.

In the next subsections we will show how this option applies to the other formats supported by GATE.

5.5.3 HTML [#]

Input

HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE in the same way as XML documents.

The extensions associated with the HTML reader are:

htm
html

The web server content type associated with HTML documents is: text/html.

The magic numbers test searches inside the document for the HTML signature (<html). Some HTML documents do not contain the HTML tag, so the magic numbers test may fail for them.

GATE customizes HTML documents slightly: it introduces new lines into the document’s text content in order to obtain a readable form. The annotations will refer to the pieces of text as described in the original document, but a few extra new line characters will have been inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new line (NL) char into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE will introduce one NL at the beginning of the paragraph and one at the end of the paragraph. None of the newly added NLs is considered to be part of the text contained by the tag.
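The newline rules above can be summarised as a small lookup. The following is a minimal sketch in plain Java, not GATE’s HTML handler; the class and method names are hypothetical, and treating tag names case-insensitively is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the newline rules; not part of GATE.
class HtmlNewlineRulesSketch {

    // Tags after which a single NL is introduced.
    private static final Set<String> ONE_NEWLINE_AFTER = new HashSet<>(Arrays.asList(
            "H1", "H2", "H3", "H4", "H5", "H6", "TR", "CENTER", "LI", "BR", "DIV"));

    /** Number of NLs introduced before the element's text (only P has one). */
    static int newlinesBefore(String tagName) {
        return "P".equals(tagName.toUpperCase(Locale.ROOT)) ? 1 : 0;
    }

    /** Number of NLs introduced after the element has been read. */
    static int newlinesAfter(String tagName) {
        String tag = tagName.toUpperCase(Locale.ROOT);
        if ("TITLE".equals(tag)) {
            return 2;
        }
        if ("P".equals(tag) || ONE_NEWLINE_AFTER.contains(tag)) {
            return 1;
        }
        return 0;
    }

    /** The element's text together with the NLs added around it. */
    static String withNewlines(String tagName, String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < newlinesBefore(tagName); i++) {
            sb.append('\n');
        }
        sb.append(text);
        for (int i = 0; i < newlinesAfter(tagName); i++) {
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A title followed by a paragraph, as they would appear in the content.
        System.out.print(withNewlines("title", "Home") + withNewlines("p", "Welcome."));
    }
}
```

Since the added NLs are not part of the text contained by the tag, an annotation for the element would cover only the `text` argument, not the surrounding newline characters.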

Output

The ‘Save as XML’ option works in exactly the same way for all GATE documents, so there is nothing particular to note about the HTML format.

When attempting to preserve the original markup formatting, GATE will generate the document in XHTML. The HTML document will look the same in any browser after being processed by GATE, but it will be in a different syntax.

5.5.4 SGML [#]

Input

The SGML support in GATE is fairly light as there is no freely available Java SGML parser. GATE uses a light converter that attempts to transform the input SGML file into well-formed XML. Because it does not make use of a DTD, the conversion may not always be accurate. It is advisable to perform an SGML2XML conversion outside the system (using a specialized tool) before using the SGML document inside GATE.

The extensions associated with the SGML reader are:

sgm
sgml

The web server content type associated with SGML documents is: text/sgml.

There is no magic numbers test for SGML.

Output

When attempting to preserve the original markup formatting, GATE will generate the document as XML because the real input of an SGML document inside GATE is an XML one.

5.5.5 Plain text [#]

Input

When reading a plain text document, GATE attempts to detect its paragraphs and adds ‘paragraph’ annotations to the document’s ‘Original markups’ annotation set. It does so by detecting two consecutive NLs. The procedure works for both UNIX-style and DOS-style text files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.

Paragraph 2. This text belongs to the second paragraph

then two annotations of type ‘paragraph’ will be created in the ‘Original markups’ annotation set (covering the first and second paragraphs), each with an empty feature map.

The extensions associated with the plain text reader are:
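Paragraph detection of this kind can be sketched with a regular expression. The code below is an illustration in plain Java rather than GATE’s implementation; the class name is hypothetical, and treating a run of more than two line breaks as a single separator is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of paragraph detection; not part of GATE.
class ParagraphDetectionSketch {

    // Two or more consecutive line breaks, UNIX (\n) or DOS (\r\n) style.
    private static final Pattern SEPARATOR = Pattern.compile("(?:\\r?\\n){2,}");

    /** Returns {start, end} character offsets, one pair per paragraph. */
    static List<int[]> paragraphSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                spans.add(new int[] {start, m.start()});
            }
            start = m.end(); // the next paragraph begins after the separator
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Paragraph 1. This text belongs to the first paragraph.\n\n"
                + "Paragraph 2. This text belongs to the second paragraph";
        for (int[] span : paragraphSpans(text)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```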

txt
text

The web server content type associated with plain text documents is: text/plain.

There is no magic numbers test for plain text.

Output

When attempting to preserve the original markup formatting, GATE will dump XML markup that surrounds the text referred to.

The procedure described above applies to both plain text and RTF documents.

5.5.6 RTF [#]

Input

RTF documents are accessed using Java’s RTF editor kit, which extracts only the document’s text content from the RTF document.

The extension associated with the RTF reader is ‘rtf’.

The web server content type associated with RTF documents is: text/rtf.

The magic numbers test searches for {\\rtf1.

Output

Same as the plain text output.

5.5.7 Email [#]

Input

GATE is able to read e-mail messages packed into one document (UNIX mailbox format). It detects multiple messages inside such documents and, for each message, creates annotations for all the fields composing an e-mail, such as date, from, to and subject. The message body is analyzed and paragraph detection is performed (just as in the plain text case). Each annotation created has the name of the corresponding e-mail field as its type, and all are placed in the ‘Original markups’ annotation set.

Example:

From someone@zzz.zzz.zzz Wed Sep  6 10:35:50 2000

Date: Wed, 6 Sep 2000 10:35:49 +0100 (BST)

From: forename1 surname2 <someone1@yyy.yyy.xxx>

To: forename2 surname2 <someone2@ddd.dddd.dd.dd>

Subject: A subject

Message-ID: <Pine.SOL.3.91.1000906103251.26010A-100000@servername>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

This text belongs to the e-mail body....

This is a paragraph in the body of the e-mail

This is another paragraph.

GATE attempts to detect lines such as ‘From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000’ in the e-mail text. These lines separate the e-mail messages contained in one file. Then, for each field in an e-mail message, an annotation is created as follows:

The annotation type is the name of the field, the feature map is empty, and the annotation spans from the end of the field name to the end of the line containing the field.

Example:

a1.type = "date"; a1 spans between the two ^ markers: Date:^ Wed, 6 Sep 2000 10:35:49 +0100 (BST)^

a2.type = "from"; a2 spans between the two ^ markers: From:^ forename1 surname2 <someone1@yyy.yyy.xxx>^
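The separator detection and the field spans can be sketched with two regular expressions. This is a plain-Java illustration under stated assumptions, not GATE’s e-mail reader: the class, the `Span` type and both patterns are hypothetical, the annotation type is taken to be the lower-cased field name (as in the example), and the span is taken to start immediately after the colon. The sketch expects to be given header lines only; it does not separate headers from the message body.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of GATE.
class EmailFieldSketch {

    /** Minimal stand-in for an annotation: a type plus character offsets. */
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Mailbox separator line, e.g. "From someone@zzz.zzz.zzz Wed Sep  6 10:35:50 2000".
    private static final Pattern SEPARATOR = Pattern.compile("(?m)^From \\S+ .*$");

    // Header line: field name, colon, then the value up to the end of the line.
    private static final Pattern FIELD = Pattern.compile("(?m)^([A-Za-z][A-Za-z-]*):(.*)$");

    /** Counts the separator lines, i.e. the messages in the mailbox text. */
    static int countMessages(String mailbox) {
        Matcher m = SEPARATOR.matcher(mailbox);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** One span per header field, from just after the colon to the end of the line. */
    static List<Span> fieldSpans(String headers) {
        List<Span> spans = new ArrayList<>();
        Matcher m = FIELD.matcher(headers);
        while (m.find()) {
            String type = m.group(1).toLowerCase(Locale.ROOT);
            spans.add(new Span(type, m.start(2), m.end(2)));
        }
        return spans;
    }

    public static void main(String[] args) {
        String headers = "Date: Wed, 6 Sep 2000 10:35:49 +0100 (BST)\n"
                + "From: forename1 surname2 <someone1@yyy.yyy.xxx>\n";
        for (Span s : fieldSpans(headers)) {
            System.out.println(s.type + " [" + s.start + ", " + s.end + ")");
        }
    }
}
```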

The extensions associated with the email reader are:

eml
email
mail

The web server content type associated with email documents is: text/email.

The magic numbers test searches for keywords such as ‘Subject:’.

Output

Same as plain text output.

5.5.8 PDF Files and Office Documents [#]

GATE uses the Apache Tika library to provide support for PDF documents and a number of the document formats from both Microsoft Office and OpenOffice. In essence, Tika converts the document structure into HTML, which is then used to create a GATE document. This means that although a PDF or Word document may have been loaded, the “Original markups” set will contain HTML elements. One advantage of this approach is that processing resources and JAPE grammars designed for use with HTML files should also work well with PDF and Office documents.

5.5.9 UIMA CAS Documents [#]

GATE can read UIMA CAS documents. CAS stands for Common Analysis Structure; it provides a common representation of the artifact being analyzed, here a text.

The subject of analysis (SOFA), here a string, is used as the document content. Multiple SOFAs are concatenated. Analysis results or metadata that have begin and end offsets are added as annotations; the others are added as document features. The views are added as GATE annotation sets. The type system (a hierarchical annotation schema) is not currently supported.

The web server content type associated with UIMA documents is: text/xmi+xml.

The extensions are: xcas, xmicas, xmi.

The magic numbers are:

<CAS version="2">

and

xmlns:cas=
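A magic numbers test of this kind amounts to searching the document for known signatures. The sketch below shows the idea for the two UIMA signatures listed above; it is plain Java with hypothetical names, not GATE’s detection code, and searching the supplied text as a whole is an assumption.

```java
// Hypothetical sketch of a magic numbers test; not part of GATE.
class UimaMagicSketch {

    // The two signatures listed in the text.
    private static final String[] MAGIC = {"<CAS version=\"2\">", "xmlns:cas="};

    /** True if the given document text contains one of the UIMA signatures. */
    static boolean looksLikeUimaCas(String documentText) {
        for (String magic : MAGIC) {
            if (documentText.contains(magic)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUimaCas("<?xml version=\"1.0\"?><CAS version=\"2\"></CAS>"));
        System.out.println(looksLikeUimaCas("<html><body>Not a CAS</body></html>"));
    }
}
```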

5.5.10 CoNLL/IOB Documents [#]

GATE can read files of text annotated in the traditional CoNLL or BIO/BILOU format, typically used to represent POS tags and chunks and best known from the Conference on Natural Language Learning² tasks. The following example illustrates one sentence with POS and chunk tags (B- and I- indicate the beginning and continuation, respectively, of a chunk); the columns represent the tokens, the POS tags, and the chunk tags, and sentences are separated by blank lines.

My    PRP$  B-NP
dog   NN    I-NP
has   VBZ   B-VP
fleas NNS   B-NP
.     .     O

GATE interprets this format quite flexibly: the columns can be separated by any whitespace sequence, and the number of columns can vary. The strings from the leftmost column become strings in the document content, with spaces interposed, and Token and SpaceToken annotations (with string and length features) are created appropriately in the Original markups set.

Each blank line (empty or containing only whitespace) in the original data becomes a newline in the document content.

The tags in subsequent columns are transformed into annotations. A chunk tag (beginning with B- and followed by zero or more matching I- tags) produces an annotation whose type is determined by the rest of the tag (NP or VP in the above example, but any string with no whitespace is acceptable), with a kind = chunk feature. A chunk tag beginning with L- (last) terminates the chunk, and a U- (unigram) tag produces a chunk annotation over one token. Other tags produce annotations with the tag name as the type and a kind = token feature.

Every annotation derived from a tag has a column feature whose int value indicates the source column in the data (numbered from 0 for the string column). An “O” tag closes all open chunk tags at the end of the previous token.
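The conversion from columns to content and annotations can be sketched as follows. This is a simplified plain-Java illustration, not GATE’s CoNLL reader: the class and the `Ann` type are hypothetical, Token and SpaceToken annotations are omitted, and several details are assumptions made for the sketch (an “O” tag closes only the chunk open in its own column, a blank line closes all open chunks, and an I- or L- tag with no matching open chunk starts a new one).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of GATE.
class ConllSketch {

    /** Minimal stand-in for an annotation. */
    static final class Ann {
        final String type;
        final String kind;  // "chunk" or "token"
        final int column;   // source column, 0 being the string column
        final int start;
        int end;

        Ann(String type, String kind, int column, int start, int end) {
            this.type = type;
            this.kind = kind;
            this.column = column;
            this.start = start;
            this.end = end;
        }
    }

    final StringBuilder content = new StringBuilder();
    final List<Ann> annotations = new ArrayList<>();

    void read(List<String> lines) {
        Map<Integer, Ann> open = new HashMap<>(); // open chunk per column
        boolean sentenceStart = true;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                content.append('\n'); // each blank line becomes a newline
                open.clear();         // assumption: chunks end with the sentence
                sentenceStart = true;
                continue;
            }
            String[] cols = line.trim().split("\\s+");
            if (!sentenceStart) {
                content.append(' ');  // space interposed between tokens
            }
            sentenceStart = false;
            int start = content.length();
            content.append(cols[0]);
            int end = content.length();
            for (int c = 1; c < cols.length; c++) {
                handleTag(cols[c], c, start, end, open);
            }
        }
    }

    private void handleTag(String tag, int c, int start, int end, Map<Integer, Ann> open) {
        if (tag.equals("O")) {
            open.remove(c); // the open chunk ended at the previous token
            return;
        }
        boolean chunkTag = tag.length() > 2 && tag.charAt(1) == '-'
                && "BILU".indexOf(tag.charAt(0)) >= 0;
        if (!chunkTag) {
            annotations.add(new Ann(tag, "token", c, start, end));
            return;
        }
        char prefix = tag.charAt(0);
        String type = tag.substring(2);
        Ann current = open.get(c);
        boolean continues = (prefix == 'I' || prefix == 'L')
                && current != null && current.type.equals(type);
        if (continues) {
            current.end = end; // extend the open chunk over this token
        } else {
            current = new Ann(type, "chunk", c, start, end);
            annotations.add(current);
            open.put(c, current);
        }
        if (prefix == 'L' || prefix == 'U') {
            open.remove(c); // L- and U- tags terminate the chunk
        }
    }

    public static void main(String[] args) {
        ConllSketch doc = new ConllSketch();
        doc.read(Arrays.asList("My    PRP$  B-NP", "dog   NN    I-NP",
                "has   VBZ   B-VP", "fleas NNS   B-NP", ".     .     O"));
        System.out.println(doc.content);
        for (Ann a : doc.annotations) {
            System.out.println(a.type + " kind=" + a.kind + " column=" + a.column
                    + " [" + a.start + ", " + a.end + ")");
        }
    }
}
```

For the example sentence, the sketch yields the content `My dog has fleas .`, one token-kind annotation per POS tag, and three chunk annotations (NP over “My dog”, VP over “has”, NP over “fleas”).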

This document format is associated with MIME type text/x-conll and filename extensions .conll and .iob.
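The filename extensions and content types listed for the formats in Sections 5.5.3 to 5.5.10 can be collected in a small lookup table. The sketch below is plain Java with hypothetical names; it only restates the associations given in the text and does not reflect how GATE registers its document formats internally. Matching extensions case-insensitively is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical summary table; not part of GATE.
class FormatTableSketch {

    private static final Map<String, String> MIME_BY_EXTENSION = new LinkedHashMap<>();

    static {
        register("text/html", "htm", "html");
        register("text/sgml", "sgm", "sgml");
        register("text/plain", "txt", "text");
        register("text/rtf", "rtf");
        register("text/email", "eml", "email", "mail");
        register("text/xmi+xml", "xcas", "xmicas", "xmi");
        register("text/x-conll", "conll", "iob");
    }

    private static void register(String mimeType, String... extensions) {
        for (String extension : extensions) {
            MIME_BY_EXTENSION.put(extension, mimeType);
        }
    }

    /** Returns the content type for a file name, or null if the extension is unknown. */
    static String mimeTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.get(extension);
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("inbox.eml"));
        System.out.println(mimeTypeFor("train.conll"));
    }
}
```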

5.6 XML Input/Output [#]

Support for input from and output to XML is described in Section 5.5.2. In short:

GATE will read any well-formed XML document (it does not attempt to validate XML documents). Markup will by default be converted into native GATE format.
GATE will write back into XML in one of two ways:
1. Preserving the original format and adding selected markup (for example to add the results of some language analysis process to the document).
2. In GATE’s own XML serialisation format, which encodes all the data in a GATE Document (as far as this is possible within a tree-structured paradigm – for 100% non-lossy data storage use GATE’s RDBMS or binary serialisation facilities – see Section 4.5).

When using GATE Embedded, object representations of XML documents such as DOM or JDOM, or query and transformation languages such as XPath or XSLT, may be used in parallel with GATE’s own Document representation (gate.Document) without conflicts.

¹It’s not an XML entity but an information extraction named entity.

² http://ifarm.nl/signll/conll/
