Why has the pleasure of slowness disappeared? Ah, where have they gone, the amblers of yesteryear? Where have they gone, those loafing heroes of folk song, those vagabonds who roam from one mill to another and bed down under the stars? Have they vanished along with footpaths, with grasslands and clearings, with nature? There is a Czech proverb that describes their easy indolence by a metaphor: ‘they are gazing at God’s windows.’ A person gazing at God’s windows is not bored; he is happy. In our world, indolence has turned into having nothing to do, which is a completely different thing: a person with nothing to do is frustrated, bored, is constantly searching for an activity he lacks.
Slowness, Milan Kundera, 1995 (pp. 4-5).
GATE is a backplane into which specialised Java Beans plug. These beans are loosely coupled with respect to each other: they communicate entirely by means of the GATE framework. Inter-component communication is handled by model components (LanguageResources) and by events.
Components are defined by conformance to various interfaces (e.g. LanguageResource), ensuring separation of interface and implementation.
The reason for adding to the normal bean initialisation mechanism is that LRs, PRs and VRs all have characteristic parameterisation phases; the GATE resources/components model makes these phases explicit.
GATE is structured around a number of what we might call principles, or patterns, or alternatively, clever ideas stolen from better minds than mine. These patterns are:
Four interfaces in the top-level package describe the GATE view of components: Resource, ProcessingResource, LanguageResource and VisualResource.
Wherever users of the architecture may wish to extend the set of a particular type of entity, those types should be expressed as components.
Another way to express this is to say that the architecture is based on agents. I’ve avoided this in the past because of an association between this term and the idea of bits of code moving around between machines of their own volition. I take this to be somewhat pointless, and probably the result of an anthropomorphic obsession with mobility as a correlate of intelligence. If we drop this connotation, however, we can say that GATE is an agent-based architecture. If we want to, that is.
Many of the classes in the framework are components, by which we mean classes that conform to an interface with certain standard properties. In our case these properties are based on the Java Beans component architecture, with the addition of component metadata, automated loading and standardised storage, threading and distribution.
All components inherit from Resource, via one of the three sub-interfaces LanguageResource (LR), VisualResource (VR) or ProcessingResource (PR). VisualResources (VRs) are straightforward – they represent visualisation and editing components that participate in GUIs – but the distinction between language and processing resources merits further discussion.
Like other software, LE programs consist of data and algorithms. The current orthodoxy in software development is to model both data and algorithms together, as objects¹. Systems that adopt the new approach are referred to as Object-Oriented (OO), and there are good reasons to believe that OO software is easier to build and maintain than other varieties [Booch 94, Yourdon 96].
In the domain of human language processing R&D, however, the terminology is a little more complex. Language data, in various forms, is of such significance in the field that it is frequently worked on independently of the algorithms that process it. For example: a treebank² can be developed independently of the parsers that may later be trained from it; a thesaurus can be developed independently of the query expansion or sense tagging mechanisms that may later come to use it. This type of data has come to have its own term, Language Resources (LRs) [LREC-1 98], covering many data sources, from lexicons to corpora.
In recognition of this distinction, we will adopt the following terminology:
Additional terminology worthy of note in this context: language data refers to LRs which are at their core examples of language in practice, or ‘performance data’, e.g. corpora of texts or speech recordings (possibly including added descriptive information as markup); data about language refers to LRs which are purely descriptive, such as a grammar or lexicon.
PRs can be viewed as algorithms that map between different types of LR, and which typically use LRs in the mapping process. An MT engine, for example, maps a monolingual corpus into a multilingual aligned corpus using lexicons, grammars, etc.³
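The relationship between the component interfaces and the PR/LR distinction can be sketched roughly as follows. The four interface names are taken from the text; the methods on them (getName, execute) and the two implementing classes are assumptions made for illustration and are not the actual GATE signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins named after the four GATE interfaces; the methods are illustrative only.
interface Resource { String getName(); }
interface LanguageResource extends Resource { }
interface ProcessingResource extends Resource { void execute(); }
interface VisualResource extends Resource { }

// Hypothetical LR: holds language data (here, a list of token strings).
class TokenList implements LanguageResource {
    final List<String> tokens = new ArrayList<>();
    public String getName() { return "token list"; }
}

// Hypothetical PR: an algorithm that maps raw text into the LR above.
class WhitespaceTokeniser implements ProcessingResource {
    private final String text;
    private final TokenList output;

    WhitespaceTokeniser(String text, TokenList output) {
        this.text = text;
        this.output = output;
    }

    public String getName() { return "whitespace tokeniser"; }

    public void execute() {
        // Split on runs of whitespace and skip any empty fragment.
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) output.tokens.add(t);
        }
    }
}
```

The data (TokenList) can be built, stored and inspected without reference to the algorithm that filled it, which is the separation the LR/PR terminology is meant to capture.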
Further support for the PR/LR terminology may be gleaned from the argument in favour of declarative data structures for grammars, knowledge bases, etc. This argument was current in the late 1980s and early 1990s [Gazdar & Mellish 89], partly as a response to what has been seen as the overly procedural nature of previous techniques such as augmented transition networks. Declarative structures represent a separation between data about language and the algorithms that use the data to perform language processing tasks; a similar separation to that used in GATE.
Adopting the PR/LR distinction is a matter of conforming to established domain practice and terminology. It does not imply that we cannot model the domain (or build software to support it) in an Object-Oriented manner; indeed the models in GATE are themselves Object-Oriented.
According to Buschmann et al. (Pattern-Oriented Software Architecture, 1996), the Model-View-Controller (MVC) pattern
...divides an interactive application into three components. The model contains the core functionality and data. Views display information to the user. Controllers handle user input. Views and controllers together comprise the user interface. A change-propagation mechanism ensures consistency between the user interface and the model. [p.125]
A variant of MVC, the Document-View pattern,
...relaxes the separation of view and controller... The View component of Document-View combines the responsibilities of controller and view in MVC, and implements the user interface of the system.
A benefit of both arrangements is that
...loose coupling of the document and view components enables multiple simultaneous synchronized but different views of the same document.
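The change-propagation mechanism and the "multiple synchronized views" benefit described in these quotations can be illustrated with a minimal sketch. All class names here (TextModel, UpperCaseView, LengthView) are hypothetical and have no counterpart in GATE or Swing; the listener list stands in for whatever event mechanism a real framework provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Model: owns the data and notifies registered listeners on every change.
class TextModel {
    private final List<Consumer<String>> listeners = new ArrayList<>();
    private String content = "";

    void addListener(Consumer<String> listener) { listeners.add(listener); }

    void setContent(String newContent) {
        content = newContent;
        // Change propagation: every registered view is told about the new state.
        for (Consumer<String> l : listeners) l.accept(content);
    }

    String getContent() { return content; }
}

// Two different views of the same model; neither knows about the other.
class UpperCaseView {
    String rendered = "";
    UpperCaseView(TextModel model) {
        model.addListener(c -> rendered = c.toUpperCase());
    }
}

class LengthView {
    int rendered = 0;
    LengthView(TextModel model) {
        model.addListener(c -> rendered = c.length());
    }
}
```

Because the model only knows its listeners through a generic callback type, further views can be added without changing the model.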
Geary (Graphic Java 2, 3rd edn., 1999) gives a slightly different view:
MVC separates applications into three types of objects:
[pp. 71, 75]
Swing, the Java user interface framework, uses
a specialised version of the classic MVC meant to support pluggable look and feel instead of applications in general. [p. 75]
GATE may be regarded as an MVC architecture in two ways:
The implementation of types should generally be hidden from the clients of the architecture.
With a few exceptions (such as for utility classes), clients of the framework work with the gate.* package. This package is mostly composed of interface definitions. Instantiations of these interfaces are obtained via the Factory class.
The subsidiary packages of GATE provide the implementations of the gate.* interfaces that are accessed via the factory. They themselves avoid directly constructing classes from other packages (with a few exceptions, such as JAPE’s need for unattached annotation sets). Instead they use the factory.
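A minimal sketch of this arrangement, assuming nothing about the real Factory class beyond what is said above: clients hold an interface type, and a factory decides which implementation to construct. The names SketchDocument, SketchDocumentImpl and SketchFactory, and the create method, are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// What clients see: an interface only.
interface SketchDocument { String getContent(); }

// Hidden implementation; clients never name this class.
class SketchDocumentImpl implements SketchDocument {
    private final String content;
    SketchDocumentImpl(String content) { this.content = content; }
    public String getContent() { return content; }
}

// Hypothetical factory: maps a type name to a constructor function.
final class SketchFactory {
    private static final Map<String, Function<String, SketchDocument>> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("Document", SketchDocumentImpl::new);
    }

    private SketchFactory() { }

    static SketchDocument create(String typeName, String content) {
        Function<String, SketchDocument> ctor = REGISTRY.get(typeName);
        if (ctor == null) {
            throw new IllegalArgumentException("unknown type: " + typeName);
        }
        return ctor.apply(content);
    }
}
```

A client would write `SketchDocument d = SketchFactory.create("Document", "some text");` and so remain unaffected if the implementation class were later replaced.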
When and how to use exceptions? Borrowing from Bill Venners, here are some guidelines (with examples):
Example:
If the creation of a resource such as a document requires a URL as a parameter, the
method that does the creation needs to construct the URL and read from it. If there is
an exception during this process, the GATE method should abort by throwing its own
exception. The exception will be dealt with higher up the food chain, e.g. by asking
the user to input another URL, or by aborting a batch script.
Example:
With reference to the previous example, a problem using the URL will be signalled by
something like an UnknownHostException or an IOException. These should be caught
and re-thrown as descendants of GateException.
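The catch-and-re-throw step might look roughly like the sketch below. The two exception classes are local stand-ins that only borrow the names used in this text; the DocumentLoader class and its readFirstByte method are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;

// Local stand-ins for the exception types named in the text (not the GATE classes).
class GateException extends Exception {
    GateException(String message, Throwable cause) { super(message, cause); }
}

class ResourceInstantiationException extends GateException {
    ResourceInstantiationException(String message, Throwable cause) { super(message, cause); }
}

class DocumentLoader {
    // Hypothetical creation step: build the URL and read its first byte.
    static int readFirstByte(String spec) throws ResourceInstantiationException {
        try {
            URL url = URI.create(spec).toURL();       // MalformedURLException for unknown protocols
            try (InputStream in = url.openStream()) { // UnknownHostException, other IOExceptions
                return in.read();
            }
        } catch (IOException | IllegalArgumentException e) {
            // Re-throw as a framework exception, keeping the original as the cause.
            throw new ResourceInstantiationException("cannot read " + spec, e);
        }
    }
}
```

Callers then need to handle only the framework's own exception type, while the original cause remains available through getCause().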
Example:
If a method is creating annotations on a document, and before creating the annotations
it checks that their start and end points are valid ranges in relation to the content
of the document (i.e. they fall within the offset space of the document, and the end
is after the start), then if the method receives an InvalidOffsetException from the
AnnotationSet.add call, something is seriously wrong. In such cases it may be best to
throw a GateRuntimeException.
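As a sketch of this situation: the caller validates the offsets itself, so a checked exception from the add call can only mean an internal inconsistency and is converted to an unchecked one. All classes below are simplified stand-ins that reuse names from the text; SketchAnnotationSet and Annotator are hypothetical.

```java
// Local stand-ins for the types named in the text (not the GATE classes).
class InvalidOffsetException extends Exception {
    InvalidOffsetException(String message) { super(message); }
}

class GateRuntimeException extends RuntimeException {
    GateRuntimeException(String message, Throwable cause) { super(message, cause); }
}

class SketchAnnotationSet {
    private final long docLength;
    int size = 0;

    SketchAnnotationSet(long docLength) { this.docLength = docLength; }

    // Mimics an add call that rejects offsets outside the document.
    void add(long start, long end) throws InvalidOffsetException {
        if (start < 0 || end > docLength || end < start) {
            throw new InvalidOffsetException(start + "-" + end);
        }
        size++;
    }
}

class Annotator {
    static boolean annotate(SketchAnnotationSet set, long docLength, long start, long end) {
        // The caller checks the range first ...
        if (start < 0 || end > docLength || end <= start) return false;
        try {
            set.add(start, end);
            return true;
        } catch (InvalidOffsetException e) {
            // ... so reaching this point signals a bug, not bad input.
            throw new GateRuntimeException("offsets validated but rejected", e);
        }
    }
}
```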
Example:
The SAX XML parser API uses SAXException. Implementing a SAX parser for a
document type involves overriding methods that throw this exception. Where you want
to have a subtype for some problem which is specific to GATE processing, you could
use GateSaxException which extends SAXException.
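A minimal sketch of the subtype idea, using the standard org.xml.sax classes. The GateSaxException below is a local stand-in with the same name as the class mentioned above, and StrictHandler with its "forbidden" element is an invented example.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Local stand-in: a framework-specific subtype of the exception the SAX API declares.
class GateSaxException extends SAXException {
    GateSaxException(String message) { super(message); }
}

// Hypothetical handler: the overridden method may only throw SAXException,
// so a framework-specific problem is reported through the subtype.
class StrictHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("forbidden".equals(qName)) {
            throw new GateSaxException("element not allowed: " + qName);
        }
    }
}
```

Code that drives the parser can catch SAXException in general, or GateSaxException specifically when it wants to distinguish framework problems from ordinary parsing errors.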
Example:
See also the testing notes.
Example:
The gate.creole package has a ResourceInstantiationException - this deals with all problems
to do with creating resources. We could have had "ResourceUrlProblem" and
"ResourceParameterProblem" but that would probably have ended up with too many. On
the other hand, just throwing everything as GateException is too coarse (Hamish take
note!).
Example:
gate.jape.ParserException is correctly placed; if it were in gate.util it might clash with, for
example, gate.xml.ParserException, if such a class existed.
¹ Older development methods like Jackson Structured Design [Jackson 75] or Structured Analysis [Yourdon 89] kept them largely separate.
² A corpus of texts annotated with syntactic analyses.
³ This point is due to Wim Peters.