A {@link edu.cmu.minorthird.text.TextToken} is a "token" (usually a single word in a document), plus some additional information that allows one to find out where this word/token occurred. Specifically, one can recover the string that contained the token, a shorter string identifier of this "document" string, and the character offsets of the token, i.e., where it appeared in the document string.
A {@link edu.cmu.minorthird.text.Span} is a sequence of adjacent TextTokens from the same document.
Spans and TextTokens are considered to be inherently ordered. If two Spans or TextTokens are from different documents, they are ordered lexicographically based on the identifiers of those documents. Within a single document, TextTokens are ordered according to their position in the document, and Spans are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
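The ordering rule above can be illustrated with a small, self-contained sketch. The class SketchSpan below is hypothetical and is not the minorthird Span class; it only records a document identifier and the positions of the leftmost and rightmost tokens. The sketch also assumes that ties on the leftmost token are broken in ascending order of the rightmost token, which the text does not state explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a Span: a document id plus the positions of its
// leftmost and rightmost tokens. This is NOT the minorthird Span class.
final class SketchSpan {
    final String documentId;
    final int leftToken;   // index of the leftmost token in the document
    final int rightToken;  // index of the rightmost token in the document

    SketchSpan(String documentId, int leftToken, int rightToken) {
        this.documentId = documentId;
        this.leftToken = leftToken;
        this.rightToken = rightToken;
    }

    // Document id first (lexicographic), then leftmost token, then rightmost
    // token as the tie-breaker (ascending order is an assumption).
    static final Comparator<SketchSpan> ORDER =
        Comparator.comparing((SketchSpan s) -> s.documentId)
                  .thenComparingInt(s -> s.leftToken)
                  .thenComparingInt(s -> s.rightToken);

    // Returns a new list holding the given spans in sorted order.
    static List<SketchSpan> sorted(SketchSpan... spans) {
        List<SketchSpan> list = new ArrayList<>(Arrays.asList(spans));
        list.sort(ORDER);
        return list;
    }

    @Override
    public String toString() {
        return documentId + "[" + leftToken + ".." + rightToken + "]";
    }
}
```

For example, sorting spans from documents "a" and "b" places every span of "a" first, and two spans of "a" that start at the same token are ordered by where they end.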
A {@link edu.cmu.minorthird.text.TextBase} is a collection of tokenized "document" strings, accessible as Spans.
A {@link edu.cmu.minorthird.text.TextLabels} contains markup for a {@link edu.cmu.minorthird.text.TextBase}. This markup can consist of
Markup in a TextLabels object is usually provided by an {@link edu.cmu.minorthird.text.Annotator}. A sort of subroutine-calling mechanism for Annotators is provided by the textLabels.require call, the textLabels.isAnnotatedBy call, and the {@link edu.cmu.minorthird.text.AnnotatorLoader} mechanism. If one Annotator relies on the output of another (for instance, an NP chunker requires POS tags), it should use the textLabels.require method to make sure that the annotation is present. textLabels.require then uses an AnnotatorLoader to find an Annotator that will produce the required annotation type, using the annotatorLoader.findAnnotator method. Annotators record the fact that they have been run on a textLabels object by using the textLabels.setAnnotatedBy(...) method; this ensures that an Annotator is not run more than once on the same object.
Taken together, these mechanisms provide something between a programming language for annotations and a simple planner for constructing annotations. Viewed as a planner, each Annotator corresponds to an operator: its preconditions are specified by calls to "require", and its postconditions are specified by calls to "setAnnotatedBy" (or, in Mixup, by "provide" statements). The AnnotatorLoader corresponds to a backward-chaining planner, and its decisions about which Annotator to use determine how the plan is constructed.
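The following is a minimal sketch of the planner view, not the minorthird implementation. All types here (SketchAnnotator, SketchLoader, SketchLabels) are hypothetical stand-ins whose method names merely echo the calls discussed above; the real signatures may differ. It shows a "require" call chaining backward from an NP chunker to a POS tagger, and "setAnnotatedBy" preventing a second run.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the require / setAnnotatedBy protocol.
// None of these types are the minorthird classes; they only mirror the idea.
final class RequireSketch {
    interface SketchAnnotator {
        void annotate(SketchLabels labels);
    }

    // Stand-in for an AnnotatorLoader: maps an annotation type to an annotator.
    static final class SketchLoader {
        private final Map<String, SketchAnnotator> byType = new HashMap<>();

        void register(String type, SketchAnnotator annotator) {
            byType.put(type, annotator);
        }

        SketchAnnotator findAnnotator(String type) {
            SketchAnnotator annotator = byType.get(type);
            if (annotator == null) {
                throw new IllegalStateException("no annotator for " + type);
            }
            return annotator;
        }
    }

    // Stand-in for a TextLabels object; it tracks only which annotators ran.
    static final class SketchLabels {
        private final Set<String> annotatedBy = new HashSet<>();
        final List<String> runLog = new ArrayList<>();
        private final SketchLoader loader;

        SketchLabels(SketchLoader loader) {
            this.loader = loader;
        }

        boolean isAnnotatedBy(String type) {
            return annotatedBy.contains(type);
        }

        void setAnnotatedBy(String type) {
            annotatedBy.add(type);
        }

        // Precondition check: run the providing annotator only if needed.
        void require(String type) {
            if (!isAnnotatedBy(type)) {
                loader.findAnnotator(type).annotate(this);
            }
        }
    }

    static SketchLabels demo() {
        SketchLoader loader = new SketchLoader();
        loader.register("pos", labels -> {
            labels.runLog.add("pos");
            labels.setAnnotatedBy("pos");       // postcondition
        });
        loader.register("npchunk", labels -> {
            labels.require("pos");              // precondition
            labels.runLog.add("npchunk");
            labels.setAnnotatedBy("npchunk");   // postcondition
        });
        SketchLabels labels = new SketchLabels(loader);
        labels.require("npchunk");  // runs "pos" first, then "npchunk"
        labels.require("pos");      // already present, so nothing runs
        return labels;
    }
}
```

In this sketch the run log after demo() records the POS tagger once, followed by the NP chunker, which is the plan a backward-chaining search from "npchunk" would produce.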
However, the AnnotatorLoader doesn't do anything fancy to find Annotators: in response to a "require" call for label "foo", the AnnotatorLoader looks for a file "foo.mixup" or a Java class named "foo", in that order. So the default behavior is simple enough that it looks more like a programming language, with the AnnotatorLoader being just a binding mechanism.
There are several ways the binding mechanism can be modified.
In the require call, one can specify a filename in addition to a desired label type (in Mixup, this is the second argument to the "require" call). This causes this filename to be used instead of the default "foo.mixup" or Java class "foo".
In the annotators.config file (usually located in minorthird/config), one can specify default filenames for a set of label types "foo". These will be used instead of "foo.mixup", unless some other filename is specified.
One can pass a different AnnotatorLoader to the require call, which uses different rules to find files.
The main use of this mechanism is the {@link edu.cmu.minorthird.text.EncapsulatingAnnotatorLoader}, which contains a cache of files and/or Java classes that it will use in preference to anything on the classpath. This is useful if you want to bundle a bunch of Annotators along with a classifier or extractor that uses them.
Currently, AnnotatorLoaders are not used for loading Mixup resources like dictionary files, only for loading Annotators.
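The binding rules described above can be summarized in a short sketch. The class BindingSketch and its method are hypothetical and are not part of minorthird; the map stands in for the entries of annotators.config, whose actual file format is not described here. The sketch also assumes that the Java class "foo" remains a fallback when a configured filename replaces "foo.mixup", which is one reading of the text rather than a documented fact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how a label is bound to a source of annotations.
// It only lists the candidates in lookup order; it loads nothing.
final class BindingSketch {
    /**
     * @param label          the required label type, e.g. "foo"
     * @param explicitFile   filename given in the require call, or null
     * @param configDefaults label -> filename, standing in for annotators.config
     */
    static List<String> candidates(String label,
                                   String explicitFile,
                                   Map<String, String> configDefaults) {
        List<String> result = new ArrayList<>();
        if (explicitFile != null) {
            // An explicit filename replaces both default candidates.
            result.add("file:" + explicitFile);
            return result;
        }
        String configured = configDefaults.get(label);
        if (configured != null) {
            // A configured default replaces "<label>.mixup".
            result.add("file:" + configured);
        } else {
            result.add("file:" + label + ".mixup");
        }
        // Assumption: the Java class named after the label is still tried last.
        result.add("class:" + label);
        return result;
    }
}
```

With an empty configuration and no explicit filename, a request for "foo" yields "foo.mixup" followed by the class "foo", matching the default behavior; supplying a filename in the call narrows the list to that single file.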
A {@link edu.cmu.minorthird.text.NestedTextLabels} is an odd sort of implementation of a MonotonicTextLabels. It combines two TextLabels objects, an "inner" one and an "outer" one, such that the outer one can be monotonically added to, but the inner one is never modified. Semantically, the markup in a NestedTextLabels is the union of the markup in the inner and outer TextLabels objects, except that property values in the outer TextLabels "shadow" values in the inner TextLabels. This has several possible uses, for instance: