gate.creole.tokeniser
Class SimpleTokeniser

java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractProcessingResource
                    |
                    +--gate.creole.AbstractLanguageAnalyser
                          |
                          +--gate.creole.tokeniser.SimpleTokeniser
All Implemented Interfaces:
ANNIEConstants, Executable, FeatureBearer, LanguageAnalyser, NameBearer, ProcessingResource, Resource, Serializable

public class SimpleTokeniser
extends AbstractLanguageAnalyser

Implementation of a Unicode rule-based tokeniser. The tokeniser gets its rules from a file, an InputStream or a Reader, which should be sent to one of the constructors. The implementation is based on a finite state machine that is built from the set of rules. A rule has two sides, the left hand side (LHS) and the right hand side (RHS), which are separated by the ">" character. The LHS represents a regular expression that will be matched against the input, while the RHS describes a Gate2 annotation in terms of annotation type and attribute-value pairs. The matching is done using Unicode enumerated types as defined by the Character class. At the time of writing this class the supported Unicode categories were:

The accepted operators for the LHS are "+", "*" and "|", having the usual interpretations of "1 to n occurrences", "0 to n occurrences" and "boolean OR". For instance, this is a valid LHS:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+
meaning an uppercase letter followed by one or more lowercase letters. The RHS describes an annotation that is to be created and inserted in the provided annotation set in case of a match. The new annotation will span the text that has been recognised. The RHS consists of the annotation type followed by pairs of attributes and associated values. E.g. for the LHS above a possible RHS can be:
Token;kind=upperInitial;
representing an annotation of type "Token" having one attribute named "kind" with the value "upperInitial".
The entire rule will be:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;
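To make the category-based matching concrete, here is a minimal sketch of what the example LHS accepts, checked by hand with Character.getType. It is not the GATE implementation, which compiles the rules into a finite state machine; the class and method names are hypothetical.

```java
// Hypothetical sketch: hand-coded check for the example LHS
// "UPPERCASE_LETTER" "LOWERCASE_LETTER"+ (not part of GATE).
final class UpperInitialSketch {

  /**
   * Returns the length of the longest match starting at offset for
   * "one uppercase letter followed by one or more lowercase letters",
   * or -1 if the text does not match at that offset.
   */
  static int matchUpperInitial(String text, int offset) {
    if (offset >= text.length()
        || Character.getType(text.charAt(offset)) != Character.UPPERCASE_LETTER) {
      return -1;
    }
    int i = offset + 1;
    while (i < text.length()
        && Character.getType(text.charAt(i)) == Character.LOWERCASE_LETTER) {
      i++;
    }
    // "+" requires at least one lowercase letter after the initial capital.
    return (i - offset >= 2) ? i - offset : -1;
  }

  public static void main(String[] args) {
    System.out.println(matchUpperInitial("Hello world", 0));
    System.out.println(matchUpperInitial("hello", 0));
  }
}
```

With the rule above, a successful match of this kind would yield a "Token" annotation with kind=upperInitial spanning the matched characters.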

The tokeniser ignores all the empty lines or the ones that start with # or //.
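The line handling described above can be sketched as follows. This is an illustration only: the class RuleLineSketch and its members are hypothetical, it assumes the LHS itself never contains ">", and trimming lines before testing them is an assumption rather than documented behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reading one line of a rules file (not GATE code).
final class RuleLineSketch {

  /** Parsed form of one rule: raw LHS text, annotation type, features. */
  static final class Rule {
    final String lhs;
    final String type;
    final Map<String, String> features;

    Rule(String lhs, String type, Map<String, String> features) {
      this.lhs = lhs;
      this.type = type;
      this.features = features;
    }
  }

  /** Empty lines and lines starting with # or // carry no rule. */
  static boolean isIgnored(String line) {
    String t = line.trim(); // assumption: surrounding whitespace is irrelevant
    return t.isEmpty() || t.startsWith("#") || t.startsWith("//");
  }

  /** Splits "LHS > Type;attr=value;..." into its parts. */
  static Rule parse(String line) {
    int sep = line.indexOf('>'); // assumption: first ">" separates LHS and RHS
    if (sep < 0) {
      throw new IllegalArgumentException("No '>' separator in rule: " + line);
    }
    String lhs = line.substring(0, sep).trim();
    String[] parts = line.substring(sep + 1).trim().split(";");
    String type = parts[0].trim();
    Map<String, String> features = new LinkedHashMap<>();
    for (int i = 1; i < parts.length; i++) {
      String pair = parts[i].trim();
      if (pair.isEmpty()) {
        continue;
      }
      int eq = pair.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Expected attr=value but got: " + pair);
      }
      features.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
    }
    return new Rule(lhs, type, features);
  }

  public static void main(String[] args) {
    String line = "\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;";
    if (!isIgnored(line)) {
      Rule r = parse(line);
      System.out.println(r.lhs + " -> " + r.type + " " + r.features);
    }
  }
}
```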

See Also:
Serialized Form

Inner classes inherited from class gate.creole.AbstractProcessingResource
AbstractProcessingResource.InternalStatusListener, AbstractProcessingResource.IntervalProgressListener
 
Field Summary
protected  String annotationSetName
          The annotation set where the new annotations will be added
private static boolean DEBUG
          Debug flag
protected static String defaultResourceName
           
protected  Set dfsmStates
          A set containing all the states of the deterministic machine
protected  DFSMState dInitialState
          The initial state of the deterministic machine
private  String encoding
           
protected  FeatureMap features
           
protected  Set fsmStates
          A set containing all the states of the non-deterministic machine
(package private) static Set ignoreTokens
          A set of strings representing tokens to be ignored (e.g. blanks)
protected  FSMState initialState
          The initial state of the non-deterministic machine
(package private) static String LHStoRHS
          The separator from LHS to RHS
static int maxTypeId
          The maximum int value used internally as a type id
protected  Map newStates
           
private  Vector progressListeners
           
private  String rulesResourceName
           
private  URL rulesURL
           
static String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
           
static String SIMP_TOK_DOCUMENT_PARAMETER_NAME
           
static String SIMP_TOK_ENCODING_PARAMETER_NAME
           
static String SIMP_TOK_RULES_URL_PARAMETER_NAME
           
static Map stringTypeIds
          Maps from type names to type internal ids
static Map typeIds
          Maps from int (the static value on Character) to int (the internal value used by the tokeniser).
static String[] typeMnemonics
          Maps the internal type ids to the type names
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted, statusListeners
 
Fields inherited from class gate.creole.AbstractResource
name, serialVersionUID
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
SimpleTokeniser()
          Creates a tokeniser
 
Method Summary
(package private) static void ()
          The static initialiser will inspect the class Character using reflection to find all the public static members and will map them to ids starting from 0.
(package private)  void eliminateVoidTransitions()
          Converts the FSM from a non-deterministic to a deterministic one by eliminating all the unrestricted transitions.
 void execute()
          The method that does the actual tokenisation.
 String getAnnotationSetName()
           
 String getDFSMgml()
          Returns a string representation of the deterministic FSM graph using GML.
 String getEncoding()
           
 FeatureMap getFeatures()
          Get the feature set
 String getFSMgml()
          Returns a string representation of the non-deterministic FSM graph using GML (Graph modelling language).
 String getRulesResourceName()
           
 URL getRulesURL()
          Gets the value of the rulesURL property which holds a URL to the file containing the rules for this tokeniser.
 Resource init()
          Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser.
private  AbstractSet lambdaClosure(Set s)
          Converts the finite state machine to a deterministic one.
(package private)  FSMState parseLHS(FSMState startState, StringTokenizer st, String until)
          Parses a part or the entire LHS.
(package private)  String parseQuotedString(StringTokenizer st, String until)
          Parses from the given string tokeniser until it finds a specific delimiter.
(package private)  void parseRule(String line)
          Parses one input line containing a tokeniser rule.
 void reset()
          Prepares this Processing resource for a new run.
 void setAnnotationSetName(String newAnnotationSetName)
           
 void setEncoding(String newEncoding)
           
 void setFeatures(FeatureMap features)
          Set the feature set
 void setRulesResourceName(String newRulesResourceName)
           
 void setRulesURL(URL newRulesURL)
          Sets the value of the rulesURL property which holds a URL to the file containing the rules for this tokeniser.
protected static String skipIgnoreTokens(StringTokenizer st)
          Skips the ignorable tokens from the input returning the first significant token.
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait
 
Methods inherited from interface gate.ProcessingResource
interrupt, isInterrupted, reInit
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Field Detail

SIMP_TOK_DOCUMENT_PARAMETER_NAME

public static final String SIMP_TOK_DOCUMENT_PARAMETER_NAME

SIMP_TOK_ANNOT_SET_PARAMETER_NAME

public static final String SIMP_TOK_ANNOT_SET_PARAMETER_NAME

SIMP_TOK_RULES_URL_PARAMETER_NAME

public static final String SIMP_TOK_RULES_URL_PARAMETER_NAME

SIMP_TOK_ENCODING_PARAMETER_NAME

public static final String SIMP_TOK_ENCODING_PARAMETER_NAME

DEBUG

private static final boolean DEBUG
Debug flag

features

protected FeatureMap features

annotationSetName

protected String annotationSetName
The annotation set where the new annotations will be added.

initialState

protected FSMState initialState
The initial state of the non-deterministic machine.

fsmStates

protected Set fsmStates
A set containing all the states of the non-deterministic machine.

dInitialState

protected DFSMState dInitialState
The initial state of the deterministic machine.

dfsmStates

protected Set dfsmStates
A set containing all the states of the deterministic machine.

LHStoRHS

static String LHStoRHS
The separator from LHS to RHS.

ignoreTokens

static Set ignoreTokens
A set of strings representing tokens to be ignored (e.g. blanks).

typeIds

public static Map typeIds
Maps from int (the static value on Character) to int (the internal value used by the tokeniser). The int values used by the tokeniser are consecutive values, starting from 0 and going as high as necessary. They map all the public static int members on Character.

maxTypeId

public static int maxTypeId
The maximum int value used internally as a type id.

typeMnemonics

public static String[] typeMnemonics
Maps the internal type ids to the type names.

stringTypeIds

public static Map stringTypeIds
Maps from type names to type internal ids.

defaultResourceName

protected static String defaultResourceName

rulesResourceName

private String rulesResourceName

rulesURL

private URL rulesURL

encoding

private String encoding

progressListeners

private transient Vector progressListeners

newStates

protected transient Map newStates

Constructor Detail

SimpleTokeniser

public SimpleTokeniser()
Creates a tokeniser

Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser.
Overrides:
init in class AbstractProcessingResource
Throws:
ResourceInstantiationException -  

reset

public void reset()
Prepares this Processing resource for a new run.

parseRule

void parseRule(String line)
         throws TokeniserException
Parses one input line containing a tokeniser rule. This will create the necessary FSMState objects and the links between them.
Parameters:
line - the string containing the rule

parseLHS

FSMState parseLHS(FSMState startState,
                  StringTokenizer st,
                  String until)
            throws TokeniserException
Parses a part or the entire LHS.
Parameters:
startState - an FSMState object representing the initial state for the small FSM that will recognise the (part of the) rule parsed by this method.
st - a StringTokenizer that provides the input
until - the string that marks the end of the section to be recognised. This method will first be called by parseRule(String) with " >" in order to parse the entire LHS. When necessary, it will call itself recursively to parse a region of the LHS (e.g. a part enclosed in "(" and ")").

parseQuotedString

String parseQuotedString(StringTokenizer st,
                         String until)
                   throws TokeniserException
Parses from the given string tokeniser until it finds a specific delimiter. One use for this method is to read everything until the first quote.
Parameters:
st - a StringTokenizer that provides the input
until - a String representing the end delimiter.

skipIgnoreTokens

protected static String skipIgnoreTokens(StringTokenizer st)
Skips the ignorable tokens from the input, returning the first significant token. The ignorable tokens are defined by the ignoreTokens set.
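As an illustration of how parseQuotedString and skipIgnoreTokens could cooperate over a StringTokenizer that returns its delimiters as tokens, here is a minimal sketch. It is not the GATE source: the class name, the contents of the IGNORE set, the delimiter string and the behaviour at end of input are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical stand-ins for the two helpers (not GATE code).
final class TokenHelpersSketch {

  // Assumption: blanks and line breaks are the ignorable tokens.
  static final Set<String> IGNORE =
      new HashSet<>(Arrays.asList(" ", "\t", "\r", "\n"));

  /** Returns the first token not in IGNORE, or null when input is exhausted. */
  static String skipIgnoreTokens(StringTokenizer st) {
    while (st.hasMoreTokens()) {
      String token = st.nextToken();
      if (!IGNORE.contains(token)) {
        return token;
      }
    }
    return null;
  }

  /** Concatenates tokens up to, but not including, the delimiter 'until'. */
  static String parseQuotedString(StringTokenizer st, String until) {
    StringBuilder sb = new StringBuilder();
    while (st.hasMoreTokens()) {
      String token = st.nextToken();
      if (token.equals(until)) {
        return sb.toString();
      }
      sb.append(token);
    }
    throw new IllegalArgumentException("Missing closing delimiter: " + until);
  }

  public static void main(String[] args) {
    // 'true' makes the tokenizer return the delimiters themselves as tokens.
    StringTokenizer st =
        new StringTokenizer("  \"UPPERCASE_LETTER\" +", " \t\r\n\"+*|()>", true);
    String opening = skipIgnoreTokens(st);        // the opening quote
    String name = parseQuotedString(st, "\"");    // text up to the closing quote
    String operator = skipIgnoreTokens(st);       // the "+" operator
    System.out.println(opening + " " + name + " " + operator);
  }
}
```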

lambdaClosure

private AbstractSet lambdaClosure(Set s)
Converts the finite state machine to a deterministic one.
Parameters:
s -  

eliminateVoidTransitions

void eliminateVoidTransitions()
                        throws TokeniserException
Converts the FSM from a non-deterministic to a deterministic one by eliminating all the unrestricted transitions.
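Eliminating unrestricted (lambda) transitions is commonly built on a lambda-closure computation: the set of all states reachable from a given set by following only unrestricted transitions. The sketch below shows that standard building block; whether the private lambdaClosure(Set) helper works exactly this way is an assumption, and the State class here is a hypothetical stand-in for FSMState.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a lambda-closure computation (not GATE code).
final class LambdaClosureSketch {

  /** Minimal NFA state; only unrestricted (lambda) transitions are modelled. */
  static final class State {
    final String name;
    final List<State> lambdaTargets = new ArrayList<>();

    State(String name) {
      this.name = name;
    }
  }

  /** Returns every state reachable from 'start' via lambda transitions only. */
  static Set<State> lambdaClosure(Set<State> start) {
    Set<State> closure = new HashSet<>(start);
    Deque<State> pending = new ArrayDeque<>(start);
    while (!pending.isEmpty()) {
      State current = pending.pop();
      for (State target : current.lambdaTargets) {
        // add() returns false for states already seen, so cycles terminate.
        if (closure.add(target)) {
          pending.push(target);
        }
      }
    }
    return closure;
  }

  public static void main(String[] args) {
    State a = new State("a");
    State b = new State("b");
    State c = new State("c");
    a.lambdaTargets.add(b);
    b.lambdaTargets.add(c);
    Set<State> start = new HashSet<>();
    start.add(a);
    for (State s : lambdaClosure(start)) {
      System.out.println(s.name);
    }
  }
}
```

In a subset construction, each state of the deterministic machine corresponds to one such closed set of non-deterministic states.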

getFSMgml

public String getFSMgml()
Returns a string representation of the non-deterministic FSM graph using GML (Graph Modelling Language).

getDFSMgml

public String getDFSMgml()
Returns a string representation of the deterministic FSM graph using GML.
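For readers unfamiliar with GML, the following sketch renders a small directed graph in that format. The exact layout and labels produced by getFSMgml() and getDFSMgml() are not documented here, so the node and edge labels, the class name and the method signature are all assumptions.

```java
// Hypothetical sketch of emitting a graph as GML text (not GATE code).
final class GmlSketch {

  /**
   * Renders a directed graph as GML. Node ids are the array indices;
   * edges[i] is {source, target} and is labelled with edgeLabels[i].
   */
  static String toGml(String[] nodeLabels, int[][] edges, String[] edgeLabels) {
    StringBuilder sb = new StringBuilder();
    sb.append("graph [\n  directed 1\n");
    for (int i = 0; i < nodeLabels.length; i++) {
      sb.append("  node [ id ").append(i)
        .append(" label \"").append(nodeLabels[i]).append("\" ]\n");
    }
    for (int i = 0; i < edges.length; i++) {
      sb.append("  edge [ source ").append(edges[i][0])
        .append(" target ").append(edges[i][1])
        .append(" label \"").append(edgeLabels[i]).append("\" ]\n");
    }
    return sb.append("]\n").toString();
  }

  public static void main(String[] args) {
    System.out.print(toGml(
        new String[] {"start", "final"},
        new int[][] {{0, 1}},
        new String[] {"UPPERCASE_LETTER"}));
  }
}
```

The resulting text can be loaded into graph viewers that understand GML, which is useful when inspecting the machine built from a rule set.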

getFeatures

public FeatureMap getFeatures()
Description copied from interface: FeatureBearer
Get the feature set
Overrides:
getFeatures in class AbstractFeatureBearer

setFeatures

public void setFeatures(FeatureMap features)
Description copied from interface: FeatureBearer
Set the feature set
Overrides:
setFeatures in class AbstractFeatureBearer

execute

public void execute()
             throws ExecutionException
The method that does the actual tokenisation.
Overrides:
execute in class AbstractProcessingResource

setRulesURL

public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property which holds a URL to the file containing the rules for this tokeniser.
Parameters:
newRulesURL -  

getRulesURL

public URL getRulesURL()
Gets the value of the rulesURL property which holds a URL to the file containing the rules for this tokeniser.

setAnnotationSetName

public void setAnnotationSetName(String newAnnotationSetName)

getAnnotationSetName

public String getAnnotationSetName()

setRulesResourceName

public void setRulesResourceName(String newRulesResourceName)

getRulesResourceName

public String getRulesResourceName()

setEncoding

public void setEncoding(String newEncoding)

getEncoding

public String getEncoding()

static void ()
The static initialiser will inspect the class Character using reflection to find all the public static members and will map them to ids starting from 0. After that it will build all the static data: typeIds, maxTypeId, typeMnemonics, stringTypeIds.
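A minimal sketch of such a reflective initialiser is shown below. It is not the GATE source: restricting the scan to byte-typed constants, skipping the DIRECTIONALITY_ constants and sorting by name to get stable ids are assumptions made for the illustration, and the class name is hypothetical. Only the four static names mirror the fields documented above.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of building the type tables by reflection (not GATE code).
final class CharacterTypeIdsSketch {

  static final Map<Integer, Integer> typeIds = new HashMap<>();      // Character value -> internal id
  static final Map<String, Integer> stringTypeIds = new HashMap<>(); // type name -> internal id
  static final String[] typeMnemonics;                               // internal id -> type name
  static final int maxTypeId;

  static {
    List<Field> categoryFields = new ArrayList<>();
    for (Field f : Character.class.getFields()) { // getFields() returns public members only
      // Assumption: the Unicode general categories are the public static
      // byte constants whose names do not start with DIRECTIONALITY_.
      if (Modifier.isStatic(f.getModifiers())
          && f.getType() == byte.class
          && !f.getName().startsWith("DIRECTIONALITY_")) {
        categoryFields.add(f);
      }
    }
    // getFields() guarantees no order, so sort to make the ids reproducible.
    categoryFields.sort(Comparator.comparing(Field::getName));

    typeMnemonics = new String[categoryFields.size()];
    maxTypeId = categoryFields.size() - 1;
    try {
      for (int id = 0; id < categoryFields.size(); id++) {
        Field f = categoryFields.get(id);
        typeIds.put((int) f.getByte(null), id); // null receiver: static field
        stringTypeIds.put(f.getName(), id);
        typeMnemonics[id] = f.getName();
      }
    } catch (IllegalAccessException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  /** Looks up the category name of a character through the tables. */
  static String mnemonicOf(char c) {
    return typeMnemonics[typeIds.get(Character.getType(c))];
  }

  public static void main(String[] args) {
    System.out.println(mnemonicOf('A'));
    System.out.println(mnemonicOf('7'));
  }
}
```

Consecutive ids starting from 0 let the tables be plain arrays indexed by id, which is presumably why the tokeniser remaps the Character values rather than using them directly.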