java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractProcessingResource
gate.creole.AbstractLanguageAnalyser
gate.creole.tokeniser.SimpleTokeniser
Implementation of a Unicode rule based tokeniser.
The tokeniser gets its rules from a file, an InputStream, or a Reader, which should be passed to one of the constructors.
The implementation is based on a finite state machine that is built from the set of rules.
A rule has two sides, the left hand side (LHS) and the right hand side (RHS),
which are separated by the ">" character. The LHS is a
regular expression that will be matched against the input, while the RHS
describes a Gate2 annotation in terms of annotation type and attribute-value
pairs.
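The LHS/RHS split described above can be illustrated with a minimal sketch. It separates a rule at the first ">" and reads the RHS as an annotation type followed by attribute=value pairs. The class name RuleSketch is hypothetical, and the assumption that RHS fields are separated by ";" is taken only from the sample rule shown in this description. This is not the parser used by SimpleTokeniser, and it does not handle a ">" that appears inside a quoted part of the LHS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of GATE: splits one rule into its LHS and RHS parts. */
final class RuleSketch {
    final String lhs;
    final String annotationType;
    final Map<String, String> attributes = new LinkedHashMap<>();

    RuleSketch(String rule) {
        // The LHS and the RHS are separated by the ">" character.
        int sep = rule.indexOf('>');
        if (sep < 0) {
            throw new IllegalArgumentException("No '>' separator in rule: " + rule);
        }
        lhs = rule.substring(0, sep).trim();

        // Assumption: RHS fields are ';'-separated, the annotation type first,
        // followed by attribute=value pairs.
        String[] fields = rule.substring(sep + 1).trim().split(";");
        annotationType = fields[0].trim();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            if (field.isEmpty()) {
                continue;
            }
            int eq = field.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected attribute=value, got: " + field);
            }
            attributes.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
    }

    public static void main(String[] args) {
        RuleSketch r = new RuleSketch(
            "\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;");
        System.out.println(r.lhs + " -> " + r.annotationType + " " + r.attributes);
    }
}
```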
The matching is done using Unicode enumerated types as defined by the Character
class. At the time of writing this class, the
supported Unicode categories were:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;
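The category names in the sample rule correspond to general-category constants on java.lang.Character, which Character.getType returns for a given character. The sketch below checks by hand whether a word has the shape the sample rule describes: one UPPERCASE_LETTER followed by one or more LOWERCASE_LETTER characters. The class and method names are hypothetical, and the check is a plain loop rather than the finite state machine that the tokeniser builds.

```java
/** Hypothetical illustration, not part of GATE. */
final class CategorySketch {

    /**
     * Returns true if the word is one UPPERCASE_LETTER followed by one or more
     * LOWERCASE_LETTER characters. Works on char values only, so supplementary
     * code points are not handled.
     */
    static boolean isUpperInitial(String word) {
        if (word.length() < 2) {
            return false;
        }
        if (Character.getType(word.charAt(0)) != Character.UPPERCASE_LETTER) {
            return false;
        }
        for (int i = 1; i < word.length(); i++) {
            if (Character.getType(word.charAt(i)) != Character.LOWERCASE_LETTER) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isUpperInitial("Hello"));
        System.out.println(isUpperInitial("hello"));
    }
}
```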
Nested Class Summary |
Nested classes inherited from class gate.creole.AbstractProcessingResource |
AbstractProcessingResource.InternalStatusListener, AbstractProcessingResource.IntervalProgressListener |
Field Summary | |
protected String |
annotationSetName
The annotation set where the new annotations will be added |
private static boolean |
DEBUG
Debug flag |
protected static String |
defaultResourceName
|
protected Set |
dfsmStates
A set containing all the states of the deterministic machine |
protected DFSMState |
dInitialState
The initial state of the deterministic machine |
private String |
encoding
|
protected FeatureMap |
features
|
protected Set |
fsmStates
A set containing all the states of the non-deterministic machine |
(package private) static Set |
ignoreTokens
A set of strings representing tokens to be ignored (e.g. |
protected FSMState |
initialState
The initial state of the non-deterministic machine |
(package private) static String |
LHStoRHS
The separator from LHS to RHS |
static int |
maxTypeId
The maximum int value used internally as a type id |
protected Map |
newStates
|
private Vector |
progressListeners
|
private String |
rulesResourceName
|
private URL |
rulesURL
|
static String |
SIMP_TOK_ANNOT_SET_PARAMETER_NAME
|
static String |
SIMP_TOK_DOCUMENT_PARAMETER_NAME
|
static String |
SIMP_TOK_ENCODING_PARAMETER_NAME
|
static String |
SIMP_TOK_RULES_URL_PARAMETER_NAME
|
static Map |
stringTypeIds
Maps from type names to internal type ids |
static Map |
typeIds
Maps from int (the static value on Character) to int
(the internal value used by the tokeniser). |
static String[] |
typeMnemonics
Maps the internal type ids to the type names |
Fields inherited from class gate.creole.AbstractLanguageAnalyser |
corpus, document |
Fields inherited from class gate.creole.AbstractProcessingResource |
interrupted |
Fields inherited from class gate.creole.AbstractResource |
name |
Constructor Summary | |
SimpleTokeniser()
Creates a tokeniser |
Method Summary | |
(package private) void |
eliminateVoidTransitions()
Converts the FSM from a non-deterministic to a deterministic one by eliminating all the unrestricted transitions. |
void |
execute()
The method that does the actual tokenisation. |
String |
getAnnotationSetName()
|
String |
getDFSMgml()
Returns a string representation of the deterministic FSM graph using GML. |
String |
getEncoding()
|
FeatureMap |
getFeatures()
Get the feature set |
String |
getFSMgml()
Returns a string representation of the non-deterministic FSM graph using GML (Graph modelling language). |
String |
getRulesResourceName()
|
URL |
getRulesURL()
Gets the value of the rulesURL property, which holds a
URL to the file containing the rules for this tokeniser. |
Resource |
init()
Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser. |
private AbstractSet |
lambdaClosure(Set s)
Converts the finite state machine to a deterministic one. |
(package private) FSMState |
parseLHS(FSMState startState,
StringTokenizer st,
String until)
Parses a part or the entire LHS. |
(package private) String |
parseQuotedString(StringTokenizer st,
String until)
Parses from the given string tokeniser until it finds a specific delimiter. |
(package private) void |
parseRule(String line)
Parses one input line containing a tokeniser rule. |
void |
reset()
Prepares this Processing resource for a new run. |
void |
setAnnotationSetName(String newAnnotationSetName)
|
void |
setEncoding(String newEncoding)
|
void |
setFeatures(FeatureMap features)
Set the feature set |
void |
setRulesResourceName(String newRulesResourceName)
|
void |
setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL
to the file containing the rules for this tokeniser. |
protected static String |
skipIgnoreTokens(StringTokenizer st)
Skips the ignorable tokens from the input returning the first significant token. |
Methods inherited from class gate.creole.AbstractLanguageAnalyser |
getCorpus, getDocument, setCorpus, setDocument |
Methods inherited from class gate.creole.AbstractProcessingResource |
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener |
Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface gate.ProcessingResource |
reInit |
Methods inherited from interface gate.Resource |
cleanup, getParameterValue, setParameterValue, setParameterValues |
Methods inherited from interface gate.util.NameBearer |
getName, setName |
Methods inherited from interface gate.Executable |
interrupt, isInterrupted |
Field Detail |
public static final String SIMP_TOK_DOCUMENT_PARAMETER_NAME
public static final String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
public static final String SIMP_TOK_RULES_URL_PARAMETER_NAME
public static final String SIMP_TOK_ENCODING_PARAMETER_NAME
private static final boolean DEBUG
protected FeatureMap features
protected String annotationSetName
protected FSMState initialState
protected Set fsmStates
protected DFSMState dInitialState
protected Set dfsmStates
static String LHStoRHS
static Set ignoreTokens
public static Map typeIds
Maps from int (the static value on Character) to int (the internal value used by the tokeniser). The int values used by the
tokeniser are consecutive values, starting from 0 and going as high as
necessary.
They map all the public static int members on Character.
public static int maxTypeId
public static String[] typeMnemonics
public static Map stringTypeIds
protected static String defaultResourceName
private String rulesResourceName
private URL rulesURL
private String encoding
private transient Vector progressListeners
protected transient Map newStates
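The relationship between the typeIds, maxTypeId, typeMnemonics and stringTypeIds fields can be sketched as below. The sketch assumes a hand-picked list of three Character categories and a hypothetical class TypeIdSketch. How SimpleTokeniser actually enumerates the members of Character is not shown on this page, so only the "consecutive ids starting from 0" scheme is illustrated.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the id tables, not the GATE implementation. */
final class TypeIdSketch {
    /** Character category value -> internal id (consecutive, starting from 0). */
    static final Map<Integer, Integer> typeIds = new HashMap<>();
    /** Category name -> internal id. */
    static final Map<String, Integer> stringTypeIds = new HashMap<>();
    /** Internal id -> category name. */
    static final String[] typeMnemonics;
    /** The highest internal id in use. */
    static final int maxTypeId;

    static {
        // Assumption: only three categories, chosen for illustration.
        int[] categories = {
            Character.UPPERCASE_LETTER,
            Character.LOWERCASE_LETTER,
            Character.DECIMAL_DIGIT_NUMBER
        };
        String[] names = {"UPPERCASE_LETTER", "LOWERCASE_LETTER", "DECIMAL_DIGIT_NUMBER"};

        typeMnemonics = new String[categories.length];
        for (int id = 0; id < categories.length; id++) {
            typeIds.put(categories[id], id);
            typeMnemonics[id] = names[id];
            stringTypeIds.put(names[id], id);
        }
        maxTypeId = categories.length - 1;
    }

    /** Returns the internal id for the character's category, or -1 if it is not mapped. */
    static int internalTypeOf(char c) {
        Integer id = typeIds.get(Character.getType(c));
        return id == null ? -1 : id;
    }

    public static void main(String[] args) {
        int id = internalTypeOf('a');
        System.out.println(id + " = " + typeMnemonics[id]);
    }
}
```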
Constructor Detail |
public SimpleTokeniser()
Method Detail |
public Resource init() throws ResourceInstantiationException
Specified by:
init in interface Resource
Overrides:
init in class AbstractProcessingResource
Throws:
ResourceInstantiationException
public void reset()
void parseRule(String line) throws TokeniserException
Parameters:
line - the string containing the rule
Throws:
TokeniserException
FSMState parseLHS(FSMState startState, StringTokenizer st, String until) throws TokeniserException
Parameters:
startState - an FSMState object representing the initial state for the small FSM that will recognise the part of the rule parsed by this method.
st - a StringTokenizer that provides the input.
until - the string that marks the end of the section to be recognised. This method will first be called by parseRule(String) with " >" in order to parse the entire LHS. When necessary, it will call itself again to parse a region of the LHS (e.g. a part enclosed in "(" and ")").
Throws:
TokeniserException
String parseQuotedString(StringTokenizer st, String until) throws TokeniserException
Parameters:
st - a StringTokenizer that provides the input.
until - a String representing the end delimiter.
Throws:
TokeniserException
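Reading from a StringTokenizer up to an end delimiter can be sketched as follows. The class QuotedStringSketch and the method readUntil are hypothetical. The sketch assumes the tokenizer was created with returnDelims set to true, so that the delimiter itself arrives as a token, and it throws IllegalArgumentException where the real method declares TokeniserException.

```java
import java.util.StringTokenizer;

/** Hypothetical illustration, not the GATE implementation. */
final class QuotedStringSketch {

    /**
     * Concatenates tokens from st until a token equal to `until` is seen.
     * The delimiter token is consumed but not included in the result.
     */
    static String readUntil(StringTokenizer st, String until) {
        StringBuilder sb = new StringBuilder();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(until)) {
                return sb.toString();
            }
            sb.append(token);
        }
        throw new IllegalArgumentException("Missing end delimiter: " + until);
    }

    public static void main(String[] args) {
        // returnDelims = true, so quotes and spaces are returned as tokens.
        StringTokenizer st = new StringTokenizer("\"UPPERCASE_LETTER\" rest", "\" ", true);
        st.nextToken(); // consume the opening quote
        System.out.println(readUntil(st, "\""));
    }
}
```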
protected static String skipIgnoreTokens(StringTokenizer st)
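Skips the ignorable tokens from the input, returning the first significant token. The ignorable tokens are defined by the ignoreTokens set.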
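Skipping ignorable tokens can be sketched with a set lookup in a loop. The class SkipSketch is hypothetical, and the choice of whitespace strings as the ignorable tokens is an assumption made for this example; the actual contents of ignoreTokens are not listed on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

/** Hypothetical illustration, not the GATE implementation. */
final class SkipSketch {
    // Assumption: whitespace tokens are the ones to ignore.
    static final Set<String> IGNORE =
        new HashSet<>(Arrays.asList(" ", "\t", "\r", "\n"));

    /** Returns the first token not in IGNORE, or null if the input is exhausted. */
    static String skipIgnorable(StringTokenizer st) {
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!IGNORE.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // returnDelims = true, so whitespace and parentheses arrive as tokens.
        StringTokenizer st = new StringTokenizer("  \t(abc", " \t()", true);
        System.out.println(skipIgnorable(st));
    }
}
```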
private AbstractSet lambdaClosure(Set s)
Parameters:
s -
void eliminateVoidTransitions() throws TokeniserException
Throws:
TokeniserException
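A lambda closure is the usual building block for removing unrestricted (void) transitions when a non-deterministic machine is made deterministic. The sketch below computes the set of states reachable through void transitions alone. It is a textbook formulation under stated assumptions, not the GATE code: states are plain Integer ids rather than FSMState objects, and the class name and the voidTransitions map are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of a lambda (epsilon) closure. */
final class LambdaClosureSketch {

    /**
     * Returns every state reachable from the start set by following only
     * void transitions, including the start states themselves.
     */
    static Set<Integer> lambdaClosure(Set<Integer> start,
                                      Map<Integer, Set<Integer>> voidTransitions) {
        Set<Integer> closure = new HashSet<>(start);
        Deque<Integer> pending = new ArrayDeque<>(start);
        while (!pending.isEmpty()) {
            int state = pending.pop();
            Set<Integer> targets =
                voidTransitions.getOrDefault(state, Collections.<Integer>emptySet());
            for (int next : targets) {
                // add() returns false for states already seen, so cycles terminate.
                if (closure.add(next)) {
                    pending.push(next);
                }
            }
        }
        return closure;
    }
}
```

In a subset construction, each state of the deterministic machine would then correspond to one such closure.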
public String getFSMgml()
public String getDFSMgml()
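For readers unfamiliar with GML, the sketch below writes a small directed graph in that format, with one node entry per state and one edge entry per transition. The exact layout and attributes produced by getFSMgml() and getDFSMgml() are not documented on this page, so the class GmlSketch, its parameters and the chosen labels are assumptions that only show the general shape of a GML document.

```java
/** Hypothetical illustration of GML output, not the GATE implementation. */
final class GmlSketch {

    /**
     * Builds a GML string. edges[i] holds {sourceId, targetId} and
     * edgeLabels[i] is the label of that edge.
     */
    static String toGml(String[] nodeLabels, int[][] edges, String[] edgeLabels) {
        StringBuilder sb = new StringBuilder("graph [\n  directed 1\n");
        for (int i = 0; i < nodeLabels.length; i++) {
            sb.append("  node [ id ").append(i)
              .append(" label \"").append(nodeLabels[i]).append("\" ]\n");
        }
        for (int i = 0; i < edges.length; i++) {
            sb.append("  edge [ source ").append(edges[i][0])
              .append(" target ").append(edges[i][1])
              .append(" label \"").append(edgeLabels[i]).append("\" ]\n");
        }
        return sb.append("]\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toGml(
            new String[] {"start", "final"},
            new int[][] {{0, 1}},
            new String[] {"UPPERCASE_LETTER"}));
    }
}
```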
public FeatureMap getFeatures()
Description copied from interface: FeatureBearer
Get the feature set
Specified by:
getFeatures in interface FeatureBearer
Overrides:
getFeatures in class AbstractFeatureBearer
public void setFeatures(FeatureMap features)
Description copied from interface: FeatureBearer
Set the feature set
Specified by:
setFeatures in interface FeatureBearer
Overrides:
setFeatures in class AbstractFeatureBearer
public void execute() throws ExecutionException
Specified by:
execute in interface Executable
Overrides:
execute in class AbstractProcessingResource
Throws:
ExecutionException
public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Parameters:
newRulesURL -
public URL getRulesURL()
Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
public void setAnnotationSetName(String newAnnotationSetName)
public String getAnnotationSetName()
public void setRulesResourceName(String newRulesResourceName)
public String getRulesResourceName()
public void setEncoding(String newEncoding)
public String getEncoding()