This directory contains a Flex (and C) program that tokenises English
text in plain ASCII or Latin-1 (8-bit) encodings. Sentence boundaries
are assumed to be marked already with '^ ' (for example by the
'sentence' tool also provided here). Text must be tokenised before
tagging, morphological analysis and parsing.

The tokeniser (the file token.flex) was written by John Carroll
(University of Sussex, 2001-3).

----------------------------------------------------------------

Files:

   token.flex
   run.prl
   token.test

Compilation:

   $ flex token.flex            - compiling the flex code
   $ gcc lex.yy.c -o token      - compiling the C code
   $ rm lex.yy.c                - deleting the intermediate file

Command line options: none.

1) Run the tokeniser:

   token < input_text > output

   A test run:

   $ token < token.test > output

2) Run the tokeniser, output lines numbered:

   perl run.prl input_text > output

   Before running, change the assignment to the variable $tokenexe in
   run.prl so that it points to the executable you wish to use.

   A test run:

   $ perl run.prl token.test > output

----------------------------------------------------------------

About the tokeniser:

The tokeniser determines the basic units (words/tokens) for the tagger
and parser. One basic function is to separate words from punctuation
(taking care with ellipses, double quotes etc.), from possessive
markers ("'s", "'") and from contractions ("n't", "'ll"). Another is
to recognise SGML entities (e.g. &pound;). The program also handles
multi-character punctuation ("...", "--"). All control characters, as
well as space and the Latin-1 non-breakable space, are treated as
whitespace.

The program relies only on regular expressions and simple rules, and
is therefore far from foolproof. However, it should provide very fast
initial processing of raw ASCII text (e.g. from the Internet) with
relatively high accuracy, and it can easily be customised and adjusted
for special cases.

For further details, see the comments included in token.flex.
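
The whitespace rule described above (control characters, space and the
Latin-1 non-breakable space all count as whitespace) can be sketched in
C as follows. This is an illustration only and is not taken from
token.flex: the function names is_token_space and count_chunks are
hypothetical, and the sketch assumes that "control characters" means
the ASCII range 0x00-0x1F plus DEL (0x7F). Whether the tokeniser also
treats the Latin-1 range 0x80-0x9F this way is not stated here; see
token.flex for the actual patterns.

```c
#include <stddef.h>

/* Hypothetical helper, not part of token.flex.
   Returns 1 if the byte counts as whitespace under the rule above. */
int is_token_space(unsigned char c)
{
    return c <= 0x20      /* ASCII control characters and space */
        || c == 0x7F      /* DEL (assumed to count as a control character) */
        || c == 0xA0;     /* Latin-1 non-breakable space */
}

/* Hypothetical helper: counts the maximal runs of non-whitespace bytes
   in a NUL-terminated Latin-1 string. These runs are only the raw
   chunks; the real tokeniser splits them further (punctuation,
   possessive markers, contractions). */
size_t count_chunks(const char *s)
{
    size_t n = 0;
    int in_chunk = 0;

    for (; *s != '\0'; s++) {
        if (is_token_space((unsigned char)*s)) {
            in_chunk = 0;
        } else if (!in_chunk) {
            in_chunk = 1;   /* first byte of a new chunk */
            n++;
        }
    }
    return n;
}
```

Casting each byte to unsigned char matters for Latin-1 input: on
platforms where plain char is signed, the byte 0xA0 would otherwise
compare as a negative value.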