We present a practical problem: the analysis of a large dataset of heterogeneous documents obtained by crawling the web for unstructured and semi-structured human-readable documents (HTML, PDF) related to web services, together with the services' machine-readable WSDL files. The analysis combines natural language processing (NLP), information extraction (IE), specialized techniques for WSDL analysis, and several approaches to classifying web services (each service being defined by a set of documents). The results of the analysis are exported as RDF for use in the back-end of a portal built on Web 2.0 and Semantic Web technology. Triples representing manual annotations made on the portal are exported back to our application, where they serve to evaluate parts of our analysis and as training data for machine learning (ML) to improve and evaluate the service classification. The application was implemented in the GATE framework, was successfully incorporated into an integrated project, and includes a number of components shared with our group's other projects.