We present a practical problem: the analysis of a large dataset of heterogeneous documents obtained by crawling the web for unstructured and semi-structured human-readable documents (HTML, PDF) related to web services, together with the services' machine-readable WSDL files. The analysis combines natural language processing (NLP), information extraction (IE), specialized techniques for WSDL analysis, and several approaches to classifying web services (each service being defined by a set of documents). The results of the analysis are exported as RDF for use in the back-end of a portal built on Web 2.0 and Semantic Web technology. Triples representing manual annotations made on the portal are exported back to our application, where they serve to evaluate parts of our analysis and as training data for machine learning (ML) to improve and evaluate the service classification. The application was implemented in the GATE framework, was successfully incorporated into an integrated project, and includes a number of components shared with our group's other projects.