gate.util
Class HtmlLinksExtractor

java.lang.Object
  extended by javax.swing.text.html.HTMLEditorKit.ParserCallback
      extended by gate.util.HtmlLinksExtractor

public class HtmlLinksExtractor
extends HTMLEditorKit.ParserCallback

This class extracts links from HTML files. It has been hacked to build the contents of http://gate.ac.uk/sitemap.html; you probably don't want to use it for anything else!

Implements the behaviour of the HTML reader. The HTML parser calls the methods of an object of this class as parsing events occur.
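
The callback pattern described above can be sketched as follows. This is a minimal illustration, not the GATE implementation: the class name LinkCollector and its extract helper are hypothetical, and the sketch assumes that only the HREF attribute of A tags is of interest. It drives the callback with the standard javax.swing.text.html.parser.ParserDelegator.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not part of gate.util.
public class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    // Called by the parser for every non-empty start tag.
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Parses the given HTML text and returns the href values in document order.
    public static List<String> extract(String html) throws IOException {
        LinkCollector collector = new LinkCollector();
        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(reader, collector, true);
        }
        return collector.getLinks();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"index.html\">Home</a></body></html>";
        System.out.println(extract(html));
    }
}
```

Keeping the collected links in an instance field, rather than in static state, lets one collector be created per file.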


Field Summary
private  HTML.Tag currentTag
          The tag currently being processed
(package private) static String currFile
          Name of the file we're currently processing
(package private) static String currPath
          Path to the file we're currently processing
private static boolean DEBUG
          Debug flag
(package private) static String endUl
          Will contain </UL> after the first title
(package private) static boolean firstTitle
          Whether a title has already been processed
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
HtmlLinksExtractor()
           
 
Method Summary
 void flush()
          This method is called once, when the HTML parser reaches the end of its input stream, to notify the parser callback that there is nothing more to parse.
 void handleComment(char[] text, int pos)
          This method is called when the HTML parser encounters a comment.
 void handleEndTag(HTML.Tag t, int pos)
          This method is called when the HTML parser encounters an end tag, that is, the closing tag paired with a start tag.
 void handleError(String errorMsg, int pos)
          This method is called when the HTML parser encounters an error; it is up to the programmer whether to handle it.
 void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
          This method is called when the HTML parser encounters an empty tag.
 void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
          This method is called when the HTML parser encounters a start tag, that is, a tag that is paired with an end tag and is not an empty one.
 void handleText(char[] text, int pos)
          This method is called when the HTML parser encounters text (PCDATA).
private static List listAllFiles(File aFile, Set foldersToIgnore)
          Given a folder, recursively lists all the files contained in it.
private static void listFilesRec(File aFile, List fileNames, List foldersToExplore, Set foldersToIgnore)
          Helper method for listAllFiles
static void main(String[] args)
          Extract links from all .html files below a directory
private  void printAttributes(MutableAttributeSet a)
           
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
handleEndOfLineString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEBUG

private static final boolean DEBUG
Debug flag

See Also:
Constant Field Values

currentTag

private HTML.Tag currentTag
The tag currently being processed


firstTitle

static boolean firstTitle
Whether a title has already been processed


endUl

static String endUl
Will contain </UL> after the first title


currFile

static String currFile
Name of the file we're currently processing


currPath

static String currPath
Path to the file we're currently processing

Constructor Detail

HtmlLinksExtractor

public HtmlLinksExtractor()
Method Detail

handleStartTag

public void handleStartTag(HTML.Tag t,
                           MutableAttributeSet a,
                           int pos)
This method is called when the HTML parser encounters a start tag, that is, a tag that is paired with an end tag and is not an empty one.


printAttributes

private void printAttributes(MutableAttributeSet a)

handleEndTag

public void handleEndTag(HTML.Tag t,
                         int pos)
This method is called when the HTML parser encounters an end tag, that is, the closing tag paired with a start tag.


handleSimpleTag

public void handleSimpleTag(HTML.Tag t,
                            MutableAttributeSet a,
                            int pos)
This method is called when the HTML parser encounters an empty tag.


handleText

public void handleText(char[] text,
                       int pos)
This method is called when the HTML parser encounters text (PCDATA).


handleError

public void handleError(String errorMsg,
                        int pos)
This method is called when the HTML parser encounters an error; it is up to the programmer whether to handle it.


flush

public void flush()
           throws BadLocationException
This method is called once, when the HTML parser reaches the end of its input stream, to notify the parser callback that there is nothing more to parse.

Throws:
BadLocationException

handleComment

public void handleComment(char[] text,
                          int pos)
This method is called when the HTML parser encounters a comment.
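
To see which callback fires for which kind of markup, a small tracing callback can be helpful. The sketch below is illustrative only: EventTracer and its trace helper are hypothetical names, and the event labels ("start:", "simple:", and so on) are an arbitrary choice for this example. Note that the Swing parser may also report synthesized tags (for example an implied head element), so the recorded list can contain more events than the input literally spells out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; records one label per parser event.
public class EventTracer extends HTMLEditorKit.ParserCallback {
    private final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("start:" + t);   // tag that has a matching end tag
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        events.add("end:" + t);
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("simple:" + t);  // empty tag such as <br>
    }

    @Override
    public void handleText(char[] text, int pos) {
        events.add("text:" + new String(text));
    }

    @Override
    public void handleComment(char[] text, int pos) {
        events.add("comment:" + new String(text));
    }

    // Parses the given HTML text and returns the recorded events in order.
    public static List<String> trace(String html) throws IOException {
        EventTracer tracer = new EventTracer();
        new ParserDelegator().parse(new StringReader(html), tracer, true);
        return tracer.events;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hi<br></p><!-- note --></body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```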


listAllFiles

private static List listAllFiles(File aFile,
                                 Set foldersToIgnore)
Given a folder, recursively lists all the files contained in it. Returns a list of strings representing the file names.


listFilesRec

private static void listFilesRec(File aFile,
                                 List fileNames,
                                 List foldersToExplore,
                                 Set foldersToIgnore)
Helper method for listAllFiles
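
A directory walk of the kind listAllFiles and listFilesRec describe might look like the sketch below. It is an assumption-laden illustration, not the GATE code: the class FileLister is hypothetical, the ignore set is assumed to hold bare folder names (the original may use paths), the returned strings are assumed to be absolute paths, and an explicit work list of folders to explore stands in for recursion.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical example class; not part of gate.util.
public class FileLister {

    // Returns the absolute paths of all regular files below root,
    // skipping any folder whose name appears in foldersToIgnore.
    public static List<String> listAllFiles(File root, Set<String> foldersToIgnore) {
        List<String> fileNames = new ArrayList<>();
        Deque<File> foldersToExplore = new ArrayDeque<>();
        foldersToExplore.push(root);

        while (!foldersToExplore.isEmpty()) {
            File folder = foldersToExplore.pop();
            File[] children = folder.listFiles();
            if (children == null) {
                continue; // not a directory, or not readable
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    if (!foldersToIgnore.contains(child.getName())) {
                        foldersToExplore.push(child);
                    }
                } else {
                    fileNames.add(child.getAbsolutePath());
                }
            }
        }
        return fileNames;
    }
}
```

A caller interested only in HTML files could then filter the result, for example by keeping the names that end in ".html".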


main

public static void main(String[] args)
Extract links from all .html files below a directory