GATE.ac.uk - gatewiki/nutch-solr/nutch/README.txt

#changes to org.apache.nutch.parse.ParseText.java

If special characters such as <, >, &, " and ' are indexed without being
encoded, they cause problems when displayed on the search results page.  Nutch
parsers such as HTMLParser and XMLParser encode these characters correctly.
However, when such characters appear in other content, for example inside a
Word document, they are not encoded automatically.

Therefore, they have to be encoded explicitly.  I could not find any way to do
this other than changing one line in the org.apache.nutch.parse.ParseText.java
file in the Nutch sources.

In the constructor where it says:

  public ParseText(String text){
    this.text = text;
  }


please replace it with the following code:

  public ParseText(String text){
    // before we store the text, handle the special characters
    this.text = StringEscapeUtils.escapeHtml(StringEscapeUtils.unescapeHtml(text));
  }

Make sure you add "import org.apache.commons.lang.StringEscapeUtils;" to the
import statements of the file.  Then rebuild all three files (.job, .war and
.jar) and put them inside gatewiki/nutch-solr/nutch/, replacing the respective
files.

This code makes sure that, for HTML pages whose entities are already encoded,
the entities are unescaped first, so that all special characters are then
encoded exactly once.
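
To illustrate why the text is unescaped before it is escaped, here is a small
self-contained sketch.  It does not use commons-lang: the class EscapeSketch
and its escape, unescape and normalize methods are hypothetical, simplified
stand-ins that only handle <, >, & and ", whereas the real StringEscapeUtils
methods cover many more entities.  The sketch only demonstrates the ordering
idea, not the library's behaviour in full.

```java
public class EscapeSketch {

    // Simplified stand-in for an HTML escaper (assumption: four characters only).
    // '&' is replaced first so the ampersands of the new entities are left alone.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Simplified stand-in for an HTML unescaper.
    // "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" and not "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // Same ordering as in the patched constructor: unescape first, then escape.
    static String normalize(String s) {
        return escape(unescape(s));
    }

    public static void main(String[] args) {
        String raw = "a < b & c";             // e.g. text taken from a Word document
        String encoded = "a &lt; b &amp; c";  // e.g. text taken from an HTML page

        System.out.println(normalize(raw));      // a &lt; b &amp; c
        System.out.println(normalize(encoded));  // a &lt; b &amp; c
        System.out.println(escape(encoded));     // a &amp;lt; b &amp;amp; c
    }
}
```

Both inputs end up in the same encoded form.  Escaping alone, as in the last
line of main, would encode the already encoded text a second time.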