#changes to org.apache.nutch.parse.ParseText.java

If special characters such as <, >, &, " and ' are indexed without being encoded, they cause problems when displayed on the search result page. Nutch parsers such as HTMLParser and XMLParser encode these characters correctly. However, when such characters appear inside a Word document, for example, they are not encoded automatically, so they have to be encoded explicitly. I could not find any way to do this other than changing one line in the org.apache.nutch.parse.ParseText.java file in the Nutch sources.

In the constructor, where it says:

    public ParseText(String text){
      this.text = text;
    }

please replace it with the following code:

    public ParseText(String text){
      // before we store this, let's handle the special characters
      this.text = StringEscapeUtils.escapeHtml(StringEscapeUtils.unescapeHtml(text));
    }

Please make sure you add "import org.apache.commons.lang.StringEscapeUtils;" as an import statement.

Please rebuild all three files (.job, .war and .jar) and put them inside gatewiki/nutch-solr/nutch/, replacing the respective files.

This code makes sure that content from HTML pages, in which entities are already encoded, is unescaped first, and that all special characters are then encoded properly, so that nothing ends up encoded twice.
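The unescape-then-escape order can be illustrated with a minimal, self-contained sketch. It does not use Commons Lang: the class EscapeSketch and its methods are hypothetical stand-ins for StringEscapeUtils that cover only &, <, > and ", so that the example runs without extra jars. The apostrophe is left untouched here; the Commons Lang 2.x Javadoc appears to say that escapeHtml does not emit &apos;, so it may be worth checking how ' is handled in your build.

```java
// Hypothetical, simplified stand-in for StringEscapeUtils (assumption:
// only the entities for &, <, > and " matter for this illustration).
class EscapeSketch {

    // Decode the four entities. "&amp;" is decoded last so that
    // "&amp;lt;" becomes the literal text "&lt;" rather than "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // Encode the four characters. "&" is encoded first so that the
    // ampersands introduced by the later replacements stay intact.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Same shape as the patched constructor body: unescape, then escape.
    static String normalize(String text) {
        return escape(unescape(text));
    }

    public static void main(String[] args) {
        // Raw text, e.g. extracted from a Word document.
        System.out.println(normalize("a < b & c"));
        // Text whose entities are already encoded, e.g. from an HTML page.
        System.out.println(normalize("a &lt; b &amp; c"));
        // Escaping alone would double-encode the already-encoded input.
        System.out.println(escape("a &lt; b"));
    }
}
```

Both calls to normalize yield the same encoded string, whereas calling escape alone on already-encoded input turns "&lt;" into "&amp;lt;", which is the double encoding that the unescape step avoids.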