Closed
Labels
duplicate: This is a duplicate issue or root-cause of another issue
Description
Certain Unicode characters are prohibited by the XML spec. I've written the following method, which should strip these characters from a document:
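import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String cleanHtml(String source) {
    Document document = Jsoup.parse(source);
    // Serialize using XML syntax rather than HTML syntax
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}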
If I test this using the following HTML input
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table>
<tbody>
<tr>
<td>Field Value</td>
<td>before 	  after</td>
</tr>
</tbody>
</table>
</body>
</html>
The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add
<td>before 	  after</td>
then parsing the String returned by cleanHtml as XML throws the following exception:
org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
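As a possible workaround until this is handled in the library, the serialized output could be post-filtered against the Char production of the XML 1.0 spec (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The sketch below is not part of jsoup; the class and method names (XmlCharFilter, isLegalXmlChar, stripIllegalXmlChars) are hypothetical. It assumes the offending characters appear as literal characters in the output, as the exception above suggests, and it would not catch them if they were written as numeric character references.

```java
// Hypothetical helper, not a jsoup API: drops every code point that the
// XML 1.0 Char production does not allow.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with all illegal XML 1.0 code points removed. */
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() keeps valid surrogate pairs together; unpaired
        // surrogates come through as 0xD800-0xDFFF and are filtered out.
        text.codePoints()
            .filter(XmlCharFilter::isLegalXmlChar)
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

With this in place, cleanHtml could return XmlCharFilter.stripIllegalXmlChars(document.html()) instead of document.html(). Filtering by code point rather than by char keeps supplementary characters such as emoji intact.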