Remove illegal XML characters when converting HTML to XML #887

@donalmurtagh

Certain Unicode characters are prohibited by the XML spec. I've written the following method, which should strip these characters from a document:

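import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String cleanHtml(String source) {
    // Parse the input as HTML, then serialize it using XML syntax rules
    Document document = Jsoup.parse(source);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}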
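
For anyone reproducing this, one way to test the String returned by cleanHtml is to hand it to the JDK's built-in XML parser and see whether it is accepted. The following is only a sketch using the standard JAXP API; the class name XmlCheck and the method isWellFormed are hypothetical names chosen for illustration and are not part of jsoup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlCheck {
    // Hypothetical helper: returns true if the JDK's XML parser accepts the text,
    // false if it rejects it with a SAXException (e.g. SAXParseException).
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a well-formedness problem; surface it to the caller.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling XmlCheck.isWellFormed(cleanHtml(html)) would then report whether the serialized document survives an XML parse.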

If I test this using the following HTML input

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>

<table>
    <tbody>
    <tr>
        <td>Field Value</td>
        <td>before &#9;&#10;&#12; after</td>
    </tr>
    </tbody>
</table>

</body>
</html>

The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add &#11; (vertical tab)

<td>before &#9;&#10;&#12;&#11; after</td>

then parsing the String returned by cleanHtml as XML throws the following exception

org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
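
For reference, the XML 1.0 spec defines legal characters through its Char production: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. By that definition &#9; and &#10; are legal, while &#11; (0xB) and &#12; (0xC) are not. Until this is handled in the library, a possible workaround is to filter the String yourself. The sketch below is not jsoup's implementation; XmlChars, isLegalXmlChar and stripIllegalXmlChars are hypothetical names, and it assumes the offending characters appear as raw characters in the output rather than as escaped references.

```java
public class XmlChars {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of the text with every code point outside the
    // Char production dropped (unpaired surrogates are dropped too).
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
                .filter(XmlChars::isLegalXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Something like XmlChars.stripIllegalXmlChars(cleanHtml(source)) could be applied before handing the result to the XML parser.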

Labels: duplicate (this is a duplicate issue or root-cause of another issue)
