Remove illegal XML characters when converting HTML to XML

Certain Unicode characters are prohibited by the [XML spec](https://www.w3.org/TR/xml/#charsets). I've written the following method, which should remove these characters from a document:
```java
String cleanHtml(String source) {
    Document document = Jsoup.parse(source);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}
```

If I test this using the following HTML input

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>

<table>
    <tbody>
    <tr>
        <td>Field Value</td>
        <td>before &#9;&#10;&#12; after</td>
    </tr>
    </tbody>
</table>

</body>
</html>
```

The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add `&#11;`
```html
<td>before &#9;&#10;&#12;&#11; after</td>
```
then the String returned by `cleanHtml` throws the following exception when parsed as XML

> org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
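
For illustration, here is a minimal sketch of the kind of filtering I would expect, written against the `Char` production in the XML 1.0 spec linked above. The class and method names (`XmlCharFilter`, `isLegalXmlChar`, `stripIllegalXmlChars`) are hypothetical and not part of jsoup, and the sketch assumes the offending characters show up as raw characters in the returned string rather than as numeric character references.

```java
// Hypothetical helper, not part of jsoup: drops code points that the
// XML 1.0 "Char" production does not allow.
final class XmlCharFilter {

    // XML 1.0 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Iterates by code point so that valid surrogate pairs are kept
    // together and unpaired surrogates are dropped.
    static String stripIllegalXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlCharFilter::isLegalXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Under that assumption, calling `XmlCharFilter.stripIllegalXmlChars(cleanHtml(source))` would drop the `0xb` character from the failing example while leaving tab and line feed untouched. This only works around the problem after serialization; it does not change what jsoup itself emits.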



 

Remove illegal XML characters when converting HTML to XML #887
