Invalid XML characters in output with Syntax.xml

### Background
I might be wrong, but I believe this is a regression of issue #887, or more likely that issue was never really solved in the first place. Note that in #1556, which was closed by the same PR, the XML document is in version _1.1_. XML _1.1_ does permit character references in the range `[#x1-#x8] | [#xB-#xC] | [#xE-#x1F]`. XML _1.0_, on the other hand, does not permit ASCII control characters such as `&#xc;`. An XML parser should raise an error (and most of them actually do) if those characters appear in a document whose XML declaration says version _1.0_. This holds whether the character is included literally (e.g. `"\u000c"`) or as a character reference (e.g. `"&#xc;"`). Furthermore, if I am not mistaken, [XHTML](https://www.w3.org/TR/xhtml1) is _a reformulation of HTML 4 in XML 1.0_, not 1.1. Finally, the [w3c validator](https://validator.w3.org/check) raises an error for every variant of XHTML if the input includes the character reference `&#xc;`.
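To make the parser claim concrete, here is a minimal sketch that feeds small XML 1.0 documents to the JDK's built-in parser and reports whether they are accepted. The class name `Xml10ControlCharCheck` and the helper `isWellFormed` are made up for illustration and are not part of jsoup; other parsers may word their errors differently.

``` java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of jsoup.
class Xml10ControlCharCheck {

    /** Returns true if the JDK's default XML parser accepts the document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations (such as a forbidden character) end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\"?>";
        System.out.println(isWellFormed(decl + "<r>AAA BBB</r>"));      // plain text
        System.out.println(isWellFormed(decl + "<r>AAA&#xc;BBB</r>"));  // form feed as a character reference
        System.out.println(isWellFormed(decl + "<r>AAA\fBBB</r>"));     // form feed as a raw character
    }
}
```

Only the first document should be accepted; the other two contain U+000C, which is outside the `Char` production of XML 1.0.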

If this is considered out of scope for jsoup, that's fine, but developers should know that the output might not be a valid XML document even when `OutputSettings.syntax` is `Syntax.xml`. Another option would be to discard those characters when `Syntax.xml` is used. To be honest, it's not clear whether `Syntax.xml` refers to XML version 1.0 or 1.1. The code of [Entities.java](https://github.com/jhy/jsoup/blob/2a4c9decd617e7892fe767a535803b68d2268dca/src/main/java/org/jsoup/nodes/Entities.java#L223), though, references the XML _1.0_ spec. Maybe the constant should have been named `Syntax.xhtml` rather than `Syntax.xml`. Either way, if one had to choose between the two XML versions, _1.0_ is probably the more coherent choice, since jsoup is an (X)HTML parser, not an XML one.
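As a rough illustration of the "discard" option, the sketch below filters a decoded string down to the characters allowed by the `Char` production of XML 1.0. The class `Xml10Sanitizer` and its methods are hypothetical names, not existing jsoup API, and where such a filter would hook into the serializer is left open.

``` java
// Hypothetical helper, not part of jsoup.
class Xml10Sanitizer {

    /** XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 forbids, keeping everything else unchanged. */
    static String stripInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
                .filter(Xml10Sanitizer::isXml10Char)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

For example, `Xml10Sanitizer.stripInvalid("AAA\fBBB\u000eCCC")` yields `"AAABBBCCC"`, while tabs, line feeds and carriage returns pass through untouched. Replacing the forbidden characters with a space instead of dropping them would be an equally plausible policy.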

I can help prepare a PR if someone can decide what the expected behaviour should be.

### Desired behaviour
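``` java
@Test
public void invalidCharactersDiscardedInXml() {
    // Form feed (U+000C), once as a character reference and once as a raw character.
    String invalid = "AAA&#xc;BBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    // Serialize only the body, so the regex below is matched against the text and not the whole document.
    String cleaned = doc.outputSettings(settings).body().html();
    Assert.assertFalse(cleaned.contains("\f"));
    Assert.assertFalse(cleaned.contains("&#xc;"));
    // Accept either removal or replacement with spaces.
    Assert.assertTrue(cleaned.matches("AAA\\ *BBB\\ *CCC"));
}
```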

A subtle detail in the test above: if `prettyPrint` is true, `&#xc;` is replaced with a space, but other control characters forbidden in XML 1.0 documents (e.g. `&#xe;`) are not. So enabling pretty printing does not really solve the underlying problem.

### Observed behaviour

``` java
@Test
public void invalidCharactersNotDiscardedInXml() {
    String invalid = "AAA&#xc;BBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).body().html();
    // Both the escaped and the raw form feed are currently emitted as "&#xc;" (expected value first, actual second).
    Assert.assertEquals("AAA&#xc;BBB&#xc;CCC", cleaned);
}
```
