rdeltour
Member

On rare occasions, decoding UTF-8 documents caused a fatal RSC-016 error (`Invalid byte 2 of 4-byte UTF-8 sequence.`).

This was likely due to a bug in the decoding component of the Xerces XML parser; see https://issues.apache.org/jira/browse/XERCESJ-1668

As a workaround, we now read documents using the Java built-in UTF-8 decoder instead of Xerces's own decoder, by creating the SAX parsers' InputSource from an InputStreamReader instead of the raw InputStream.
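The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the class name `Utf8InputSourceSketch` and the helpers `utf8Source` and `parseText` are hypothetical, and the parser is whatever `SAXParserFactory` returns by default. One consequence worth keeping in mind is that a SAX parser given a character stream reads it directly and disregards the encoding named in the XML declaration, so this approach assumes the input really is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: names are illustrative, not taken from the project.
final class Utf8InputSourceSketch {

    // Wrap the raw byte stream in a Reader so that the JDK's UTF-8 decoder
    // turns bytes into characters before the XML parser sees them.
    static InputSource utf8Source(InputStream in) {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        // An InputSource built from a character stream bypasses the
        // parser's own byte-level decoding.
        return new InputSource(reader);
    }

    // Parse the document and return its concatenated text content.
    static String parseText(byte[] xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        StringBuilder text = new StringBuilder();
        factory.newSAXParser().parse(
            utf8Source(new ByteArrayInputStream(xml)),
            new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        return text.toString();
    }
}
```

Note that `InputStreamReader` replaces malformed byte sequences with U+FFFD instead of raising an error, so a stricter variant might build the reader from a `CharsetDecoder` configured with `CodingErrorAction.REPORT`.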

Fixes #1548

rdeltour added the `status: ready to merge` label on Dec 23, 2024
rdeltour added this to the "Next maintenance release" milestone on Dec 23, 2024
rdeltour self-assigned this on Dec 23, 2024
Base automatically changed from fix/1546/undefined-entities to main December 23, 2024 10:03
rdeltour
Member Author

Also fixes #1554

rdeltour merged commit 90e87b2 into main on Dec 29, 2024
5 checks passed
rdeltour deleted the fix/1548/invalid-utf8-sequence branch on December 29, 2024 00:00
Linked issue: Unjustified error message "Invalid byte 2 of 4-byte UTF-8 sequence"