Handling of invalid standalone encoded bytes `\x80` to `\xFF` #345

Oniguruma docs simply state: "Do not pass invalid byte string in the regex character encoding."

Based on my testing, this is how invalid standalone encoded bytes behave:

- Standalone `\x80` to `\xBF` throw the error "invalid code point value".
- Standalone `\xC0` to `\xF4` throw the error "too short multibyte code string".
- Standalone `\xF5` to `\xFF` fail to match anything, but don't throw. This feels like a bug.
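The three ranges above line up with the structural roles that single bytes have in UTF-8, which may explain why the errors differ. The sketch below is only an illustration, assuming the pattern is interpreted as UTF-8 (not stated explicitly above). `utf8_role` and `REPORTED` are hypothetical names, not part of Oniguruma, and the behaviors in `REPORTED` are just the observations from this issue.

```python
def utf8_role(byte: int) -> str:
    """Return the structural role of a single byte value in UTF-8."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte value in 0..255")
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        # 10xxxxxx: only meaningful after a lead byte.
        return "continuation"
    if byte <= 0xF4:
        # Starts a multibyte sequence, so a lone one is "too short".
        # (0xC0 and 0xC1 could only ever start overlong forms.)
        return "lead"
    # 0xF5..0xFF never appear in well-formed UTF-8.
    return "never-valid"


# Observed behavior from this issue (Oniguruma 6.9.8 via vscode-oniguruma),
# for a standalone byte escape outside a character class range.
REPORTED = {
    "continuation": 'error "invalid code point value"',
    "lead": 'error "too short multibyte code string"',
    "never-valid": "no error, but never matches",
}

for b in (0x80, 0xBF, 0xC0, 0xF4, 0xF5, 0xFF):
    role = utf8_role(b)
    print(f"\\x{b:02X}: {role:12} -> {REPORTED[role]}")
```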

The behavior changes if an invalid standalone encoded byte is used as the end value of a character class range (due to option `ONIG_SYN_ALLOW_INVALID_CODE_END_OF_RANGE_IN_CC`):

- Standalone `\x80` to `\xBF` are treated as `\x7F`.
- Standalone `\xC0` to `\xF4` throw error "too short multibyte code string". This feels like a bug.
- Standalone `\xF5` to `\xFF` are treated as `\x7F`.
  - If the range is within a *negated*, *non-nested* character class (e.g., `[^\0-\xFF]`), `\xF5` to `\xFF` are handled like `\x{10FFFF}`. This feels like a bug. For example, `[^\x01-\xFF]` matches only `\x00`. The bug can be worked around by nesting the class. For example, `[[^\0-\xFF]]` matches `\x80` and higher.
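To make the range-end rules above easier to check against other versions, here is a small model of the reported behavior. It does not call Oniguruma; it only encodes the observations listed in this issue, and `reported_range_end` and its `negated_top_level` parameter are hypothetical names chosen for illustration.

```python
def reported_range_end(byte: int, negated_top_level: bool = False) -> int:
    """Code point reportedly used when an invalid byte escape ends a class range.

    Models the observations in this issue only; raises ValueError where the
    issue reports that Oniguruma throws.
    """
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("expected an invalid standalone byte in 0x80..0xFF")
    if 0xC0 <= byte <= 0xF4:
        # Reported as still throwing, even as the end of a range.
        raise ValueError("too short multibyte code string")
    if byte >= 0xF5 and negated_top_level:
        # Reported quirk: negated, non-nested classes treat it as U+10FFFF.
        return 0x10FFFF
    # 0x80..0xBF always, and 0xF5..0xFF otherwise, are clamped to 0x7F.
    return 0x7F


# [^\x01-\xFF]: the range becomes 0x01..0x10FFFF, so only \x00 is left to match.
print(hex(reported_range_end(0xFF, negated_top_level=True)))

# [[^\0-\xFF]]: the nested class clamps to 0x7F, so \x80 and higher match.
print(hex(reported_range_end(0xFF, negated_top_level=False)))
```

Under this model, the workaround of nesting the class works because the inner class no longer counts as non-nested, so the end value falls back to `\x7F`.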

Disclosure: I'm testing using Oniguruma 6.9.8 via vscode-oniguruma. However, the release notes for subsequent versions don't mention any related changes.
