This repository was archived by the owner on Apr 24, 2025. It is now read-only.
Oniguruma docs simply state: "Do not pass invalid byte string in the regex character encoding."
Based on my testing, invalid standalone encoded bytes behave as follows:
Standalone \x80 to \xBF throw the error "invalid code point value".
Standalone \xC0 to \xF4 throw the error "too short multibyte code string".
Standalone \xF5 to \xFF fail to match anything, but don't throw. This feels like a bug.
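For reference, the three cases above can be written down as a small lookup. This is only a sketch that restates my observations; it does not call Oniguruma, and the names `StandaloneOutcome` and `standaloneByteOutcome` are made up for illustration.

```typescript
// Hypothetical model of the observed behavior for a standalone invalid byte.
// It encodes the observations from this report, not Oniguruma's implementation.
type StandaloneOutcome =
  | { kind: "error"; message: string }
  | { kind: "no-match" };

function standaloneByteOutcome(byte: number): StandaloneOutcome | null {
  // Bytes below 0x80 are plain ASCII and are not covered by this report.
  if (!Number.isInteger(byte) || byte < 0x80 || byte > 0xff) {
    return null;
  }
  if (byte <= 0xbf) {
    // \x80 to \xBF: UTF-8 continuation bytes with no lead byte.
    return { kind: "error", message: "invalid code point value" };
  }
  if (byte <= 0xf4) {
    // \xC0 to \xF4: lead bytes with no continuation bytes after them.
    return { kind: "error", message: "too short multibyte code string" };
  }
  // \xF5 to \xFF: never valid in UTF-8; observed to silently match nothing.
  return { kind: "no-match" };
}
```

The last branch is the one that feels like a bug: it is the only range that fails silently instead of throwing.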
The behavior changes if an invalid standalone encoded byte is used as the end value of a character class range (due to the option ONIG_SYN_ALLOW_INVALID_CODE_END_OF_RANGE_IN_CC):
Standalone \x80 to \xBF are treated as \x7F.
Standalone \xC0 to \xF4 throw the error "too short multibyte code string". This feels like a bug.
Standalone \xF5 to \xFF are treated as \x7F.
If the range is within a negated, non-nested character class (e.g., [^\0-\xFF]), \xF5 to \xFF are instead handled like \x{10FFFF}. This feels like a bug. For example, [^\x01-\xFF] matches only \x00. The bug can be worked around by nesting the class: [[^\0-\xFF]] matches \x80 and higher, because \xFF is then treated as \x7F.
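To make the range-end cases easier to follow, here is a rough model of how the end value appears to be resolved, and why nesting changes the result. Again, this only restates the observations above and does not call Oniguruma; `effectiveRangeEnd` and `negatedRangeMatches` are hypothetical helper names.

```typescript
// Hypothetical model of the observed range-end handling in a character class.
const MAX_CODE_POINT = 0x10ffff;

// Returns the code point that the range end appears to resolve to.
// `negatedNonNested` is true for a class like [^a-b], false for [[^a-b]] or [a-b].
function effectiveRangeEnd(endByte: number, negatedNonNested: boolean): number {
  if (endByte >= 0xc0 && endByte <= 0xf4) {
    // Lead bytes still throw, even as a range end.
    throw new Error("too short multibyte code string");
  }
  if (endByte >= 0xf5 && endByte <= 0xff && negatedNonNested) {
    // Observed special case: handled like \x{10FFFF}.
    return MAX_CODE_POINT;
  }
  if (endByte >= 0x80 && endByte <= 0xff) {
    // \x80 to \xBF, and \xF5 to \xFF outside the special case, act as \x7F.
    return 0x7f;
  }
  return endByte; // Valid ASCII end value.
}

// Does the negated range [^start-endByte] match `codePoint` under this model?
function negatedRangeMatches(
  start: number,
  endByte: number,
  nested: boolean,
  codePoint: number
): boolean {
  const end = effectiveRangeEnd(endByte, !nested);
  return !(codePoint >= start && codePoint <= end);
}

// [^\x01-\xFF]: the range becomes \x01-\x{10FFFF}, so only \x00 is left.
console.log(negatedRangeMatches(0x01, 0xff, false, 0x00)); // true
console.log(negatedRangeMatches(0x01, 0xff, false, 0x80)); // false

// [[^\0-\xFF]]: the range becomes \0-\x7F, so \x80 and higher match.
console.log(negatedRangeMatches(0x00, 0xff, true, 0x80)); // true
console.log(negatedRangeMatches(0x00, 0xff, true, 0x7f)); // false
```

Under this model the only difference between the two classes is which value \xFF resolves to, which is consistent with the nesting workaround.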
Disclosure: I'm testing using Oniguruma 6.9.8 via vscode-oniguruma. However, the release notes for subsequent versions don't mention any related changes.