This repository was archived by the owner on Apr 24, 2025. It is now read-only.

Handling of invalid standalone encoded bytes \x80 to \xFF #345

@slevithan


Oniguruma docs simply state: "Do not pass invalid byte string in the regex character encoding."

Based on my testing, invalid standalone encoded bytes behave as follows:

  • Standalone \x80 to \xBF throw the error "invalid code point value".
  • Standalone \xC0 to \xF4 throw the error "too short multibyte code string".
  • Standalone \xF5 to \xFF fail to match anything, but don't throw. This feels like a bug.
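
For context, the three ranges above line up with the roles bytes play in UTF-8. The following is a minimal sketch, assuming the regex encoding in use is UTF-8 (the issue does not say so explicitly); `classify_standalone_byte` is a hypothetical helper and not part of Oniguruma or vscode-oniguruma.

```python
def classify_standalone_byte(b: int) -> str:
    """Return the UTF-8 role of a single byte value (0x00-0xFF)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a byte value in 0x00-0xFF")
    if b <= 0x7F:
        return "ascii"          # valid on its own
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if b <= 0xF4:
        # Lead-byte pattern: more bytes must follow. 0xC0/0xC1 could only
        # start overlong forms, but the issue groups them with the leads.
        return "lead"
    return "never-valid"        # 0xF5-0xFF: cannot appear in UTF-8 at all


def is_valid_utf8_alone(b: int) -> bool:
    """True if the single byte decodes as UTF-8 by itself (strict mode)."""
    try:
        bytes([b]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for b in (0x7F, 0x80, 0xBF, 0xC0, 0xF4, 0xF5, 0xFF):
        print(f"\\x{b:02X}: {classify_standalone_byte(b)}, "
              f"valid alone: {is_valid_utf8_alone(b)}")
```

Under this reading, the two error messages plausibly correspond to "continuation byte with no lead" and "lead byte with nothing following", while \xF5 to \xFF fall outside both cases, which may be why they slip through without an error.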

The behavior changes if an invalid standalone encoded byte is used as the end value of a character class range (due to option ONIG_SYN_ALLOW_INVALID_CODE_END_OF_RANGE_IN_CC):

  • Standalone \x80 to \xBF are treated as \x7F.
  • Standalone \xC0 to \xF4 throw error "too short multibyte code string". This feels like a bug.
  • Standalone \xF5 to \xFF are treated as \x7F.
    • Exception: if the range is within a negated, non-nested character class (e.g., [^\0-\xFF]), \xF5 to \xFF are handled like \x{10FFFF} instead. This feels like a bug. For example, [^\x01-\xFF] matches only \x00. The bug can be worked around by nesting the class: [[^\0-\xFF]] matches \x80 and higher.
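
The range-end observations above can be written down as a small model. This is only a sketch of the behavior reported in this issue, not Oniguruma's actual implementation; `reported_range_end` and `negated_class_matcher` are hypothetical names, and the `ValueError` stands in for Oniguruma's error.

```python
MAX_CODE_POINT = 0x10FFFF


def reported_range_end(end_byte: int, negated_top_level: bool) -> int:
    """Effective end of a class range, per the observations in this issue."""
    if 0xC0 <= end_byte <= 0xF4:
        # Reported to throw even as a range end.
        raise ValueError("too short multibyte code string")
    if 0xF5 <= end_byte <= 0xFF and negated_top_level:
        # Reported quirk: handled like \x{10FFFF}.
        return MAX_CODE_POINT
    if 0x80 <= end_byte <= 0xFF:
        # Reported: treated as \x7F.
        return 0x7F
    return end_byte


def negated_class_matcher(start: int, end_byte: int, nested: bool):
    """Model [^start-end] (or [[^start-end]] when nested=True)."""
    end = reported_range_end(end_byte, negated_top_level=not nested)
    excluded = range(start, end + 1)
    return lambda code_point: code_point not in excluded


if __name__ == "__main__":
    top_level = negated_class_matcher(0x01, 0xFF, nested=False)  # [^\x01-\xFF]
    nested = negated_class_matcher(0x00, 0xFF, nested=True)      # [[^\0-\xFF]]
    print(top_level(0x00), top_level(0x80))  # True False
    print(nested(0x7F), nested(0x80))        # False True
```

The model reproduces both examples from the list: the top-level negated class excludes everything from \x01 up to \x{10FFFF} and so matches only \x00, while the nested form excludes only \x00 to \x7F.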

Disclosure: I'm testing using Oniguruma 6.9.8 via vscode-oniguruma. However, the release notes for subsequent versions don't mention any related changes.
