
RegExp processing unicode+ignoreCase of \W is not the same as !\w when matching "S" or "K" #512

@msaboff

Bugs have been reported against both V8 and JSC concerning whether the strings "K" and "S" match the regular expression /\W/ui, i.e. "not a word character" with the unicode and ignoreCase flags set. The submitter of at least one of those bugs has described the observed behavior.

This raises a possible issue with what was intended for case-insensitive Unicode regular expressions. A direct reading of the relevant sections of the ECMAScript® 2016 standard is that both /\W/ui.test("K") and /\W/ui.test("S") should return true, but that may not be what the standard intends.

\W and the related \w are described in section 21.2.2.12 CharacterClassEscape. It states that the \w CharacterClassEscape matches word characters, which consist of the set {a .. z A .. Z 0 .. 9 _}. The \W CharacterClassEscape is the set of all characters not in \w.

The creation of the character class from a character class escape is described in section 21.2.2.9 AtomEscape. In all cases, including \W, the character class is created with the invert flag set to false.

The case folding rules are defined in section 21.2.2.8.2 Runtime Semantics: Canonicalize. It states that for unicode case folding, the table in the file CaseFolding.txt from the Unicode Character Database is consulted and if there are common or simple case folding mappings for a character, that mapping is returned, otherwise the original character is returned.
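As a rough illustration of the lookup just described, here is a minimal sketch of Canonicalize for the unicode + ignoreCase case. The `FOLD_EXCERPT` table and the `canonicalize` and `show` helpers are hypothetical names introduced for this example; the table is only a hand-copied excerpt of the four CaseFolding.txt entries that matter for this issue, not the full database.

```javascript
// Hand-copied excerpt of CaseFolding.txt (status C, "common" mappings).
// Assumption: only the entries relevant to this issue are listed; a real
// implementation would consult the full table.
const FOLD_EXCERPT = new Map([
  [0x0053, 0x0073], // LATIN CAPITAL LETTER S    -> s
  [0x004b, 0x006b], // LATIN CAPITAL LETTER K    -> k
  [0x017f, 0x0073], // LATIN SMALL LETTER LONG S -> s
  [0x212a, 0x006b], // KELVIN SIGN               -> k
]);

// Hypothetical helper modelling Canonicalize(ch) under /ui: return the
// common/simple case folding if the table has one, otherwise ch itself.
function canonicalize(codePoint) {
  return FOLD_EXCERPT.has(codePoint) ? FOLD_EXCERPT.get(codePoint) : codePoint;
}

// Format a code point as U+XXXX for printing.
const show = (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0");

for (const cp of [0x53, 0x17f, 0x4b, 0x212a, 0x21]) {
  console.log(show(cp), "->", show(canonicalize(cp)));
}
```

The point to notice is that "S" and U+017F land on the same canonical value, as do "K" and U+212A, while a character with no mapping (here "!") is returned unchanged.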

The matching rules for character sets are defined in section 21.2.2.8.1 Runtime Semantics: CharacterSetMatcher Abstract Operation. It states in step 1.c. that a character is retrieved from a valid index in the subject string. In step 1.d., that character is canonicalized. In step 1.e.i. or 1.f.i., depending on the invert flag, the result of the Canonicalize() from step 1.d. is compared against the members of the set after each member is processed by Canonicalize(). In step 1.e.i. (invert is false), if there isn't a character in the set whose Canonicalize() result is the same as the result from step 1.d., return failure. In step 1.f.i. (invert is true), if there is a character in the set whose Canonicalize() result is the same as the result from step 1.d., return failure.

Putting these sections together, let's walk through them for the construct /\W/ui.test("S").

The regular expression consists of the character class escape \W, which includes all the characters not in the \w word character class. It also has the unicode and ignoreCase flags set. Matching will happen according to the CharacterSetMatcher Abstract Operation, with the set A being all characters not in the \w class and the invert flag being false. At step 1.c., the 'S' character will be read, and at step 1.d. it will be passed to Canonicalize(). The result, according to CaseFolding.txt, will be 's'. Since invert is false, the matching will proceed to step 1.e.i. There exists one character in set A, \u017f (lower case long s), that also case folds to 's'. Therefore we will not return failure, but will instead match. The same analysis holds for "K", as there exists one character in set A, \u212a (Kelvin symbol), that canonicalizes to 'k'.
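The walk-through can be sketched as a small simulation. This models the spec text as read above rather than calling the engine's own RegExp, so it shows the literal reading and not what any particular engine does. The names `canonicalize`, `wordChars`, `nonWordChars` and `characterSetMatches` are hypothetical, and the folding function is deliberately simplified: it assumes only ASCII letters plus U+017F and U+212A have mappings, which is enough for "S" and "K" but is not the full CaseFolding.txt.

```javascript
// Simple case folding, restricted to what this walk-through needs.
// Assumption: ASCII letters plus the two non-ASCII code points that
// CaseFolding.txt folds onto ASCII letters; everything else maps to itself.
function canonicalize(cp) {
  if (cp >= 0x41 && cp <= 0x5a) return cp + 0x20; // A-Z -> a-z
  if (cp === 0x017f) return 0x73; // LATIN SMALL LETTER LONG S -> s
  if (cp === 0x212a) return 0x6b; // KELVIN SIGN -> k
  return cp;
}

// The \w set from CharacterClassEscape: a-z, A-Z, 0-9 and _.
const wordChars = new Set(
  [..."abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"]
    .map((c) => c.codePointAt(0))
);

// Set A for \W: every code point that is not in \w.
const nonWordChars = new Set();
for (let cp = 0; cp <= 0x10ffff; cp++) {
  if (!wordChars.has(cp)) nonWordChars.add(cp);
}

// Sketch of the per-character test in CharacterSetMatcher (steps 1.d-1.f).
// Returns true on a match, false where the spec says "return failure".
function characterSetMatches(setA, invert, ch) {
  const cc = canonicalize(ch.codePointAt(0)); // step 1.d
  let found = false;
  for (const member of setA) {
    if (canonicalize(member) === cc) {
      found = true;
      break;
    }
  }
  return invert ? !found : found; // steps 1.e.i / 1.f.i
}

console.log(characterSetMatches(nonWordChars, false, "S")); // true (via U+017F)
console.log(characterSetMatches(nonWordChars, false, "K")); // true (via U+212A)
console.log(characterSetMatches(nonWordChars, false, "A")); // false
```

Under this model "A" does not match \W, because no member of set A canonicalizes to 'a'; "S" and "K" differ only because a non-word character happens to fold onto them.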

The issue is whether or not this behavior was intended. Without the unicode flag, \W matches exactly the characters that \w does not. For the unicode behavior to match the non-unicode behavior, the description for producing a character class from a character class escape needs to change so that each upper case character class escape produces a character class identical to that of its lower case equivalent, with the invert flag set to true.
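To make the suggested change concrete, here is a minimal sketch that evaluates \W as the \w set with invert set to true. As before, this is a model of the spec steps and not engine behavior; `canonicalize`, `wordChars`, `characterSetMatches` and `proposedNotWord` are hypothetical names, and the folding function is the same simplified assumption (ASCII letters plus U+017F and U+212A only).

```javascript
// Simplified simple case folding (assumption: not the full CaseFolding.txt).
function canonicalize(cp) {
  if (cp >= 0x41 && cp <= 0x5a) return cp + 0x20; // A-Z -> a-z
  if (cp === 0x017f) return 0x73; // LATIN SMALL LETTER LONG S -> s
  if (cp === 0x212a) return 0x6b; // KELVIN SIGN -> k
  return cp;
}

// The \w set from CharacterClassEscape: a-z, A-Z, 0-9 and _.
const wordChars = new Set(
  [..."abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"]
    .map((c) => c.codePointAt(0))
);

// Sketch of the per-character test in CharacterSetMatcher (steps 1.d-1.f).
function characterSetMatches(setA, invert, ch) {
  const cc = canonicalize(ch.codePointAt(0)); // step 1.d
  let found = false;
  for (const member of setA) {
    if (canonicalize(member) === cc) {
      found = true;
      break;
    }
  }
  return invert ? !found : found; // steps 1.e.i / 1.f.i
}

// Proposed reading: \W uses the \w set with invert = true.
const proposedNotWord = (ch) => characterSetMatches(wordChars, true, ch);

console.log(proposedNotWord("S")); // false
console.log(proposedNotWord("K")); // false
console.log(proposedNotWord("!")); // true
console.log(proposedNotWord("\u017f")); // false
```

In this sketch \W and \w stay exact complements under /ui. One consequence worth noting is that U+017F (and likewise U+212A) no longer matches \W, since it canonicalizes to a character that \w contains.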
