RegExp processing unicode+ignoreCase of \W is not the same as !\w when matching "S" or "K"

Bugs have been reported against both V8 ([bug](https://github.com/nodejs/node/issues/5948)) and JSC ([bug](https://bugs.webkit.org/show_bug.cgi?id=151597)) concerning whether the strings "K" and "S" match the regular expression /\W/ui, i.e. "not a word character" with the unicode and ignoreCase flags set. The submitter of at least one of those bugs describes the behavior [here](https://mathiasbynens.be/notes/es6-unicode-regex#impact-i.).

This raises a possible issue with what was intended for case-insensitive Unicode regular expressions. A direct reading of the relevant sections of the ECMAScript® 2016 standard is that both /\W/ui.test("K") and /\W/ui.test("S") should return true, but that may not be what the standard intends.

\W and the related \w are described in section [21.2.2.12 CharacterClassEscape](https://tc39.github.io/ecma262/2016/#sec-runtime-semantics-iswordchar-abstract-operation). It states that word characters are matched with the \w CharacterClassEscape, which consists of the set {a .. z A .. Z 0 .. 9 _}. The \W CharacterClassEscape is the inverse of \w.

The creation of the character class from a character class escape is described in section [21.2.2.9 AtomEscape](https://tc39.github.io/ecma262/2016/#sec-atomescape).  In all cases, we create a character class with the inverse flag as false.

The case folding rules are defined in section [21.2.2.8.2 Runtime Semantics: Canonicalize](https://tc39.github.io/ecma262/2016/#sec-runtime-semantics-canonicalize-ch).  It states that for unicode case folding, the table in the file CaseFolding.txt from the Unicode Character Database is consulted and if there are common or simple case folding mappings for a character, that mapping is returned, otherwise the original character is returned.
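As a rough illustration of the lookup described above, here is a minimal sketch of Canonicalize() for the unicode + ignoreCase case. The names `SIMPLE_CASE_FOLDING` and `canonicalizeUnicode` are hypothetical, and the table is a hand-copied excerpt holding only the four CaseFolding.txt mappings relevant to this issue, not the full file.

```javascript
// Hypothetical excerpt of the common/simple mappings in CaseFolding.txt.
// Only the four entries relevant to this issue are listed.
const SIMPLE_CASE_FOLDING = new Map([
  [0x0053, 0x0073], // LATIN CAPITAL LETTER S    -> s
  [0x004b, 0x006b], // LATIN CAPITAL LETTER K    -> k
  [0x017f, 0x0073], // LATIN SMALL LETTER LONG S -> s
  [0x212a, 0x006b], // KELVIN SIGN               -> k
]);

// Sketch of Canonicalize(ch) when both ignoreCase and unicode are set:
// return the mapping if one exists, otherwise the original code point.
function canonicalizeUnicode(codePoint) {
  return SIMPLE_CASE_FOLDING.has(codePoint)
    ? SIMPLE_CASE_FOLDING.get(codePoint)
    : codePoint;
}
```

With this excerpt, both "S" and U+017F canonicalize to the same code point, which is what the walkthrough later in this issue depends on.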

The rules for character set matching are defined in section [21.2.2.8.1 Runtime Semantics: CharacterSetMatcher Abstract Operation](https://tc39.github.io/ecma262/2016/#sec-runtime-semantics-charactersetmatcher-abstract-operation). Step 1.c. retrieves a character from a valid index in the subject string. Step 1.d. canonicalizes that character. In step 1.e.i. or 1.f.i., depending on the _invert_ flag, the result of Canonicalize() from step 1.d. is compared against the members of the set after each member is processed by Canonicalize(). In step 1.e.i. (_invert_ is `false`), if no character in the set has a Canonicalize() result equal to the result from step 1.d., return failure. In step 1.f.i. (_invert_ is `true`), if some character in the set has a Canonicalize() result equal to the result from step 1.d., return failure.
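The comparison in steps 1.d. through 1.f. can be sketched for a single input character as follows. This is an illustrative model rather than the spec algorithm itself: `characterSetMatches` and `asciiFold` are hypothetical names, the continuation handling is omitted, and `asciiFold` is a deliberately tiny stand-in for Canonicalize().

```javascript
// Sketch of steps 1.d to 1.f for one input character.
// `set` holds code points; `canonicalize` maps a code point to a code point.
function characterSetMatches(set, invert, ch, canonicalize) {
  const cc = canonicalize(ch); // step 1.d
  let found = false;
  for (const member of set) {
    if (canonicalize(member) === cc) {
      found = true;
      break;
    }
  }
  // Step 1.e.i: invert is false, fail when no member matched.
  // Step 1.f.i: invert is true, fail when some member matched.
  return invert ? !found : found;
}

// Hypothetical stand-in for Canonicalize(): folds only ASCII A-Z to a-z.
const asciiFold = (cp) => (cp >= 0x41 && cp <= 0x5a ? cp + 0x20 : cp);

const setOfLowerA = [0x61];
characterSetMatches(setOfLowerA, false, 0x41, asciiFold); // true
characterSetMatches(setOfLowerA, true, 0x41, asciiFold); // false
```

Note that both the input character and every set member go through the same canonicalization before being compared.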

Putting these sections together, let's walk through the construct /\W/ui.test("S").

The regular expression consists of the character class escape '\W', which includes all the characters not in the \w word character class. It also has the unicode and ignoreCase flags set. Matching will happen according to the CharacterSetMatcher Abstract Operation, with the set A being all characters not in the \w class and the _invert_ flag being `false`. At step 1.d., the 'S' character will be read and passed to Canonicalize(). The result, according to CaseFolding.txt, will be 's'. Since _invert_ is `false`, the matching will proceed to step 1.e.i. There exists one character in set A, \u017f (lower case long s), that will also case fold to 's'. Therefore we will not return failure, but will instead match. The same analysis holds for "K", as there exists one character, \u212a (Kelvin symbol), that will canonicalize to 'k'.
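The walkthrough can be reproduced with a small model of the literal reading. This is only a sketch under stated assumptions: `fold`, `isWordChar`, and `literalNotWordMatches` are hypothetical names, and `fold` covers only ASCII A-Z plus the two special mappings discussed here instead of the full CaseFolding.txt. It does not call the engine's own RegExp implementation.

```javascript
// Partial, hand-written stand-in for the common/simple mappings in CaseFolding.txt.
function fold(cp) {
  if (cp >= 0x41 && cp <= 0x5a) return cp + 0x20; // A-Z -> a-z
  if (cp === 0x017f) return 0x73; // long s -> s
  if (cp === 0x212a) return 0x6b; // Kelvin sign -> k
  return cp;
}

// The \w set: a-z, A-Z, 0-9, and underscore.
function isWordChar(cp) {
  return (
    (cp >= 0x61 && cp <= 0x7a) ||
    (cp >= 0x41 && cp <= 0x5a) ||
    (cp >= 0x30 && cp <= 0x39) ||
    cp === 0x5f
  );
}

// Literal reading: set A is every code point outside \w, and invert is false.
function literalNotWordMatches(str) {
  const cc = fold(str.codePointAt(0)); // step 1.d
  for (let member = 0; member <= 0x10ffff; member++) {
    // step 1.e.i: succeed if some member of A canonicalizes to cc
    if (!isWordChar(member) && fold(member) === cc) return true;
  }
  return false;
}

literalNotWordMatches("S"); // true, via U+017F
literalNotWordMatches("K"); // true, via U+212A
literalNotWordMatches("A"); // false
```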

The issue is whether or not this behavior was intended. Without the unicode flag, \W matches exactly the characters that \w does not match. For the unicode behavior to match the non-unicode behavior, the description for producing a character class from a character class escape needs to change so that each upper case character class escape produces a character class identical to its lower case equivalent, with the _invert_ flag as `true`.
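For comparison, here is a sketch of how the suggested change would behave: \W is modeled as the \w set with _invert_ set to `true`. As before, `fold`, `WORD_CHARS`, and `proposedNotWordMatches` are hypothetical names, and `fold` is a partial stand-in for CaseFolding.txt; this illustrates the proposal as described in this issue and is not spec text.

```javascript
// Partial, hand-written stand-in for the common/simple mappings in CaseFolding.txt.
function fold(cp) {
  if (cp >= 0x41 && cp <= 0x5a) return cp + 0x20; // A-Z -> a-z
  if (cp === 0x017f) return 0x73; // long s -> s
  if (cp === 0x212a) return 0x6b; // Kelvin sign -> k
  return cp;
}

// The \w set as code points.
const WORD_CHARS = [
  ..."abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_",
].map((c) => c.codePointAt(0));

// Proposed reading: \W uses the \w set with invert = true (step 1.f.i).
function proposedNotWordMatches(str) {
  const cc = fold(str.codePointAt(0)); // step 1.d
  const found = WORD_CHARS.some((member) => fold(member) === cc);
  return !found; // fail when some member of \w canonicalizes to cc
}

proposedNotWordMatches("S"); // false
proposedNotWordMatches("K"); // false
proposedNotWordMatches("!"); // true
```

Under this model, \W rejects exactly the characters that \w accepts, which is the non-unicode behavior the paragraph above describes.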

