Flag W should not interfere with Unicode case folding from flag i #349
Description
Currently, flags W and P prevent Unicode case folding (via flag i) from applying to \w and its friends (\b, \p{Word}, [:word:], and their inversions). But this is handled inconsistently, depending on nonintuitive factors. I know that this is by design (this issue was previously discussed in #264 from @tonco-miyazawa), but I want to argue that, in all cases, flags W and P should allow Unicode case folding when flag i is enabled (as long as the whole-pattern modifier (?I) is not enabled).
In other words, \w with flags iW should always match the same characters as [0-9A-Z_a-zſK] (including U+017F "small long s" and U+212A Kelvin). (Of course, the same logic should apply to \w's friends, plus everything affected by flag P, but it's easier to focus on just \w here.)
To describe this in rules:
- Flag W/P should always change \w to match ASCII-only [0-9A-Z_a-z].
- Flag i should always expand ASCII s/S to include ſ and k/K to include K, based on Unicode case folding.
- Flag W/P should not change the above behavior of flag i.
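The three rules above can be modeled in a few lines. This is only a sketch of the proposed behavior, not Oniguruma code: the names `Flags`, `proposedWordClass`, and `matchesProposedWord` are hypothetical, and the fold list is hard-coded to the two characters named in this issue (U+017F and U+212A).

```typescript
// Hypothetical model of the proposed rules; nothing here is Oniguruma API.
interface Flags {
  i: boolean; // case-insensitive matching with Unicode case folding
  W: boolean; // ASCII-only word characters
}

// Rule 1: with flag W, \w is the ASCII-only word set.
const ASCII_WORD = "0-9A-Z_a-z";

// Rule 2: with flag i, add the non-ASCII characters that fold to ASCII
// letters. Assumption: just the two named in this issue.
const FOLDS_TO_ASCII = "\u017F\u212A"; // U+017F folds to s, U+212A folds to k

// Returns the character class that \w should behave like under flag W.
function proposedWordClass(flags: Flags): string {
  if (!flags.W) {
    throw new Error("only the flag W case is modeled here");
  }
  // Rule 3: flag W does not switch off the expansion from flag i.
  return flags.i ? `[${ASCII_WORD}${FOLDS_TO_ASCII}]` : `[${ASCII_WORD}]`;
}

// Tests a single character against the modeled class.
function matchesProposedWord(ch: string, flags: Flags): boolean {
  return new RegExp(`^${proposedWordClass(flags)}$`, "u").test(ch);
}
```

For example, `matchesProposedWord("\u017F", { i: true, W: true })` is true under this model, while the same call with `i: false` is false.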
In addition to the above rules being logical and simple, there is precedent in JavaScript. JavaScript regexes use ASCII-only \w (and \b, etc.). And, when using Unicode-aware case insensitivity (via flags iu or iv), JavaScript's \w is equivalent to [0-9A-Z_a-zſK].
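The JavaScript behavior described above can be checked directly. The helper name `jsWordMatches` is made up for this sketch, and the `RegExp` constructor is used instead of regex literals only so that the snippet does not depend on the compile target. Only flags iu are exercised here; flag v needs a newer runtime.

```typescript
const LONG_S = "\u017F"; // ſ, Latin small letter long s
const KELVIN = "\u212A"; // K, Kelvin sign

// Tests whether JavaScript's \w matches a single character under the given flags.
function jsWordMatches(ch: string, flags: string): boolean {
  return new RegExp("^\\w$", flags).test(ch);
}

const jsResults = {
  longS_u: jsWordMatches(LONG_S, "u"),   // Unicode mode, case-sensitive
  longS_i: jsWordMatches(LONG_S, "i"),   // case-insensitive, not Unicode-aware
  longS_iu: jsWordMatches(LONG_S, "iu"), // Unicode-aware case insensitivity
  kelvin_iu: jsWordMatches(KELVIN, "iu"),
};

console.log(jsResults);
```

Only the iu rows should report a match, which is the same split that the rules in this issue ask for with flags iW.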
Currently, Oniguruma already follows the rules above in some cases. Specifically, it does so when flags (?iW) are used and the \w appears inside a character class whose outermost class is not negated. This is shown in the following match results.
✅ = match
❌ = no match
🚩 = the result is not what I think it should be
Target string: 'ſ' U+017F
-----------------------
s // ❌
(?i)s // ✅
\w // ✅
(?W)\w // ❌
-----------------------
(?iW)\w // ❌ 🚩
(?iW)[\w] // ✅
(?iW)[^\W] // ❌ 🚩
(?iW)[[^\W]] // ✅
(?iW)[^[\W]] // ❌ 🚩
(?iW)[\w&&[^\W]] // ✅
As I mentioned at the top, this inconsistency was previously discussed in #264. And it was documented here:
* (?i) option has no effect on word types (\w, \p{Word}). However, if the word types are used within a character class, it is valid. But, this would only be a concern when word types are used with the (?W) option.
I agree that it is odd that these four results do not match.
However, for word type, it is difficult to have exactly the same behavior because of the particular implementation method.
Therefore, only word type has a special specification for IGNORECASE. But if you are not using the (?W) option, you probably don't need to worry about it.
Fair enough. But two things:
- The behavior is more complicated than just whether \w is in a character class, since the negation status of the outermost class matters, as shown in the examples above. That makes it quite unpredictable for users.
- I don't know, but maybe the assumption before was that the ideal behavior for flag W was to always make \w match [0-9A-Z_a-z], even with flag i. Could it be that having flags iW result in matching [0-9A-Z_a-zſK] would make it easier for the implementation methods to be consistent?
Also, I'm just guessing, but maybe the implementation methods are harder to make consistent because of how Oniguruma also sometimes applies Unicode's SpecialCasing.txt (like ß ↔ ss)? I've filed #351 to see if it's possible to change that behavior, too.