Flag `W` should not interfere with Unicode case folding from flag `i`

Currently, flags `W` and `P` prevent Unicode case folding (via flag `i`) from applying to `\w` and its friends (`\b`, `\p{Word}`, `[:word:]`, and their inversions). But this is handled inconsistently, depending on nonintuitive factors. I know that this is by design (this issue was previously discussed in #264 from @tonco-miyazawa), but I want to argue that, in all cases, flags `W` and `P` should *allow* Unicode case folding when flag `i` is enabled (unless whole-pattern modifier `(?I)` is enabled).

In other words, `\w` with flags `iW` should always match the same characters as `[0-9A-Z_a-zſK]` (including U+017F "small long s" and U+212A Kelvin). (Of course, the same logic should apply to `\w`'s friends, plus everything affected by flag `P`, but it's easier to focus on just `\w` here.)

To describe this in rules:

- Flag `W`/`P` should always change `\w` to match ASCII-only `[0-9A-Z_a-z]`.
- Flag `i` should always expand ASCII `s`/`S` to include `ſ` (U+017F) and `k`/`K` to include `K` (U+212A), based on Unicode case folding.
- Flag `W`/`P` should not change the above behavior of flag `i`.
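
The three rules above can be modeled in a few lines. This is only a sketch of the proposed behavior, not Oniguruma's implementation: the names `FOLDS_TO_ASCII`, `isAsciiWord`, and `proposedWordMatch` are hypothetical, and the fold table is hardcoded to the two non-ASCII characters named in this issue rather than derived from Unicode's case folding data.

```typescript
// Hypothetical model of the proposed rules for `\w` under flag W.
// This is not Oniguruma's implementation.

// The two non-ASCII characters named in this issue, mapped to the ASCII
// letter they case-fold to.
const FOLDS_TO_ASCII: { [ch: string]: string } = {
  "\u017F": "s", // U+017F LATIN SMALL LETTER LONG S
  "\u212A": "k", // U+212A KELVIN SIGN
};

// Rule 1: with flag W, `\w` is ASCII-only.
function isAsciiWord(ch: string): boolean {
  return /^[0-9A-Z_a-z]$/.test(ch);
}

// Rules 2 and 3: flag i widens the ASCII set through case folding,
// and flag W does not suppress that widening.
function proposedWordMatch(ch: string, ignoreCase: boolean): boolean {
  if (isAsciiWord(ch)) return true;
  if (!ignoreCase) return false;
  return Object.prototype.hasOwnProperty.call(FOLDS_TO_ASCII, ch)
    && isAsciiWord(FOLDS_TO_ASCII[ch]);
}

console.log(proposedWordMatch("\u017F", false)); // false: like (?W)\w
console.log(proposedWordMatch("\u017F", true));  // true: (?iW)\w as proposed
```

Under this model, the result for `ſ` depends only on whether flag `i` is on, not on where the `\w` appears in the pattern.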

In addition to the above rules being logical and simple, there is precedent in JavaScript. JavaScript regexes use ASCII-only `\w` (and `\b`, etc.). And, when using Unicode-aware case insensitivity (via flags `iu` or `iv`), JavaScript's `\w` is equivalent to `[0-9A-Z_a-zſK]`.
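
The JavaScript behavior can be checked directly. The sketch below assumes an engine with ES2017-or-later regex semantics (for example, a current Node.js); the helper names `matchesWord` and `countWordCharsBmp` are mine, added for illustration.

```typescript
// Check how JavaScript's ASCII-based `\w` interacts with flags `i` and `u`.
const LONG_S = "\u017F"; // U+017F LATIN SMALL LETTER LONG S
const KELVIN = "\u212A"; // U+212A KELVIN SIGN

// True if `ch` as a whole is matched by `\w` under the given flags.
function matchesWord(ch: string, flags: string): boolean {
  return new RegExp("^\\w$", flags).test(ch);
}

// Count how many BMP code units are matched by `\w` under the given flags.
function countWordCharsBmp(flags: string): number {
  const re = new RegExp("^\\w$", flags);
  let count = 0;
  for (let cp = 0; cp <= 0xffff; cp++) {
    if (re.test(String.fromCharCode(cp))) count++;
  }
  return count;
}

console.log(matchesWord(LONG_S, "u"));  // false: ASCII-only \w
console.log(matchesWord(LONG_S, "i"));  // false: no Unicode case folding without u
console.log(matchesWord(LONG_S, "iu")); // true
console.log(matchesWord(KELVIN, "iu")); // true
console.log(countWordCharsBmp("u"));    // 63
console.log(countWordCharsBmp("iu"));   // 65
```

The two extra characters in the `iu` count are exactly `ſ` and the Kelvin sign, which is the set I'm proposing for `\w` with flags `iW`.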

Oniguruma already follows the rules above in *some* cases: specifically, when flags `(?iW)` are used and the `\w` appears inside a character class whose outermost class is not negated. The following match results show this.

✅ = match
❌ = no match
🚩 = the result is not what I think it should be

```
Target string: 'ſ' U+017F
-----------------------
s                 // ❌
(?i)s             // ✅
\w                // ✅
(?W)\w            // ❌
-----------------------
(?iW)\w           // ❌ 🚩
(?iW)[\w]         // ✅
(?iW)[^\W]        // ❌ 🚩
(?iW)[[^\W]]      // ✅
(?iW)[^[\W]]      // ❌ 🚩
(?iW)[\w&&[^\W]]  // ✅
```

As I mentioned at the top, this inconsistency was previously discussed in #264. And it was documented [here](https://github.com/kkos/oniguruma/blob/09604e72328401a28aab08020b13ffc5ac828833/doc/RE#L290):

> \* (?i) option has no effect on word types (\w, \p{Word}). However, if the word types are used within a character class, it is valid. But, this would only be a concern when word types are used with the (?W) option.

In #264, @kkos [wrote](https://github.com/kkos/oniguruma/issues/264#issuecomment-1166383458):

> I agree that it is odd that these four results do not match.
> However, for word type, it is difficult to have exactly the same behavior because of the particular implementation method.
> Therefore, only word type has a special specification for IGNORECASE. But if you are not using the (?W) option, you probably don't need to worry about it.

Fair enough. But two things:

- The behavior is more complicated than just whether `\w` is in a character class, since the negation status of the outermost class matters, as shown in the examples above. That makes it quite unpredictable for users.
- I don't know, but maybe the assumption before was that the ideal behavior for flag `W` was to always make `\w` match `[0-9A-Z_a-z]`, even with flag `i`. Could it be that having flags `iW` result in matching `[0-9A-Z_a-zſK]` would make it easier for the implementation methods to be consistent?

Also, I'm just guessing, but maybe the implementation methods are harder to make consistent because of how Oniguruma also sometimes applies Unicode's [SpecialCasing.txt](https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt) (like `ß` ↔ `ss`)? I've filed #351 to see if it's possible to change that behavior, too.
