Skip to content

Add fast path for ASCII case folding #159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 10, 2022

Conversation

mszabo-wikia
Copy link
Contributor

One of our production services uses re2j to match several hundred mostly
case-insensitive patterns of varying complexity against text.
We observed that approximately 12% of CPU time was being spent in
toLowerCase() as called from simpleFold(), due to the necessity of doing
at least one character data lookup per Inst.Rune in the common case that
the input rune being examined did not match the instruction.

As a fix, implement a method equalsIgnoreCase() that performs
Unicode-aware case-insensitive comparison between two runes, with a fast
path for the common case where both input runes are ASCII, and use it in
Inst for single-rune case-insensitive comparison. This takes character
data lookups out of the hot path.

The existing re2j benchmarks did not exercise case-insensitive patterns,
so add a new benchmark that executes a mostly ASCII regex pattern on a
text containing a mix of ASCII and Unicode characters (generated using
a Hungarian "lorem ipsum" text generator).

Also add unit tests for the new equality comparison logic.

@mszabo-wikia
Copy link
Contributor Author

I've cherry-picked the newly added benchmarks on top of master as well for a baseline comparison (this is available at https://github.com/mszabo-wikia/re2j/tree/case-insensitive-benchmark). I executed the benchmarks using OpenJDK 8u332 on a 2021 M1 Pro.

Here are the baseline results:

Benchmark                                                 (binary)  (impl)  Mode  Cnt    Score   Error  Units
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true     JDK  avgt    5  190.770 ± 1.810  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true    RE2J  avgt    5  716.234 ± 1.442  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false     JDK  avgt    5  196.788 ± 1.365  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false    RE2J  avgt    5  791.140 ± 2.569  us/op

And the results with the ASCII case folding improvements from this patch:

Benchmark                                                 (binary)  (impl)  Mode  Cnt    Score   Error  Units
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true     JDK  avgt    5  196.541 ± 0.647  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true    RE2J  avgt    5  686.977 ± 1.666  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false     JDK  avgt    5  178.767 ± 0.875  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false    RE2J  avgt    5  699.842 ± 9.484  us/op

// for the likely scenario where both runes are ASCII characters.
static boolean equalsIgnoreCase(int r1, int r2) {
// Runes already match, or one of them is EOF
if (r1 < 0 || r2 < 0 || r1 == r2) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That -1 means EOF should be noted in the doc comment above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 da63ef6

Copy link
Collaborator

@adonovan adonovan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only looked at the production logic, but LGTM.

One of our production services uses re2j to match several hundred mostly
case-insensitive patterns of varying complexity against text.
We observed that approximately 12% of CPU time was being spent in
toLowerCase() as called from simpleFold(), due to the necessity of doing
at least one character data lookup per Inst.Rune in the common case that
the input rune being examined did not match the instruction.

As a fix, implement a method equalsIgnoreCase() that performs
Unicode-aware case-insensitive comparison between two runes, with a fast
path for the common case where both input runes are ASCII, and use it in
Inst for single-rune case-insensitive comparison. This takes character
data lookups out of the hot path.

The existing re2j benchmarks did not exercise case-insensitive patterns,
so add a new benchmark that executes a mostly ASCII regex pattern on a
text containing a mix of ASCII and Unicode characters (generated using
a Hungarian "lorem ipsum" text generator).

Also add unit tests for the new equality comparison logic.

Signed-off-by: Máté Szabó <mszabo@fandom.com>
@mszabo-wikia mszabo-wikia force-pushed the ascii-case-fold-fast-path branch from 784eac3 to da63ef6 Compare July 7, 2022 23:05
@codecov-commenter
Copy link

Codecov Report

Merging #159 (da63ef6) into master (3e685d9) will decrease coverage by 0.07%.
The diff coverage is 78.57%.

@@            Coverage Diff             @@
##           master     #159      +/-   ##
==========================================
- Coverage   89.13%   89.06%   -0.08%     
==========================================
  Files          19       19              
  Lines        3029     3037       +8     
  Branches      613      618       +5     
==========================================
+ Hits         2700     2705       +5     
- Misses        188      189       +1     
- Partials      141      143       +2     
Impacted Files Coverage Δ
java/com/google/re2j/Inst.java 80.85% <50.00%> (-3.47%) ⬇️
java/com/google/re2j/Unicode.java 68.75% <83.33%> (+4.86%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e685d9...da63ef6. Read the comment docs.

@sjamesr sjamesr merged commit dc7d6e5 into google:master Jul 10, 2022
@mszabo-wikia
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants