twpayne
Contributor

@twpayne twpayne commented Mar 19, 2025

Description:

Fixes #1796.

This switches the default regular expression package back to Go's standard regexp package. Users can still opt in to github.com/wasilabs/go-re2 by building with the gore2regex build tag.

See the discussion in #1796. This regexp will likely be a significant performance improvement in all cases except very long-running processes. Specifically, go-re2 has very high initialization and regular expression compilation costs that the standard library does not.
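
For readers unfamiliar with build tags, here is a minimal sketch of how a tag such as gore2regex can select the regexp implementation at compile time. The file layout, the `Regexp` alias, and the `MustCompile` wrapper are illustrative assumptions and not necessarily what this PR does; the go-re2 counterpart is described only in comments.

```go
//go:build !gore2regex

// Hypothetical default file: compiled unless the gore2regex tag is set.
// A sibling file guarded by "//go:build gore2regex" would define the same
// names on top of github.com/wasilabs/go-re2 instead (assumed layout).
package main

import (
	"fmt"
	"regexp"
)

// Regexp aliases the standard library type so callers never import
// a concrete regexp package directly.
type Regexp = regexp.Regexp

// MustCompile forwards to the standard library implementation.
func MustCompile(expr string) *Regexp {
	return regexp.MustCompile(expr)
}

func main() {
	// Example pattern only; not taken from the gitleaks rule set.
	re := MustCompile(`AKIA[0-9A-Z]{16}`)
	fmt.Println(re.MatchString("key = AKIAIOSFODNN7EXAMPLE"))
}
```

With a layout like this, a plain `go build` uses the standard library, while `go build -tags gore2regex` would pick the file guarded by the opposite constraint.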

Checklist:

  • Does your PR pass tests? Yes -- all existing tests pass.
  • Have you written new tests for your changes? No -- they are already covered by existing tests.
  • Have you linted your code locally prior to submission? Yes -- there are lint errors in the current code, but they are unrelated to this PR.

@rgmz
Contributor

rgmz commented Mar 19, 2025

This regexp will likely be a significant performance improvement in all cases except very long-running processes. Specifically, go-re2 has very high initialization and regular expression compilation costs that the standard library does not.

IIRC go-re2 is faster in all cases, at the cost of increased startup time and memory usage.

@twpayne
Contributor Author

twpayne commented Mar 19, 2025

Picking a random line from go-re2's benchmarks:

name \ time/op       build/bench_stdlib.txt   build/bench.txt                           build/bench_cgo.txt
Match/Medium/1M-4    29406989.0n ± ∞ ¹        42947.0n ± ∞ ¹  -99.85% (p=0.008 n=5)     154.6n ± ∞ ¹  -100.00% (p=0.008 n=5)

The standard library takes 29.4ns per medium match, whereas go-re2 takes only 0.043ns per medium match (this is very suspicious: I wonder if there's a problem with their benchmarks). Given that go-re2 has an initialization cost of 160ms, this means that go-re2 will be faster after 160ms/29.4ns = 5,442,176.9 matches.

@rgmz
Contributor

rgmz commented Mar 19, 2025

Given that go-re2 has an initialization cost of 160ms, this means that go-re2 will be faster after 160ms/29.4ns = 5,442,176.9 matches.

I don't know if the calculation is that straightforward; it'd be interesting to factor in near-matches as well.

I'm not against the change; I just want to be clear on the pros/cons. For most users it's a coin flip whether the scan will be short or long(er), depending on the size, type, and contents of the input, so faster startup isn't necessarily always better.

@anuraaga

anuraaga commented Mar 19, 2025

Just to clarify that benchmark: I think you're dividing because of the 1M, but that is the size of the input string, not the number of matches. For a 1MB input, the stdlib takes 29ms vs 42us. The microbenchmarks are contrived examples, though, taken straight from golang/go, and aren't that representative of real-world usage. wafbench is better since it uses real regexes, albeit for a specialized case; for a 10KB input it's 150ms vs 20ms.

Every project should decide about the tradeoff themselves ideally based on real-world data, as the first paragraphs of the README state - given how large I see lockfiles get in many projects, maybe it is here though.

@zricethezav
Collaborator

simple as 🙇🏻

Thanks @twpayne!

@zricethezav zricethezav merged commit 6cc0e38 into gitleaks:master Mar 21, 2025
2 checks passed
@twpayne twpayne deleted the standard-regexp branch March 21, 2025 09:23
sirakav pushed a commit to sirakav/gitleaks that referenced this pull request Apr 25, 2025
alayne222 pushed a commit to alayne222/gitleaks that referenced this pull request May 28, 2025
Development

Successfully merging this pull request may close these issues.

Initialization performance regression due to switch to go-re2