-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add experimental allowlist optimizations #1731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
40cba3e
to
9ce24de
Compare
Also + @ahrav |
f051219
to
b73c7e6
Compare
config/allowlist.go
Outdated
a.commitMap = uniqueCommits | ||
} | ||
|
||
if len(a.Paths) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rgmz I was thinking about doing something similar in my upstream patterns when they're "compiled". It'd be neat to have it baked in too.
So this does give a good performance boost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this does give a good performance boost?
The BenchmarkPathAllowed
and BenchmarkPathNotAllowed
benchmarks look to be 12-18x faster. We'd need to do some real-world testing to see if that carries over in practice.
Do you know of any large repositories with an excessive number of ignore paths? Node modules, Python deps, media files, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't off the top of my head, but it probably wouldn't be to hard to make one through automation.
You could generate a list of ignore paths that you'd like to make and then do something like:
target_number_of_commits = 1000
for target_number_of_commits:
# the files to ignore
allowlist_paths_for_commit = paths.choose(random_int(len(allow_list_example_paths) - 1)
for path in allowlist_paths_for_commit:
prefix = random_path(max_depth=30)
create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024))
# other random files
for i in range(pow(len(allowlist_paths_for_commit), 2)):
prefix = random_path(max_depth=30)
create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024))
commit()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential candidate is mattermost/mattermost-data-warehouse which takes ~45s-1m to scan. (It isn't consistent, and my laptop is generally quite slow.)
Yeah anything that we can do to keep state/caches on the detector would be handy for my use case but I can always tweak things. I load the config once and then have a process that continually scans using gitleaks as a lib. I do create a fresh detector each scan though. |
b73c7e6
to
2c94106
Compare
This is really nifty! Based on my limited understanding of Gitleaks, this seems like a solid use of both the cache and Aho-Corasick. One small consideration—aside from the build-time drawback—is the additional memory overhead required to store the structure. However, from my testing of Aho-Corasick (specifically the library used here), the tradeoff is well worth it. For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots? Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations? P.S: ❤️ the benchmarks |
I suspect it runs through most, however, I don't have any data to support that. It would be interesting to gather. Some users maintain their own config, others (likely most) use the default config. Moreover, the same config can have wildly different performance depending on what it's scanning. Every fragment is evaluated against the global
I do not. Are there any resources you recommend to get started with that? |
Thanks for the quick explanation. 🙇 Based on that, I’d expect these changes to have a noticeable impact. The Go Blog is a great starting point, as is this guide. |
06d47ca
to
83708bd
Compare
83708bd
to
7427792
Compare
@zricethezav I've moved the new logic behind a feature-flag, in case we want to test this "in the field" and compare results. |
bd3f576
to
a087f16
Compare
a087f16
to
25f2d70
Compare
On a large (15GB) repository, this reduced the scanning time from ~46m to ~32m. Before
After
|
25f2d70
to
13b4f91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zricethezav This isn't perfect but it's ready for review.
config/allowlist.go
Outdated
} | ||
} | ||
return false, "" | ||
} | ||
|
||
func (a *Allowlist) Validate() error { | ||
flagOnce.Do(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on an 'optional' call to Validate
to set internal state is bad. A better idea is probably to make a NewAllowlist(...opts)
function, but I didn't want to add noise before this change got reviewed.
13b4f91
to
8051dd6
Compare
75b058f
to
264f2d8
Compare
264f2d8
to
227a4a0
Compare
85683f9
to
ff80727
Compare
this'll be up next after #1872 |
ff80727
to
9005a8b
Compare
9005a8b
to
ee86073
Compare
I think this is good to go. @rgmz wdyt about giving this 2-3 weeks to bake as experimental then making this optimization the default behavior? |
That sounds good to me! What do you think about @bplaxco's suggestion to use a field on Config instead of a global feature flag? |
Sure, that sounds reasonable 👍🏻 |
0568322
to
7ae6fbf
Compare
This comment was marked as resolved.
This comment was marked as resolved.
7ae6fbf
to
3caec4d
Compare
"Crisp and clean, no caffeine" 👌👌 |
Description:
These are experimental changes to allowlist to improve efficiency.
Ideas include the following -- not all of them are implemented yet, or obviously better:
Commits
Regexes
into a single patternPaths
into a single patternStopwords
instead of a sliceCommits
Paths
Regexes
cc @zricethezav @bplaxco
Potential Drawbacks
Off the top of my head, potential drawbacks could be:
Allowlist
struct, could mess with how some people use Gitleaks as a libraryChecklist: