Add experimental allowlist optimizations #1731

rgmz · 2025-01-26T23:05:23Z

Description:

These are experimental changes to allowlist to improve efficiency.

Ideas include the following -- not all of them are implemented yet, or obviously better:

Using a map instead of slices for Commits
Collapsing Regexes into a single pattern
Collapsing Paths into a single pattern
Using ahocorasick for Stopwords instead of a slice
Creating benchmarks of varying sizes to get a more accurate portrayal of performance

Commits

# Before
BenchmarkCommitAllowed
BenchmarkCommitAllowed-4      	 3539479	       335.0 ns/op
BenchmarkCommitNotAllowed
BenchmarkCommitNotAllowed-4   	 2950525	       403.2 ns/op

# After
BenchmarkCommitAllowed
BenchmarkCommitAllowed-4      	16353297	        79.87 ns/op
BenchmarkCommitNotAllowed
BenchmarkCommitNotAllowed-4   	17393956	        69.91 ns/op

Paths

# Before
BenchmarkPathAllowed
BenchmarkPathAllowed-4        	   68570	     15523 ns/op
BenchmarkPathNotAllowed
BenchmarkPathNotAllowed-4     	   41542	     27972 ns/op

# After
BenchmarkPathAllowed
BenchmarkPathAllowed-4        	  922988	      1256 ns/op
BenchmarkPathNotAllowed
BenchmarkPathNotAllowed-4     	  767101	      1474 ns/op

Regexes

# Before
BenchmarkRegexAllowed
BenchmarkRegexAllowed-4       	  100308	     10873 ns/op
BenchmarkRegexNotAllowed
BenchmarkRegexNotAllowed-4    	   38066	     31017 ns/op

# After
BenchmarkRegexAllowed
BenchmarkRegexAllowed-4       	 1243234	       947.4 ns/op
BenchmarkRegexNotAllowed
BenchmarkRegexNotAllowed-4    	  835924	      1572 ns/op

cc @zricethezav @bplaxco

Potential Drawbacks

Off the top of my head, potential drawbacks could be:

Slower startup speed to build an ahocorasick trie for every allowlist
Faster performance for larger allowlists, at the cost of slower or equivalent performance for smaller allowlists
Changes the semantics of the Allowlist struct, could mess with how some people use Gitleaks as a library

Checklist:

Does your PR pass tests?
Have you written new tests for your changes?
Have you lint your code locally prior to submission?

rgmz · 2025-01-26T23:22:42Z

Also + @ahrav

config/allowlist.go

bplaxco · 2025-01-28T14:28:25Z

config/allowlist.go

+		a.commitMap = uniqueCommits
+	}
+
+	if len(a.Paths) > 0 {


@rgmz I was thinking about doing something similar in my upstream patterns when they're "compiled". It'd be neat to have it baked in too.

So this does give a good performance boost?

So this does give a good performance boost?

The BenchmarkPathAllowed and BenchmarkPathNotAllowed benchmarks look to be 12-18x faster. We'd need to do some real-world testing to see if that carries over in practice.

Do you know of any large repositories with an excessive number of ignore paths? Node modules, Python deps, media files, etc.

I don't off the top of my head, but it probably wouldn't be to hard to make one through automation.

You could generate a list of ignore paths that you'd like to make and then do something like:

target_number_of_commits = 1000 for target_number_of_commits: # the files to ignore allowlist_paths_for_commit = paths.choose(random_int(len(allow_list_example_paths) - 1) for path in allowlist_paths_for_commit: prefix = random_path(max_depth=30) create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024)) # other random files for i in range(pow(len(allowlist_paths_for_commit), 2)): prefix = random_path(max_depth=30) create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024)) commit()

One potential candidate is mattermost/mattermost-data-warehouse which takes ~45s-1m to scan. (It isn't consistent, and my laptop is generally quite slow.)

bplaxco · 2025-01-28T14:32:13Z

Allowlist struct, could mess with how some people use Gitleaks as a library

Yeah anything that we can do to keep state/caches on the detector would be handy for my use case but I can always tweak things. I load the config once and then have a process that continually scans using gitleaks as a lib. I do create a fresh detector each scan though.

ahrav · 2025-01-29T03:50:38Z

Also + @ahrav

This is really nifty! Based on my limited understanding of Gitleaks, this seems like a solid use of both the cache and Aho-Corasick. One small consideration—aside from the build-time drawback—is the additional memory overhead required to store the structure. However, from my testing of Aho-Corasick (specifically the library used here), the tradeoff is well worth it.

For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots? Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations?

P.S: ❤️ the benchmarks

rgmz · 2025-01-29T13:40:09Z

For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots?

I suspect it runs through most, however, I don't have any data to support that. It would be interesting to gather.

Some users maintain their own config, others (likely most) use the default config. Moreover, the same config can have wildly different performance depending on what it's scanning. Every fragment is evaluated against the global [allowlist]. Every finding is evaluated against the rule's allowlist, if present (e.g., curl-auth-user), but no rule is guaranteed to hit. Naturally, generic-api-key is most likely to hit, and also has the most complicated allowlist.

Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations?

I do not. Are there any resources you recommend to get started with that?

ahrav · 2025-01-29T22:28:43Z

For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots?

I suspect it runs through most, however, I don't have any data to support that. It would be interesting to gather.

Some users maintain their own config, others (likely most) use the default config. Moreover, the same config can have wildly different performance depending on what it's scanning. Every fragment is evaluated against the global [allowlist]. Every finding is evaluated against the rule's allowlist, if present (e.g., curl-auth-user), but no rule is guaranteed to hit. Naturally, generic-api-key is most likely to hit, and also has the most complicated allowlist.

Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations?

I do not. Are there any resources you recommend to get started with that?

Thanks for the quick explanation. 🙇 Based on that, I’d expect these changes to have a noticeable impact. The Go Blog is a great starting point, as is this guide.

rgmz · 2025-02-15T22:13:15Z

@zricethezav I've moved the new logic behind a feature-flag, in case we want to test this "in the field" and compare results.

rgmz · 2025-03-02T00:40:05Z

On a large (15GB) repository, this reduced the scanning time from ~46m to ~32m.

Before

(1) 2:53PM INF scanned ~15450035936 bytes (15.45 GB) in 46m53s
...
(2) 7:27PM INF scanned ~15450035936 bytes (15.45 GB) in 46m7s

After

(1) 5:25PM INF scanned ~15450035936 bytes (15.45 GB) in 32m58s
...
(2) 6:15PM INF scanned ~15450035936 bytes (15.45 GB) in 32m41s

rgmz

@zricethezav This isn't perfect but it's ready for review.

rgmz · 2025-03-02T16:01:07Z

config/allowlist.go

 		}
 	}
 	return false, ""
 }

 func (a *Allowlist) Validate() error {
+	flagOnce.Do(func() {


Relying on an 'optional' call to Validate to set internal state is bad. A better idea is probably to make a NewAllowlist(...opts) function, but I didn't want to add noise before this change got reviewed.

zricethezav · 2025-05-31T14:18:06Z

this'll be up next after #1872

config/allowlist.go

zricethezav · 2025-06-06T16:21:43Z

I think this is good to go. @rgmz wdyt about giving this 2-3 weeks to bake as experimental then making this optimization the default behavior?

rgmz · 2025-06-06T22:54:57Z

I think this is good to go. @rgmz wdyt about giving this 2-3 weeks to bake as experimental then making this optimization the default behavior?

That sounds good to me! What do you think about @bplaxco's suggestion to use a field on Config instead of a global feature flag?

zricethezav · 2025-06-07T01:21:20Z

That sounds good to me! What do you think about #1731 (comment)?

Sure, that sounds reasonable 👍🏻

bplaxco · 2025-06-08T15:21:10Z

"Crisp and clean, no caffeine" 👌👌

rgmz force-pushed the feat/allowlist-ahocorasick branch 2 times, most recently from 40cba3e to 9ce24de Compare January 26, 2025 23:22

rgmz force-pushed the feat/allowlist-ahocorasick branch 2 times, most recently from f051219 to b73c7e6 Compare January 26, 2025 23:45

bplaxco reviewed Jan 28, 2025

View reviewed changes

config/allowlist.go Outdated Show resolved Hide resolved

bplaxco reviewed Jan 28, 2025

View reviewed changes

rgmz force-pushed the feat/allowlist-ahocorasick branch from b73c7e6 to 2c94106 Compare January 28, 2025 21:42

rgmz force-pushed the feat/allowlist-ahocorasick branch 5 times, most recently from 06d47ca to 83708bd Compare February 15, 2025 22:10

rgmz marked this pull request as ready for review February 15, 2025 22:11

rgmz force-pushed the feat/allowlist-ahocorasick branch from 83708bd to 7427792 Compare February 15, 2025 22:12

rgmz changed the title ~~Experiment: allowlist optimizations~~ Add experimental allowlist optimizations Feb 15, 2025

rgmz force-pushed the feat/allowlist-ahocorasick branch 3 times, most recently from bd3f576 to a087f16 Compare February 15, 2025 22:39

rgmz force-pushed the feat/allowlist-ahocorasick branch from a087f16 to 25f2d70 Compare March 1, 2025 14:56

rgmz force-pushed the feat/allowlist-ahocorasick branch from 25f2d70 to 13b4f91 Compare March 2, 2025 15:55

rgmz commented Mar 2, 2025

View reviewed changes

rgmz mentioned this pull request Mar 19, 2025

Initialization performance regression due to switch to go-re2 #1796

Closed

rgmz force-pushed the feat/allowlist-ahocorasick branch from 13b4f91 to 8051dd6 Compare April 6, 2025 16:44

rgmz force-pushed the feat/allowlist-ahocorasick branch 4 times, most recently from 75b058f to 264f2d8 Compare April 13, 2025 14:55

rgmz force-pushed the feat/allowlist-ahocorasick branch from 264f2d8 to 227a4a0 Compare April 30, 2025 21:28

rgmz force-pushed the feat/allowlist-ahocorasick branch 3 times, most recently from 85683f9 to ff80727 Compare May 19, 2025 15:04

rgmz force-pushed the feat/allowlist-ahocorasick branch from ff80727 to 9005a8b Compare June 1, 2025 17:15

bplaxco reviewed Jun 4, 2025

View reviewed changes

config/allowlist.go Outdated Show resolved Hide resolved

bplaxco reviewed Jun 4, 2025

View reviewed changes

config/allowlist.go Show resolved Hide resolved

bplaxco reviewed Jun 5, 2025

View reviewed changes

config/allowlist.go Outdated Show resolved Hide resolved

config/allowlist.go Show resolved Hide resolved

config/allowlist.go Outdated Show resolved Hide resolved

config/allowlist.go Outdated Show resolved Hide resolved

feat(allowlist): optimization experiments

ee86073

rgmz force-pushed the feat/allowlist-ahocorasick branch from 9005a8b to ee86073 Compare June 6, 2025 01:31

rgmz added 2 commits June 6, 2025 08:42

refactor(allowlist): remove useOptimizations flag

c22982d

refactor(allowlist): simplify Validate() with helper func

d54c1e3

rgmz force-pushed the feat/allowlist-ahocorasick branch from 0568322 to 7ae6fbf Compare June 7, 2025 13:30

This comment was marked as resolved.

Sign in to view

refactor(config): replace global feature flag

3caec4d

rgmz force-pushed the feat/allowlist-ahocorasick branch from 7ae6fbf to 3caec4d Compare June 7, 2025 21:08

zricethezav merged commit 9faaa4a into gitleaks:master Jun 8, 2025
2 checks passed

rgmz deleted the feat/allowlist-ahocorasick branch June 8, 2025 15:42

Uh oh!

Add experimental allowlist optimizations #1731

Add experimental allowlist optimizations #1731

Uh oh!

Conversation

rgmz commented Jan 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description:

Potential Drawbacks

Checklist:

Uh oh!

rgmz commented Jan 26, 2025

Uh oh!

Uh oh!

bplaxco Jan 28, 2025

Choose a reason for hiding this comment

Uh oh!

rgmz Jan 28, 2025

Choose a reason for hiding this comment

Uh oh!

bplaxco Feb 1, 2025

Choose a reason for hiding this comment

Uh oh!

rgmz Feb 15, 2025

Choose a reason for hiding this comment

Uh oh!

bplaxco commented Jan 28, 2025

Uh oh!

ahrav commented Jan 29, 2025

Uh oh!

rgmz commented Jan 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahrav commented Jan 29, 2025

Uh oh!

rgmz commented Feb 15, 2025

Uh oh!

rgmz commented Mar 2, 2025

Uh oh!

rgmz left a comment

Choose a reason for hiding this comment

Uh oh!

rgmz Mar 2, 2025

Choose a reason for hiding this comment

Uh oh!

zricethezav commented May 31, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zricethezav commented Jun 6, 2025

Uh oh!

rgmz commented Jun 6, 2025

Uh oh!

zricethezav commented Jun 7, 2025

Uh oh!

This comment was marked as resolved.

bplaxco commented Jun 8, 2025

Uh oh!

Uh oh!

Uh oh!

rgmz commented Jan 26, 2025 •

edited

Loading

rgmz commented Jan 29, 2025 •

edited

Loading