Skip to content

Conversation

rgmz
Copy link
Contributor

@rgmz rgmz commented Jan 26, 2025

Description:

These are experimental changes to allowlist to improve efficiency.

Ideas include the following -- not all of them are implemented yet, or obviously better:

  • Using a map instead of slices for Commits
  • Collapsing Regexes into a single pattern
  • Collapsing Paths into a single pattern
  • Using ahocorasick for Stopwords instead of a slice
  • Creating benchmarks of varying sizes to get a more accurate portrayal of performance

Commits

# Before
BenchmarkCommitAllowed
BenchmarkCommitAllowed-4      	 3539479	       335.0 ns/op
BenchmarkCommitNotAllowed
BenchmarkCommitNotAllowed-4   	 2950525	       403.2 ns/op

# After
BenchmarkCommitAllowed
BenchmarkCommitAllowed-4      	16353297	        79.87 ns/op
BenchmarkCommitNotAllowed
BenchmarkCommitNotAllowed-4   	17393956	        69.91 ns/op

Paths

# Before
BenchmarkPathAllowed
BenchmarkPathAllowed-4        	   68570	     15523 ns/op
BenchmarkPathNotAllowed
BenchmarkPathNotAllowed-4     	   41542	     27972 ns/op

# After
BenchmarkPathAllowed
BenchmarkPathAllowed-4        	  922988	      1256 ns/op
BenchmarkPathNotAllowed
BenchmarkPathNotAllowed-4     	  767101	      1474 ns/op

Regexes

# Before
BenchmarkRegexAllowed
BenchmarkRegexAllowed-4       	  100308	     10873 ns/op
BenchmarkRegexNotAllowed
BenchmarkRegexNotAllowed-4    	   38066	     31017 ns/op

# After
BenchmarkRegexAllowed
BenchmarkRegexAllowed-4       	 1243234	       947.4 ns/op
BenchmarkRegexNotAllowed
BenchmarkRegexNotAllowed-4    	  835924	      1572 ns/op

cc @zricethezav @bplaxco

Potential Drawbacks

Off the top of my head, potential drawbacks could be:

  • Slower startup speed to build an ahocorasick trie for every allowlist
  • Faster performance for larger allowlists, at the cost of slower or equivalent performance for smaller allowlists
  • Changes the semantics of the Allowlist struct, could mess with how some people use Gitleaks as a library

Checklist:

  • Does your PR pass tests?
  • Have you written new tests for your changes?
  • Have you lint your code locally prior to submission?

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch 2 times, most recently from 40cba3e to 9ce24de Compare January 26, 2025 23:22
@rgmz
Copy link
Contributor Author

rgmz commented Jan 26, 2025

Also + @ahrav

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch 2 times, most recently from f051219 to b73c7e6 Compare January 26, 2025 23:45
a.commitMap = uniqueCommits
}

if len(a.Paths) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rgmz I was thinking about doing something similar in my upstream patterns when they're "compiled". It'd be neat to have it baked in too.

So this does give a good performance boost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this does give a good performance boost?

The BenchmarkPathAllowed and BenchmarkPathNotAllowed benchmarks look to be 12-18x faster. We'd need to do some real-world testing to see if that carries over in practice.

Do you know of any large repositories with an excessive number of ignore paths? Node modules, Python deps, media files, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't off the top of my head, but it probably wouldn't be to hard to make one through automation.

You could generate a list of ignore paths that you'd like to make and then do something like:

target_number_of_commits = 1000
for target_number_of_commits:
   # the files to ignore
   allowlist_paths_for_commit = paths.choose(random_int(len(allow_list_example_paths) - 1)
   for path in allowlist_paths_for_commit:
        prefix = random_path(max_depth=30)
        create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024))

   # other random files
   for i in range(pow(len(allowlist_paths_for_commit), 2)):
       prefix = random_path(max_depth=30)
       create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024))

   commit()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential candidate is mattermost/mattermost-data-warehouse which takes ~45s-1m to scan. (It isn't consistent, and my laptop is generally quite slow.)

@bplaxco
Copy link
Contributor

bplaxco commented Jan 28, 2025

Allowlist struct, could mess with how some people use Gitleaks as a library

Yeah anything that we can do to keep state/caches on the detector would be handy for my use case but I can always tweak things. I load the config once and then have a process that continually scans using gitleaks as a lib. I do create a fresh detector each scan though.

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from b73c7e6 to 2c94106 Compare January 28, 2025 21:42
@ahrav
Copy link
Contributor

ahrav commented Jan 29, 2025

Also + @ahrav

This is really nifty! Based on my limited understanding of Gitleaks, this seems like a solid use of both the cache and Aho-Corasick. One small consideration—aside from the build-time drawback—is the additional memory overhead required to store the structure. However, from my testing of Aho-Corasick (specifically the library used here), the tradeoff is well worth it.

For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots? Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations?

P.S: ❤️ the benchmarks

@rgmz
Copy link
Contributor Author

rgmz commented Jan 29, 2025

For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots?

I suspect it runs through most, however, I don't have any data to support that. It would be interesting to gather.

Some users maintain their own config, others (likely most) use the default config. Moreover, the same config can have wildly different performance depending on what it's scanning. Every fragment is evaluated against the global [allowlist]. Every finding is evaluated against the rule's allowlist, if present (e.g., curl-auth-user), but no rule is guaranteed to hit. Naturally, generic-api-key is most likely to hit, and also has the most complicated allowlist.

Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations?

I do not. Are there any resources you recommend to get started with that?

@ahrav
Copy link
Contributor

ahrav commented Jan 29, 2025

For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots?

I suspect it runs through most, however, I don't have any data to support that. It would be interesting to gather.

Some users maintain their own config, others (likely most) use the default config. Moreover, the same config can have wildly different performance depending on what it's scanning. Every fragment is evaluated against the global [allowlist]. Every finding is evaluated against the rule's allowlist, if present (e.g., curl-auth-user), but no rule is guaranteed to hit. Naturally, generic-api-key is most likely to hit, and also has the most complicated allowlist.

Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations?

I do not. Are there any resources you recommend to get started with that?

Thanks for the quick explanation. 🙇 Based on that, I’d expect these changes to have a noticeable impact. The Go Blog is a great starting point, as is this guide.

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch 5 times, most recently from 06d47ca to 83708bd Compare February 15, 2025 22:10
@rgmz rgmz marked this pull request as ready for review February 15, 2025 22:11
@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from 83708bd to 7427792 Compare February 15, 2025 22:12
@rgmz
Copy link
Contributor Author

rgmz commented Feb 15, 2025

@zricethezav I've moved the new logic behind a feature-flag, in case we want to test this "in the field" and compare results.

@rgmz rgmz changed the title Experiment: allowlist optimizations Add experimental allowlist optimizations Feb 15, 2025
@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch 3 times, most recently from bd3f576 to a087f16 Compare February 15, 2025 22:39
@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from a087f16 to 25f2d70 Compare March 1, 2025 14:56
@rgmz
Copy link
Contributor Author

rgmz commented Mar 2, 2025

On a large (15GB) repository, this reduced the scanning time from ~46m to ~32m.

Before

(1) 2:53PM INF scanned ~15450035936 bytes (15.45 GB) in 46m53s
...
(2) 7:27PM INF scanned ~15450035936 bytes (15.45 GB) in 46m7s

After

(1) 5:25PM INF scanned ~15450035936 bytes (15.45 GB) in 32m58s
...
(2) 6:15PM INF scanned ~15450035936 bytes (15.45 GB) in 32m41s

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from 25f2d70 to 13b4f91 Compare March 2, 2025 15:55
Copy link
Contributor Author

@rgmz rgmz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zricethezav This isn't perfect but it's ready for review.

}
}
return false, ""
}

func (a *Allowlist) Validate() error {
flagOnce.Do(func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on an 'optional' call to Validate to set internal state is bad. A better idea is probably to make a NewAllowlist(...opts) function, but I didn't want to add noise before this change got reviewed.

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch 4 times, most recently from 75b058f to 264f2d8 Compare April 13, 2025 14:55
@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from 264f2d8 to 227a4a0 Compare April 30, 2025 21:28
@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch 3 times, most recently from 85683f9 to ff80727 Compare May 19, 2025 15:04
@zricethezav
Copy link
Collaborator

this'll be up next after #1872

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from ff80727 to 9005a8b Compare June 1, 2025 17:15
@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from 9005a8b to ee86073 Compare June 6, 2025 01:31
@zricethezav
Copy link
Collaborator

I think this is good to go. @rgmz wdyt about giving this 2-3 weeks to bake as experimental then making this optimization the default behavior?

@rgmz
Copy link
Contributor Author

rgmz commented Jun 6, 2025

I think this is good to go. @rgmz wdyt about giving this 2-3 weeks to bake as experimental then making this optimization the default behavior?

That sounds good to me! What do you think about @bplaxco's suggestion to use a field on Config instead of a global feature flag?

@zricethezav
Copy link
Collaborator

That sounds good to me! What do you think about #1731 (comment)?

Sure, that sounds reasonable 👍🏻

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from 0568322 to 7ae6fbf Compare June 7, 2025 13:30
@rgmz

This comment was marked as resolved.

@rgmz rgmz force-pushed the feat/allowlist-ahocorasick branch from 7ae6fbf to 3caec4d Compare June 7, 2025 21:08
@bplaxco
Copy link
Contributor

bplaxco commented Jun 8, 2025

"Crisp and clean, no caffeine" 👌👌

@zricethezav zricethezav merged commit 9faaa4a into gitleaks:master Jun 8, 2025
2 checks passed
@rgmz rgmz deleted the feat/allowlist-ahocorasick branch June 8, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants