Skip to content

Conversation

bplaxco
Copy link
Contributor

@bplaxco bplaxco commented May 22, 2025

Description:

Related to #1841

Talked with @zricethezav to see if I could take a go at doing it in-memory and with a change to how sources are handled.

The idea is that sources could have a Fragments(FragmentFunc) error function where type FragmentFunc func(Fragment, error) error. The behavior should be like WalkDir where the callback determines how an error should be handled and if the callback returns an error, fragment traversal stops and the function returns that error.

Tasks:

  • Convert the file based detect functions to Source.Fragment calls
  • Update the cmds for files to use Source.Fragments
  • Add basic archive support to File
  • Add new tests for the File/Files changes
  • Fix issues with certain types of nested archives
  • Add Git source and update the DetectGit related functions ad cmds
  • Add archive support in Git
  • Lint the code
  • Determine how to best represent "inner" paths (e.g. /external.zip!inner/path.tar.gz!even/more/inner)
  • Add add remaining tests for Git
  • Adjust finding link paths to stop at the inner path separator (and write tests)
  • Add --max-archive-depth= (default 0) flag && update docs (thank you @alayne222!)
  • Review readers to look for places where they aren't closed
  • Confirm backwards compatibility (i.e. line numbers, secrets counts, etc)[1]
  • Log file path here
  • Performance test & tune
  • Profit 😎

[1]: I want to make sure I completely resolved the issue I was seeing before where the secret counts weren't lining up properly. Some of it was a problem related to me using a buffered reader in one spot and then using the reader again. But then there were some weird cases where I wasn't seeing the same number of secrets in every place.

Checklist:

  • Does your PR pass tests?
  • Have you written new tests for your changes?
  • Have you lint your code locally prior to submission?

@bplaxco bplaxco changed the title Archive support [WIP] Archive support May 23, 2025
@bplaxco bplaxco force-pushed the archives branch 4 times, most recently from 4181d51 to 0b4c674 Compare May 23, 2025 17:06
(Benchmark results are only added if tools are installed)

Example output:

```
========================================================================
generating profile data
------------------------------------------------------------------------
- mode: dir
  benchmark:
    tool: hyperfine
    path: profile/1748019612/dir/benchmark.json
    results:
      mean: 1.12852930166
      stddev: 0.12163775966037399
      median: 1.07242928416
      user: 3.9699382799999996
      system: 0.04382341999999999
      min: 1.03207673016
      max: 1.3913475181600001
  profile:
    - mode: cpu
      path: profile/1748019612/dir/cpu.pprof
      view: go tool pprof -http=localhost: ./gitleaks profile/1748019612/dir/cpu.pprof
    - mode: mem
      path: profile/1748019612/dir/mem.pprof
      view: go tool pprof -http=localhost: ./gitleaks profile/1748019612/dir/mem.pprof
    - mode: trace
      path: profile/1748019612/dir/trace.out
      view: go tool trace profile/1748019612/dir/trace.out
- mode: git
  ...snip...
```
@bplaxco bplaxco force-pushed the archives branch 2 times, most recently from 6c2621d to 08fd665 Compare May 24, 2025 01:07
@rgmz rgmz mentioned this pull request May 24, 2025
11 tasks
@bplaxco bplaxco force-pushed the archives branch 5 times, most recently from a33ca76 to 1e55376 Compare May 25, 2025 04:33
@bplaxco
Copy link
Contributor Author

bplaxco commented May 29, 2025

Okay, I'm pretty happy with backwards compatibility when --max-archive-depth isn't set. I'm not seeing any outputs in the diffs. Poking at the performance.

@bplaxco bplaxco changed the title [WIP] Archive support Archive support May 30, 2025
@bplaxco bplaxco marked this pull request as ready for review May 30, 2025 05:21
@bplaxco
Copy link
Contributor Author

bplaxco commented May 30, 2025

Stats

The archives and master commands are built from the branches by the same names. And the gitleaks.toml is the one from the archives branch. It was specified so that some repos with .gitleaks.toml's wouldn't change my test results.

Scanning the kubernetes repo:

./archives git \
  --config gitleaks.toml --exit-code=0 \
  --max-archive-depth=8 --max-decode-depth=8 \
  repos/git/kubernetes 
--- Statistics for Archives Run ---
  Number of samples: 28
  Mean VmRSS: 77109.14 kB
  Median VmRSS: 74426.00 kB
  Standard Deviation: 12802.88 kB
  Minimum VmRSS: 28624.00 kB
  Maximum VmRSS: 101880.00 kB
  Range (Max - Min): 73256.00 kB
./master git \
  --config gitleaks.toml --exit-code=0 \
  --max-decode-depth=8 \
  repos/git/kubernetes 
--- Statistics for Master Run ---
  Number of samples: 18
  Mean VmRSS: 69884.22 kB
  Median VmRSS: 67014.00 kB
  Standard Deviation: 22562.21 kB
  Minimum VmRSS: 24572.00 kB
  Maximum VmRSS: 116020.00 kB
  Range (Max - Min): 91448.00 kB

Scanning the gitlab repo:

./archives git \
  --config gitleaks.toml --exit-code=0 \
  --max-archive-depth=8 --max-decode-depth=8 \
  repos/git/gitlab 
324456 commits scanned.  # fewer due to diff-filter
scanned ~3144707626 bytes (3.14 GB) in 19m6s
  --- Statistics for Archives Run ---
  Number of samples: 77
  Mean VmRSS: 535470.08 kB
  Median VmRSS: 483032.00 kB
  Standard Deviation: 132735.50 kB
  Minimum VmRSS: 24268.00 kB
  Maximum VmRSS: 836596.00 kB
  Range (Max - Min): 812328.00 kB
./master git \
  --config gitleaks.toml --exit-code=0 \
  --max-decode-depth=8 \
  repos/git/gitlab 
325000 commits scanned.
scanned ~1486234691 bytes (1.49 GB) in 18m6s
--- Statistics for Master Run ---
  Number of samples: 73
  Mean VmRSS: 540648.44 kB
  Median VmRSS: 520232.00 kB
  Standard Deviation: 138287.48 kB
  Minimum VmRSS: 21216.00 kB
  Maximum VmRSS: 842784.00 kB
  Range (Max - Min): 821568.00 kB

--- Comparative Analysis ---
  Average VmRSS difference: -5178.36 kB
  Max VmRSS difference: -6188.00 kB
  Standard Deviation difference: -5551.99 kB

@zricethezav
Copy link
Collaborator

@bplaxco so it looks like the archive branch profiled against the k8 repo uses significantly less memory (20MB) than master? Against GitLab it looks about the same. Any hypothesis why we see the archive perform well against k8? Is it as simple as yield is more memory efficient that sending on chans?

@bplaxco
Copy link
Contributor Author

bplaxco commented May 30, 2025

@bplaxco so it looks like the archive branch profiled against the k8 repo uses significantly less memory (20MB) than master? Against GitLab it looks about the same. Any hypothesis why we see the archive perform well against k8? Is it as simple as yield is more memory efficient that sending on chans?

@zricethezav I'm not sure, would you be opposed to me tweaking the diagnostic feature to use net/http/pprof so you can sample the pprof data as needed during the run before it writes out the final dump at the end?

I might be able to take some of those and the pprof diff flags to see where the savings are happening.

My gut says it's prob yield + not scanning as many commits, but that's just a guess.

@zricethezav
Copy link
Collaborator

@zricethezav I'm not sure, would you be opposed to me tweaking the diagnostic feature to use net/http/pprof so you can sample the pprof data as needed during the run before it writes out the final dump at the end?

If you're up for it!

Copy link
Contributor

@rgmz rgmz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments and observations, while I wait for this to chew through a >10GB repository...

@bplaxco bplaxco force-pushed the archives branch 5 times, most recently from f8cd34c to 9ed45b8 Compare June 1, 2025 03:14
- Update blobReader.Close() to discard the buffer
- Misc logger issues & uses
- Tweak .golangci.yaml to default to none
- Discard remaining blobReader data on close
- Undo a De Morgan's Law suggestion (and disable QF1001)
gitleaks --diagnostics=http ...
-----

From the net/http/pprof docs:

Use the pprof tool to look at the heap profile:
```
go tool pprof http://localhost:6060/debug/pprof/heap
```

Or to look at a 30-second CPU profile:
```
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
```

Or to look at the goroutine blocking profile, after calling runtime.SetBlockProfileRate in your program:
```
go tool pprof http://localhost:6060/debug/pprof/block
```

Or to look at the holders of contended mutexes, after calling runtime.SetMutexProfileFraction in your program:
```
go tool pprof http://localhost:6060/debug/pprof/mutex
```

(For more info see https://pkg.go.dev/net/http/pprof)
@bplaxco
Copy link
Contributor Author

bplaxco commented Jun 1, 2025

@zricethezav I'm not sure, would you be opposed to me tweaking the diagnostic feature to use net/http/pprof so you can sample the pprof data as needed during the run before it writes out the final dump at the end?

If you're up for it!

Ended up not being too bad ^_^ 2886f77

@@ -118,7 +118,7 @@ jobs:
- id: gitleaks
```

for a [native execution of GitLeaks](https://github.com/gitleaks/gitleaks/releases) or use the [`gitleaks-docker` pre-commit ID](https://github.com/gitleaks/gitleaks/blob/master/.pre-commit-hooks.yaml) for executing GitLeaks using the [official Docker images](#docker)
for a [native execution of gitleaks](https://github.com/gitleaks/gitleaks/releases) or use the [`gitleaks-docker` pre-commit ID](https://github.com/gitleaks/gitleaks/blob/master/.pre-commit-hooks.yaml) for executing gitleaks using the [official Docker images](#docker)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave these capitalized


The [compression](https://github.com/mholt/archives?tab=readme-ov-file#supported-compression-formats)
and [archive](https://github.com/mholt/archives?tab=readme-ov-file#supported-archive-formats)
formats supported by mholt's [archives package](https://github.com/mholt/archives)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mholt, thanks for the excellent archives package! 🙇🏻

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're welcome!! Glad you find it useful.

@zricethezav
Copy link
Collaborator

Amazing work @bplaxco! (per usual). @rgmz thank you for the help in the review 🙇🏻. I think this PR is in fantastic shape. All the testing, profiling, and manually running gives me confidence this likely won't break anything. If it does, we'll fix'er up real quick.

Not only do we get archive scanning out of this but it's also a huge improvement to the source package. Sources generate fragments. Simple as.

Gonna hit the merge button

@zricethezav zricethezav merged commit 782f310 into gitleaks:master Jun 1, 2025
2 checks passed
@bplaxco bplaxco deleted the archives branch June 1, 2025 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants