zricethezav
Collaborator

Description:

Attempts to introduce decoding of escaped Unicode. This supports two kinds of escaped Unicode: standard notation and the common \, \\ escape sequences.
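To make the two kinds of input concrete, here is a minimal sketch in Python. It is not the code from this PR. The description does not spell out the exact syntax, so the sketch assumes the forms are `U+XXXX` for standard notation and `\uXXXX` / `\\uXXXX` for the escape sequences, with exactly four hex digits. The name `decode_unicode_escapes` is hypothetical.

```python
import re

# Assumed forms (not spelled out in the PR description):
#   U+0041            standard code point notation
#   \u0041, \\u0041   escape sequences with one or two backslashes
_UNICODE_ESCAPE = re.compile(r"(?:U\+|\\{1,2}u)([0-9A-Fa-f]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Replace each matched escape with the character it encodes."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

For example, under these assumptions `decode_unicode_escapes(r"\u0041\u0042")` gives `"AB"`, and text without any escapes comes back unchanged. A scanner would then run its rules over the decoded text as well as the original.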

Checklist:

  • Does your PR pass tests?
  • Have you written new tests for your changes?
  • Have you linted your code locally prior to submission?

@zricethezav
Collaborator Author

@bplaxco would love to see if this significantly slows down your benchmarks

@bplaxco
Contributor

bplaxco commented May 14, 2025

Note: I wouldn't call this "good testing". I just kinda ran a few things when I had a minute yesterday evening and didn't really get to sit down and do it carefully. But I figured I'd share what I got regardless ^_^

Did some basic hyperfine tests (ignore errors, 3 warmups, 10 runs):

baseline == master branch
unicode-decoder == this branch rebased on master

I pulled in my own gitleaks config just because I wanted a consistent set of patterns between these two builds and the gitleaks command I had installed from my package manager.

Benchmark 1: ./baseline --config gitleaks.toml git --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      5.275 s ±  0.052 s    [User: 24.035 s, System: 1.360 s]
  Range (min … max):    5.197 s …  5.336 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: ./unicode-decoder --config gitleaks.toml git --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      5.305 s ±  0.174 s    [User: 24.855 s, System: 1.348 s]
  Range (min … max):    5.168 s …  5.756 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./baseline --config gitleaks.toml git --max-decode-depth 8 gitleaks.git ran
    1.01 ± 0.03 times faster than ./unicode-decoder --config gitleaks.toml git --max-decode-depth 8 gitleaks.git

Benchmark 1: ./baseline --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      66.4 ms ±   4.2 ms    [User: 71.2 ms, System: 23.1 ms]
  Range (min … max):    60.2 ms …  76.3 ms    38 runs

Benchmark 2: ./unicode-decoder --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      59.5 ms ±   2.1 ms    [User: 62.5 ms, System: 21.9 ms]
  Range (min … max):    55.4 ms …  64.2 ms    47 runs

Summary
  ./unicode-decoder --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git ran
    1.12 ± 0.08 times faster than ./baseline --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git

(note: dir is probably so long because I expect it's going down into the .git dir for the repo and scanning those large files serially)
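As a sanity check on the two Summary lines above, the ratios and their uncertainties can be recomputed from the reported means and standard deviations. This sketch assumes first-order error propagation for a ratio of two independent measurements; that assumption reproduces the printed figures, but it is not taken from hyperfine's source. The function name `speedup` is hypothetical.

```python
import math


def speedup(slow_mean, slow_sd, fast_mean, fast_sd):
    """Return (ratio, uncertainty) for slow_mean / fast_mean.

    Relative uncertainties are added in quadrature (first-order propagation).
    """
    ratio = slow_mean / fast_mean
    rel = math.sqrt((slow_sd / slow_mean) ** 2 + (fast_sd / fast_mean) ** 2)
    return ratio, ratio * rel


# git mode, in seconds: unicode-decoder (slower) vs baseline (faster)
git_ratio, git_err = speedup(5.305, 0.174, 5.275, 0.052)
# dir mode, in milliseconds: baseline (slower) vs unicode-decoder (faster)
dir_ratio, dir_err = speedup(66.4, 4.2, 59.5, 2.1)

print(f"git: {git_ratio:.2f} ± {git_err:.2f}")  # git: 1.01 ± 0.03
print(f"dir: {dir_ratio:.2f} ± {dir_err:.2f}")  # dir: 1.12 ± 0.08
```

For the git run the uncertainty (0.03) is larger than the measured difference (0.01), so that result is within noise.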

I did one diagnostics run of it against the kubernetes repo:

(Note: I had CPU and memory diagnostics running at the same time which probably isn't the best idea for clean results).

[image: diagnostics output from the kubernetes run]

Thoughts:

Looks like the kubernetes repo ends up being a great benchmarking repo; there's lots of b64, percent-encoded, and unicode-escaped data in there.

It'd probably be good to do something like this and pick through it, preferably on an idle system:

git clone --mirror git@github.com:kubernetes/kubernetes.git

hyperfine \
  --export-json unicode-perf.json -w 3 -i \
  './baseline --config gitleaks.toml git --max-decode-depth 8 kubernetes.git' \
  './unicode-decoder --config gitleaks.toml git --max-decode-depth 8 kubernetes.git' \
  './baseline --config gitleaks.toml git kubernetes.git' \
  './unicode-decoder --config gitleaks.toml git kubernetes.git'

# Assuming unicode-decoder is rebased on master
./baseline --diagnostics-dir=baseline-k8s --diagnostics=cpu --config gitleaks.toml git --max-decode-depth 8 kubernetes.git
./unicode-decoder --diagnostics-dir=unicode-k8s --diagnostics=cpu --config gitleaks.toml git --max-decode-depth 8 kubernetes.git
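Once the hyperfine run above has written unicode-perf.json, something like the following could help with picking through it. This is a rough sketch: it relies on the `results` array with `command` and `mean` fields in hyperfine's JSON export, and the helper names `load_report` and `relative_means` are made up for illustration.

```python
import json


def load_report(path: str) -> dict:
    """Load a file written by hyperfine's --export-json option."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def relative_means(report: dict) -> list[tuple[str, float, float]]:
    """Return (command, mean seconds, mean relative to fastest), fastest first."""
    rows = sorted(report["results"], key=lambda r: r["mean"])
    fastest = rows[0]["mean"]
    return [(r["command"], r["mean"], r["mean"] / fastest) for r in rows]
```

Calling `relative_means(load_report("unicode-perf.json"))` lists the four commands from fastest to slowest, which makes it easy to see whether the decoder costs anything with and without --max-decode-depth 8.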

@zricethezav zricethezav merged commit 0589ae0 into master May 14, 2025
5 checks passed
@zricethezav zricethezav deleted the unicode-decoder branch May 14, 2025 15:14
alayne222 pushed a commit to alayne222/gitleaks that referenced this pull request May 28, 2025