-
Notifications
You must be signed in to change notification settings - Fork 769
kallsyms: optimise parsing #1584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great optimization! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Alban for making it less memory hungry and faster :)
fields := bytes.Fields(scanner.Bytes()) | ||
line := scanner.Bytes() | ||
// bytes.Fields is expensive, so filter for '[' first | ||
if !bytes.ContainsRune(line, '[') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not that sure (since I don't know the format that well), but if the closing bracket ]
is always at the end of the line we could utilize that to improve performance even further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean replacing bytes.ContainsRune
by something like this?
if strings.HasSuffix(line, "]")
That should filter the same amount of lines as bytes.ContainsRune
, so it's a matter of if strings.HasSuffix
is faster than bytes.ContainsRune
. I have not tested it but I suspect it is similar for short lines like the ones in kallsyms, and in any cases would not change the amount of memory allocations.
The printfs for /proc/kallsyms are implemented here:
https://github.com/torvalds/linux/blob/eca631b8fe808748d7585059c4307005ca5c5820/kernel/kallsyms.c#L446-L465
The last change was 3 years ago to add the buildid inside the brackets:
torvalds/linux@9294523#diff-9538b26d3e082f233e6adac664cd2c14cbf2d510d5d7f286eef329c58de87eadR451
This is only for kernels with CONFIG_STACKTRACE_BUILD_ID
(which is not set for my kernel). But the way cilium/ebpf parses the line is fine with regards to that change.
So I would say the format seems more or less stable but can change. And testing "contains-[" should be more stable than testing "suffix-]", in case Linux adds new fields at the end.
That is what I meant. Instead of looping through every rune in the string it only needs to check the last one.
That is what I had feared |
This is a nice band-aid, but conflicts with what #1578 is trying to do. Reading this PR here made me realize the proposal in #1578 is even slower and more expensive, that will need to be improved, and we shouldn't have multiple ways of scanning kallsyms. As an experiment, I wrote a naive barebones line parser for kallsyms that calls
I got your results down to this, basically cutting down another 3/4 of the size and number of allocations:
Note: this similarly skips lines containing Perhaps we can write a simple kallsyms iterator on top of
This avoids having to allocate an individual |
In tests with pprof in Inspektor Gadget, I see allocations getting reduced from 34MB to 12MB. I also added a standalone benchmark. tl;dr: 12% faster, 42% less allocations in bytes, 48% less allocations $ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee main.bench $ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee patched.bench $ benchstat main.bench patched.bench goos: linux goarch: amd64 pkg: github.com/cilium/ebpf/internal/kallsyms cpu: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz │ main.bench │ patched.bench │ │ sec/op │ sec/op vs base │ KernelModuleMemory-4 299.5m ± 9% 262.3m ± 11% -12.44% (p=0.000 n=10) │ main.bench │ patched.bench │ │ B/op │ B/op vs base │ KernelModuleMemory-4 37.70Mi ± 0% 21.67Mi ± 0% -42.53% (p=0.000 n=10) │ main.bench │ patched.bench │ │ allocs/op │ allocs/op vs base │ KernelModuleMemory-4 432.1k ± 0% 221.9k ± 0% -48.64% (p=0.000 n=10) Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com>
I rebased my PR to fix the conflict with #1578. Your idea of custom scanner/iterator looks interesting. Is it something you want to write after this PR gets merged? |
In tests with pprof in Inspektor Gadget, I see allocations getting reduced from 34MB to 12MB.
I also added a standalone benchmark.
tl;dr: 12% faster, 42% less allocations in bytes, 48% less allocations
xref inspektor-gadget/inspektor-gadget#3571 (comment)
cc @burak-ok