Skip to content

Conversation

alban
Copy link
Contributor

@alban alban commented Oct 14, 2024

In tests with pprof in Inspektor Gadget, I see allocations getting reduced from 34MB to 12MB.

I also added a standalone benchmark.

tl;dr: 12% faster, 42% less allocations in bytes, 48% less allocations

$ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee main.bench
$ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee patched.bench
$ benchstat main.bench patched.bench
goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf/internal/kallsyms
cpu: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
                     │ main.bench  │            patched.bench            │
                     │   sec/op    │   sec/op     vs base                │
KernelModuleMemory-4   299.9m ± 3%   262.1m ± 6%  -12.59% (p=0.000 n=10)

                     │  main.bench  │            patched.bench             │
                     │     B/op     │     B/op      vs base                │
KernelModuleMemory-4   37.70Mi ± 0%   21.67Mi ± 0%  -42.53% (p=0.000 n=10)

                     │ main.bench  │            patched.bench            │
                     │  allocs/op  │  allocs/op   vs base                │
KernelModuleMemory-4   432.1k ± 0%   221.9k ± 0%  -48.64% (p=0.000 n=10)

xref inspektor-gadget/inspektor-gadget#3571 (comment)

cc @burak-ok

Copy link
Contributor

@florianl florianl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great optimization! 👍

Copy link
Contributor

@burak-ok burak-ok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Alban for making it less memory hungry and faster :)

fields := bytes.Fields(scanner.Bytes())
line := scanner.Bytes()
// bytes.Fields is expensive, so filter for '[' first
if !bytes.ContainsRune(line, '[') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not that sure (since I don't know the format that well), but if the closing bracket ] is always at the end of the line we could utilize that to improve performance even further

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean replacing bytes.ContainsRune by something like this?

if strings.HasSuffix(line, "]")

That should filter the same amount of lines as bytes.ContainsRune, so it's a matter of if strings.HasSuffix is faster than bytes.ContainsRune. I have not tested it but I suspect it is similar for short lines like the ones in kallsyms, and in any cases would not change the amount of memory allocations.

The printfs for /proc/kallsyms are implemented here:
https://github.com/torvalds/linux/blob/eca631b8fe808748d7585059c4307005ca5c5820/kernel/kallsyms.c#L446-L465

The last change was 3 years ago to add the buildid inside the brackets:
torvalds/linux@9294523#diff-9538b26d3e082f233e6adac664cd2c14cbf2d510d5d7f286eef329c58de87eadR451
This is only for kernels with CONFIG_STACKTRACE_BUILD_ID (which is not set for my kernel). But the way cilium/ebpf parses the line is fine with regards to that change.

So I would say the format seems more or less stable but can change. And testing "contains-[" should be more stable than testing "suffix-]", in case Linux adds new fields at the end.

@burak-ok
Copy link
Contributor

so it's a matter of if strings.HasSuffix is faster than bytes.ContainsRune

That is what I meant. Instead of looping through every rune in the string it only needs to check the last one.

And testing "contains-[" should be more stable than testing "suffix-]", in case Linux adds new fields at the end.

That is what I had feared

@ti-mo
Copy link
Collaborator

ti-mo commented Oct 15, 2024

This is a nice band-aid, but conflicts with what #1578 is trying to do. Reading this PR here made me realize the proposal in #1578 is even slower and more expensive, that will need to be improved, and we shouldn't have multiple ways of scanning kallsyms.

As an experiment, I wrote a naive barebones line parser for kallsyms that calls bytes.IndexAny(line, " ") repeatedly and re-slices the line:

func parseLine(line []byte) (addr uint64, name string, mod string, error)

I got your results down to this, basically cutting down another 3/4 of the size and number of allocations:

BenchmarkKernelModuleMemory-16    	       5	 220576087 ns/op	13158904 B/op	  148693 allocs/op

Note: this similarly skips lines containing [ and looks for tT symbol types only.

Perhaps we can write a simple kallsyms iterator on top of bufio.ReadLine (not using bufio.Scanner() since it allocates a ton). I'm not sure if we need to deal with utf-8 for splitting on space and \t. Having one iterator that can jump to the next line or word boundary would be ideal.

func (s *Scanner) NextLine() bool
func (s *Scanner) NextWord() bool

// These only return the current word, never a full line.
func (s *Scanner) WordBytes() []byte
func (s *Scanner) WordText() string

This avoids having to allocate an individual bufio.ScanWords scanner around a bytes.Reader for every line, since we have many short lines we need to parse into a (uint64, string) or (uint64, string, string) if it involved a kmod. If we need to filter out certain symbol types, we can call s.NextLine() if we're not interested in the remainder of the line. What we need to avoid is allocating slices and letting them escape to the heap, which is something bytes.Fields does in spades. Returning [][]byte should always result in 4 or 5 slice headers and a backing array for the outer slice being put on the heap.

In tests with pprof in Inspektor Gadget, I see allocations getting
reduced from 34MB to 12MB.

I also added a standalone benchmark.

tl;dr: 12% faster, 42% less allocations in bytes, 48% less allocations

$ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee main.bench
$ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee patched.bench
$ benchstat main.bench patched.bench
goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf/internal/kallsyms
cpu: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
                     │ main.bench  │            patched.bench             │
                     │   sec/op    │    sec/op     vs base                │
KernelModuleMemory-4   299.5m ± 9%   262.3m ± 11%  -12.44% (p=0.000 n=10)

                     │  main.bench  │            patched.bench             │
                     │     B/op     │     B/op      vs base                │
KernelModuleMemory-4   37.70Mi ± 0%   21.67Mi ± 0%  -42.53% (p=0.000 n=10)

                     │ main.bench  │            patched.bench            │
                     │  allocs/op  │  allocs/op   vs base                │
KernelModuleMemory-4   432.1k ± 0%   221.9k ± 0%  -48.64% (p=0.000 n=10)

Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com>
@alban
Copy link
Contributor Author

alban commented Oct 16, 2024

I rebased my PR to fix the conflict with #1578.

Your idea of custom scanner/iterator looks interesting. Is it something you want to write after this PR gets merged?

@ti-mo
Copy link
Collaborator

ti-mo commented Oct 17, 2024

@alban Closing this in favor of #1588. It reduces allocs more or less to the same degree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants