cleanup(modern_bpf): r/w ro ebpf maps best practices #2399
Conversation
…all_table` and `g_ia32_to_64_table` to rodata. Signed-off-by: Federico Di Pierro <nierro92@gmail.com> Co-authored-by: Luca Guerra <luca@guerra.sh>
Please double check driver/API_VERSION file. See versioning. /hold
@FedeDP: if you agree, I would propose this for version 0.21.0, WDYT?
…s table Signed-off-by: Luca Guerra <luca@guerra.sh> Co-authored-by: Kondah Mouad <kondah.mouad@gmail.com>
Signed-off-by: Luca Guerra <luca@guerra.sh>
Force-pushed from a366d3f to a8c9ae0 (compare)
Perf diff from master - unit tests
Heap diff from master - unit tests
Heap diff from master - scap file
Benchmarks diff from master
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #2399 +/- ##
=======================================
Coverage 77.26% 77.26%
=======================================
Files 227 227
Lines 30323 30323
Branches 4644 4644
=======================================
Hits 23429 23429
Misses 6894 6894
/milestone 0.21.0
Sure Luca!
Will run the kernel-testing framework against this PR and post the results! 🙏
Also would love a review by @Andreagit97 ❤️ |
Everything good!
X64:
Thank you for the changes! As I wrote in one comment, the only downside I can see is a possible decrease in perf. We should probably keep an eye on possible perf regressions after we merge this.
userspace/libpman/src/maps.c
Outdated
struct capture_settings settings;
pman_get_capture_settings(&settings);
I would return here if we are not able to get the settings. Otherwise there is the risk we will write junk at the next update.
Suggested change:
-struct capture_settings settings;
-pman_get_capture_settings(&settings);
+struct capture_settings settings = { 0 };
+if(pman_get_capture_settings(&settings) != 0) {
+	return;
+}
Makes sense, done!
userspace/libpman/src/maps.c
Outdated
continue;
}
}
}

int pman_init_settings_map() {
It sounds like pman_update_capture_settings with an initial struct capture_settings settings = {}; maybe we can avoid the duplication.
Good catch, implemented your suggestion.
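A minimal sketch of that deduplication idea (function names are taken from this thread; the body is illustrative, not the actual patch):

int pman_init_settings_map() {
	/* Initialize the settings map by reusing the update path with zeroed settings. */
	struct capture_settings settings = { 0 };
	return pman_update_capture_settings(&settings);
}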
-static __always_inline bool maps__64bit_interesting_syscall(uint32_t syscall_id) {
-	return g_64bit_interesting_syscalls_table[syscall_id & (SYSCALL_TABLE_SIZE - 1)];
+static __always_inline bool maps__interesting_syscall_64bit(uint32_t syscall_id) {
+	bool *ret = bpf_map_lookup_elem(&interesting_syscalls_table_64bit, &syscall_id);
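(For reference, a hedged reconstruction of the full function after this change; the NULL check and `return *ret;` are visible in the xlated dump later in this thread:)

static __always_inline bool maps__interesting_syscall_64bit(uint32_t syscall_id) {
	bool *ret = bpf_map_lookup_elem(&interesting_syscalls_table_64bit, &syscall_id);
	if(ret == NULL) {
		/* The lookup can only fail for out-of-range ids: treat them as not interesting. */
		return false;
	}
	return *ret;
}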
This is the main reason we chose a memory-mapped global variable initially. Calling a bpf helper many times (more than 20 million/s) could decrease perf. We are probably talking about a few ns, but under heavy loads it could have a little impact.
As far as we understand it, the BPF_F_MMAPABLE flag mmaps the eBPF map into userspace memory, improving performance for userspace only; on the eBPF side, reads and writes still go through the helpers. Perhaps we are wrong though?
See:
https://docs.kernel.org/bpf/map_array.html
Since kernel 5.5, memory mapping may be enabled for BPF_MAP_TYPE_ARRAY by setting the flag BPF_F_MMAPABLE. The map definition is page-aligned and starts on the first page. Sufficient page-sized and page-aligned blocks of memory are allocated to store all array values, starting on the second page, which in some cases will result in over-allocation of memory. The benefit of using this is increased performance and ease of use since userspace programs would not be required to use helper functions to access and mutate data.
https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_ARRAY/#bpf_f_mmapable
Setting this flag on a BPF_MAP_TYPE_ARRAY will allow userspace programs to mmap the array values into the userspace process, effectively making a shared memory region between eBPF programs and a userspace program.
This can significantly improve read and write performance since there is no syscall overhead to access the map.
https://patchwork.ozlabs.org/project/netdev/patch/20191117172806.2195367-4-andriin@fb.com/
Add ability to memory-map contents of BPF array map. This is extremely useful
for working with BPF global data from userspace programs. It allows to avoid
typical bpf_map_{lookup,update}_elem operations, improving both performance
and usability.
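As a rough illustration of what BPF_F_MMAPABLE enables (a sketch assuming a libbpf skeleton `skel`; the map name and sizes are made up for the example):

/* BPF side: an array map explicitly marked as mmapable. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(map_flags, BPF_F_MMAPABLE);
	__uint(max_entries, 512);
	__type(key, uint32_t);
	__type(value, bool);
} demo_table SEC(".maps");

/* Userspace side: mmap the value area (512 bool values fit in one 4096-byte
 * page) and write directly, with no bpf() syscall per access. In-kernel
 * programs still read it through bpf_map_lookup_elem(). */
bool *vals = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                  bpf_map__fd(skel->maps.demo_table), 0);
if(vals != MAP_FAILED) {
	vals[42] = true;
}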
@FedeDP is right, that's what we figured out during the analysis for it. The only doubt I had about this is some extra perf optimization that might happen because of how libbpf handles .bss, .rodata and .data with a special instruction: https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_ARRAY/#global-data.
It is unclear whether that has a performance impact; I agree that the easiest way is to keep an eye on perf after it's merged.
Uhm, if I recall correctly, global variables are accessed without helpers, just using direct memory access. I quickly tried it:
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

/* Global variable: lands in .bss and is accessed via direct memory loads/stores. */
unsigned long stop = 0;

SEC("tp_btf/sys_enter")
int global(void *ctx) {
	stop++;
	return 0;
}

/* The same counter, but kept in a regular array map accessed via helper. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, uint32_t);
	__type(value, unsigned long);
} stop_map SEC(".maps");

SEC("tp_btf/sys_enter")
int map_array(void *ctx) {
	uint32_t key = 0;
	unsigned long *value = bpf_map_lookup_elem(&stop_map, &key);
	*value++; /* note: this increments the pointer, so the load is dead and gets optimized out in the dump below */
	return 0;
}
Globals:
0: R1=ctx() R10=fp0
; stop++; @ globals.bpf.c:12
0: (18) r1 = 0xffffbfb7c1226000 ; R1_w=map_value(map=globals_.bss,ks=4,vs=8)
2: (79) r2 = *(u64 *)(r1 +0) ; R1_w=map_value(map=globals_.bss,ks=4,vs=8) R2_w=scalar()
3: (07) r2 += 1 ; R2_w=scalar()
4: (7b) *(u64 *)(r1 +0) = r2 ; R1_w=map_value(map=globals_.bss,ks=4,vs=8) R2_w=scalar()
; return 0; @ globals.bpf.c:13
5: (b7) r0 = 0 ; R0_w=0
6: (95) exit
processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
Map:
0: R1=ctx() R10=fp0
; int map_array(void *ctx) { @ globals.bpf.c:24
0: (b7) r1 = 0 ; R1_w=0
; uint32_t key = 0; @ globals.bpf.c:25
1: (63) *(u32 *)(r10 -4) = r1
mark_precise: frame0: last_idx 1 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 0
2: R1_w=0 R10=fp0 fp-8=0000????
2: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
; @ globals.bpf.c:0
3: (07) r2 += -4 ; R2_w=fp-4
; unsigned long *value = bpf_map_lookup_elem(&stop_map, &key); @ globals.bpf.c:26
4: (18) r1 = 0xffff9a8a4296b400 ; R1_w=map_ptr(map=stop_map,ks=4,vs=8)
6: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=1,map=stop_map,ks=4,vs=8)
; return 0; @ globals.bpf.c:28
7: (b7) r0 = 0 ; R0_w=0
8: (95) exit
processed 8 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
Looking at the bytecode of our probe before/after, we should probably see the same.
Cool, thanks for clarifying! I guess you are right then: the emitted code is more optimized with the globals.
Hi @Andreagit97, makes sense, but fortunately the kernel is helping us by optimizing it away.
Indeed, the two compiled programs look different:
Using a map (i.e. this patch):
0000000000000090 <LBB0_26>:
18: 63 7a f8 ff 00 00 00 00 *(u32 *)(r10 - 0x8) = r7
19: bf a2 00 00 00 00 00 00 r2 = r10
20: 07 02 00 00 f8 ff ff ff r2 += -0x8
21: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
23: 85 00 00 00 01 00 00 00 call 0x1
Using .bss:
0000000000000090 <LBB0_26>:
18: bf 81 00 00 00 00 00 00 r1 = r8
19: 57 01 00 00 ff 01 00 00 r1 &= 0x1ff
20: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
22: 0f 12 00 00 00 00 00 00 r2 += r1
23: 71 22 00 00 00 00 00 00 r2 = *(u8 *)(r2 + 0x0)
That is the compiled object file, and there is obviously a call. However, at load time the kernel is able to optimize this, since it's an (almost) fixed-size array map; the result uses more instructions than the .bss version, but I guess it's not as bad as an actual call. Let's take a look:
# bpftool prog dump xlated name sys_enter opcodes
...
17: (18) r1 = map[id:8834]
18 11 00 00 82 22 00 00 00 00 00 00 00 00 00 00
19: (07) r1 += 272
07 01 00 00 10 01 00 00
20: (61) r0 = *(u32 *)(r2 +0)
61 20 00 00 00 00 00 00
21: (35) if r0 >= 0x200 goto pc+3
35 00 03 00 00 02 00 00
22: (67) r0 <<= 3
67 00 00 00 03 00 00 00
23: (0f) r0 += r1
0f 10 00 00 00 00 00 00
24: (05) goto pc+1
05 00 01 00 00 00 00 00
25: (b7) r0 = 0
b7 00 00 00 00 00 00 00
; if(ret == NULL) {
26: (55) if r0 != 0x0 goto pc+30
55 00 1e 00 00 00 00 00
...
; return *ret;
57: (71) r1 = *(u8 *)(r0 +0)
71 01 00 00 00 00 00 00
Originally we had 5 instructions; in the map version we have 10 instructions at the eBPF level. Not great but acceptable, I think. Also, I'm not sure what happens at native-code JIT compile time, or which kernel versions are able to perform this optimization.
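One way to check the native code would be to dump the JITed program the same way (command sketch, same program name as above):

# bpftool prog dump jited name sys_enter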
Uhm, I see, I never noticed that! Yes, in this case the overhead seems negligible.
int pman_mark_single_64bit_syscall(int syscall_id, bool interesting) {
	char error_message[MAX_ERROR_MESSAGE_LEN];
	int fd = bpf_map__fd(g_state.skel->maps.interesting_syscalls_table_64bit);
	if(bpf_map_update_elem(fd, &syscall_id, &interesting, BPF_ANY) < 0) {
If this is called frequently it could have an impact as well, but AFAIK we don't call it that frequently at runtime.
Exactly, this is not a problem (as far as I know, nobody actually calls this at runtime, only at engine init time; also, given we have ~500 ppm_sc codes, I don't expect people to call it more than ~100 times for each batch).
There's no noticeable difference in startup times, and even if a client were to dynamically set the interesting syscall set, it wouldn't do it that frequently. I toyed with the idea of updating more elements in a single call, but I don't think it's worth it right now.
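For reference, a batched update could look roughly like this (a sketch assuming libbpf's bpf_map_update_batch(), available on kernels with batch-ops support; the array names are illustrative):

/* Hypothetical batched version: keys[] holds the syscall ids, values[] the
 * corresponding `interesting` flags, fd is the map fd as above. */
uint32_t count = n_syscalls;
LIBBPF_OPTS(bpf_map_batch_opts, opts);
if(bpf_map_update_batch(fd, keys, values, &count, &opts) < 0) {
	/* on error, count reports how many elements were processed;
	 * a fallback could loop over per-element bpf_map_update_elem() calls */
}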
Signed-off-by: Luca Guerra <luca@guerra.sh> Co-authored-by: Andrea Terzolo <andreaterzolo3@gmail.com>
Thanks for the review @Andreagit97, addressed your comments. On the performance side we should keep an eye on it, but it seems like we're only adding a few instructions and no calls, if the loader is smart enough.
/approve
LGTM label has been added. Git tree hash: 839a6b6a14983b18d261a90ddd6e60c1b31ef040
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Andreagit97, FedeDP, LucaGuerra. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/unhold
What type of PR is this?
/kind cleanup
Any specific area of the project related to this PR?
/area driver-modern-bpf
Does this PR require a change in the driver versions?
No
What this PR does / why we need it:
Our eBPF probe is saving many things, including the interesting syscall set, into the .bss section, which is set by libbpf as mmapable. It is a best practice not to include security-sensitive settings in an mmapable fd, so we are separating these, plus the general eBPF probe settings, into their own maps; in addition, we are setting variables that are never written to read-only (by using .rodata).
Credits to @mouadk for looking at the maps permissions and suggesting this improvement!
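As a small illustration of the .rodata idiom (hypothetical names; with libbpf, a `const volatile` global lands in .rodata, can be filled through the skeleton before load, and is frozen read-only afterwards):

/* BPF side: a table the probe only ever reads. */
const volatile bool g_example_table[512] = {};

/* Userspace side: writable through the skeleton only before bpf_object__load(). */
skel->rodata->g_example_table[42] = true;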
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: