bpf: Add 'host not ready' drop reason if host policy cannot be enforced #29482
Merged
Conversation
CI trigger comments (`/test`, `/test-e2e-upgrade`, `/ci-e2e-upgrade`) and force-pushes: ca3fe28 → 4bf367f, 32812b0 → bc36f7d → 0895c5f → 9c60f4f, 50c61ab → ad8d210.
brlbil
approved these changes
Jan 5, 2024
CI part LGTM.
nathanjsweet
approved these changes
Jan 5, 2024
michi-covalent
pushed a commit
that referenced
this pull request
Jul 16, 2024
This excludes the drop reason introduced in #29482. It occurs when Cilium is first installed on a node, the host firewall is enabled, a workload endpoint gets created before the host endpoint, and the workload endpoint in question tries to talk to the host. Preventing these drops would require redesigning parts of the datapath, particularly the clustermesh bootstrap procedure. This is not feasible at the moment, and maybe it's not the right thing to do. Signed-off-by: Timo Beckers <timo@isovalent.com>
michi-covalent
pushed a commit
that referenced
this pull request
Aug 5, 2024
[ cherry-picked from cilium/cilium-cli repository ] Same message as the Jul 16 commit above: excludes the drop reason introduced in #29482. Signed-off-by: Timo Beckers <timo@isovalent.com>
github-merge-queue bot
pushed a commit
that referenced
this pull request
Aug 16, 2024
[ cherry-picked from cilium/cilium-cli repository ] Same message as the Jul 16 commit above: excludes the drop reason introduced in #29482. Signed-off-by: Timo Beckers <timo@isovalent.com>
Labels
area/agent
Cilium agent related.
area/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
backport/author
The backport will be carried out by the author of the PR.
backport-done/1.13
The backport for Cilium 1.13.x for this PR is done.
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
backport-done/1.15
The backport for Cilium 1.15.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
Repurpose drop reasons that have been unused since 61fb508
("bpf: Rename unused drop defines to DROP_UNUSED*") to signal that the host
endpoint's policy program was invoked before it was loaded.
bpf_lxc.c contains multiple tail calls into POLICY_CALL_MAP at the HOST_EP_ID
slot. The program in this slot is provided by bpf_host.c. During first agent
startup, there are often multiple Pods pending creation due to no CNI being
available. As soon as the agent's local API becomes available, these outstanding
CNI requests have a chance to be accepted by the API handler.
If one such request is serviced before the host datapath controller has a chance
to grab the compilation lock, the endpoint program will compile and attach first.
If the host firewall is enabled, and this new workload endpoint sends a packet
to the host, the host's ingress policy needs to be enforced. However, because
bpf_host hasn't been loaded yet, this policy program is not yet present in
POLICY_CALL_MAP, resulting in a missed tail call.
One potential solution to this problem would be making sure the host datapath
always attaches before workload endpoints do. There's one problem with this
solution: clustermesh requires data from other clusters in order to correctly
populate the local ipcache, and the ipcache currently needs to be populated for
the host endpoint to finish attaching. It obtains this information through
clustermesh-apiserver, typically deployed onto the local node as a regular Pod.
This means workload endpoints must be able to deploy before the host endpoint.
As a stopgap, tolerate these drops and assign them a dedicated meaning rather
than letting them spill into the generic 'missed tail call' bucket. To
stabilize end-to-end tests, we aim to enforce zero missed tail calls across
all CI scenarios, since a missed tail call makes packets silently disappear,
which is nearly impossible to troubleshoot.