Skip to content

Conversation

ti-mo
Copy link
Contributor

@ti-mo ti-mo commented Nov 29, 2023

Repurpose unused drop reasons since 61fb508
("bpf: Rename unused drop defines to DROP_UNUSED*") for signaling the host
endpoint's policy program was attempted to be executed before it was loaded.

bpf_lxc.c contains multiple tail calls into POLICY_CALL_MAP at the HOST_EP_ID
slot. The program in this slot is provided by bpf_host.c. During first agent
startup, there are often multiple Pods pending creation due to no CNI being
available. As soon as the agent's local API becomes available, these outstanding
CNI requests have a chance to be accepted by the API handler.

If one such request is serviced before the host datapath controller has a chance
to grab the compilation lock, the endpoint program will compile and attach first.
If the host firewall is enabled, and this new workload endpoint sends a packet
to the host, the host's ingress policy needs to be enforced. However, because
bpf_host hasn't been loaded yet, this policy program is not yet present in
POLICY_CALL_MAP, resulting in a missed tail call.

One potential solution to this problem would be making sure the host datapath
always attaches before workload endpoints do. There's one problem with this
solution: clustermesh requires data from other clusters in order to correctly
populate the local ipcache, and the ipcache currently needs to be populated for
the host endpoint to finish attaching. It obtains this information through
clustermesh-apiserver, typically deployed onto the local node as a regular Pod.
This means workload endpoints must be able to deploy before the host endpoint.

As a stop gap, tolerate these kinds of drops and assign them a specific meaning,
without letting them spill over into the generic 'missed tail call' bucket. To
stabilize end-to-end tests, we're aiming to enforce zero dropped tail calls in
all CI scenarios, since it leads to packets that mysteriously go missing,
introducing chaos that's impossible to troubleshoot.

Add specific drop reason for missing tail calls if the host datapath is not ready yet

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 29, 2023
@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 29, 2023

/test

@ti-mo ti-mo force-pushed the tb/wait-host-datapath branch from ca3fe28 to 4bf367f Compare November 29, 2023 13:19
@ti-mo ti-mo added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Nov 29, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 29, 2023
@ti-mo ti-mo added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/agent Cilium agent related. labels Nov 29, 2023
@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 29, 2023

/test

1 similar comment
@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 29, 2023

/test

@ti-mo ti-mo force-pushed the tb/wait-host-datapath branch from 32812b0 to bc36f7d Compare November 29, 2023 15:16
@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 29, 2023

/test

@ti-mo ti-mo force-pushed the tb/wait-host-datapath branch from bc36f7d to 0895c5f Compare November 30, 2023 09:36
@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 30, 2023

/test

@ti-mo ti-mo force-pushed the tb/wait-host-datapath branch from 0895c5f to 9c60f4f Compare November 30, 2023 15:29
@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 30, 2023

/test-e2e-upgrade

@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 30, 2023

/test

@ti-mo
Copy link
Contributor Author

ti-mo commented Nov 30, 2023

/ci-e2e-upgrade

@ti-mo ti-mo force-pushed the tb/wait-host-datapath branch 2 times, most recently from 50c61ab to ad8d210 Compare December 4, 2023 15:06
@ti-mo ti-mo changed the title [WIP] Avoid missed tail calls into handle_lxc_traffic loader: wait for host datapath to be ready before loading endpoint programs Dec 4, 2023
@ti-mo ti-mo marked this pull request as ready for review December 4, 2023 15:10
@ti-mo ti-mo requested review from a team as code owners December 4, 2023 15:10
@ti-mo
Copy link
Contributor Author

ti-mo commented Dec 4, 2023

/test

Copy link
Contributor

@brlbil brlbil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI part LGTM.

@ti-mo ti-mo added this pull request to the merge queue Jan 9, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jan 9, 2024
@ti-mo ti-mo added this pull request to the merge queue Jan 10, 2024
Merged via the queue into main with commit a15a463 Jan 10, 2024
@ti-mo ti-mo deleted the tb/wait-host-datapath branch January 10, 2024 16:05
@ti-mo ti-mo added the backport/author The backport will be carried out by the author of the PR. label Jan 11, 2024
@ti-mo ti-mo added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Jan 11, 2024
@ti-mo ti-mo added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Jan 11, 2024
@gentoo-root gentoo-root added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jan 17, 2024
@ti-mo ti-mo added needs-backport/1.13 backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jan 18, 2024
@github-actions github-actions bot added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 labels Jan 23, 2024
michi-covalent pushed a commit that referenced this pull request Jul 16, 2024
This excludes the drop reason introduced in #29482.
It occurs when Cilium is first installed on a node, the host firewall is
enabled, a workload endpoint gets created before the host endpoint, and the
workload endpoint in question tries to talk to the host.

Preventing these drops would require redesigning parts of the datapath,
particularly the clustermesh bootstrap procedure. This is not feasible
at the moment, and maybe it's not the right thing to do.

Signed-off-by: Timo Beckers <timo@isovalent.com>
michi-covalent pushed a commit that referenced this pull request Aug 5, 2024
[ cherry-picked from cilium/cilium-cli repository ]

This excludes the drop reason introduced in #29482.
It occurs when Cilium is first installed on a node, the host firewall is
enabled, a workload endpoint gets created before the host endpoint, and the
workload endpoint in question tries to talk to the host.

Preventing these drops would require redesigning parts of the datapath,
particularly the clustermesh bootstrap procedure. This is not feasible
at the moment, and maybe it's not the right thing to do.

Signed-off-by: Timo Beckers <timo@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Aug 16, 2024
[ cherry-picked from cilium/cilium-cli repository ]

This excludes the drop reason introduced in #29482.
It occurs when Cilium is first installed on a node, the host firewall is
enabled, a workload endpoint gets created before the host endpoint, and the
workload endpoint in question tries to talk to the host.

Preventing these drops would require redesigning parts of the datapath,
particularly the clustermesh bootstrap procedure. This is not feasible
at the moment, and maybe it's not the right thing to do.

Signed-off-by: Timo Beckers <timo@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent Cilium agent related. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. backport/author The backport will be carried out by the author of the PR. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

9 participants