endpoint: Place IngressIPs to endpoints map #35143


Merged
merged 2 commits into from
Oct 26, 2024

Conversation

jrajahalme
Member

@jrajahalme jrajahalme commented Oct 1, 2024

Cilium agent manages an Ingress endpoint for the purpose of policy generation for Cilium Ingress. The Ingress endpoint has IPs from the node's CIDR range like any other endpoint, but it has no representation in the bpf datapath, as it is only used for policy enforcement and for IP addressing at the Envoy proxy. Due to this missing bpf datapath representation, IPv6 ND does not currently work for the Ingress IPs, which breaks IPv6 communication with backends.

Fix this by introducing a minimal bpf datapath representation for the Ingress endpoint: Ingress endpoint entries in the lxcmap. The IPv6 ND implementation in the Cilium bpf datapath consults the lxcmap, so these entries enable ND advertisements to be sent when an ND request for the Ingress IPv6 address is received.

A new endpoint flag ENDPOINT_F_ATHOSTNS is added to tell the datapath to deliver the endpoint's traffic to the host stack, while ARP and IPv6 ND are still served by the bpf datapath because the endpoint's IP is allocated from the node's IP allocation CIDR range.
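The flag semantics described above can be sketched as a bitmask check. This is a minimal Go sketch for illustration only: the flag values, the `ENDPOINT_MASK_HOST_DELIVERY` grouping, and the helper names are assumptions here; the authoritative definitions live in Cilium's bpf C headers and `pkg/maps/lxcmap`.

```go
package main

import "fmt"

// Illustrative endpoint flags as carried in the lxcmap value. The
// concrete bit values are assumptions for this sketch.
const (
	// ENDPOINT_F_HOST marks the host endpoint itself.
	endpointFHost uint32 = 1 << 0
	// ENDPOINT_F_ATHOSTNS (this PR) marks an endpoint whose traffic is
	// delivered to the host stack, but whose ARP/IPv6 ND is still
	// answered by the bpf datapath (e.g. the Ingress endpoint).
	endpointFAtHostNS uint32 = 1 << 1
	// Either flag means "deliver traffic to the host networking stack".
	endpointMaskHostDelivery = endpointFHost | endpointFAtHostNS
)

// deliverToHost reports whether traffic to an endpoint with these flags
// should go to the host stack instead of a per-endpoint program.
func deliverToHost(flags uint32) bool {
	return flags&endpointMaskHostDelivery != 0
}

// servedNDInBPF reports whether the bpf datapath should answer ARP/ND for
// this endpoint itself: true for ATHOSTNS endpoints, false for the host
// endpoint, whose neighbor traffic is passed up to the host stack.
func servedNDInBPF(flags uint32) bool {
	return flags&endpointFAtHostNS != 0 && flags&endpointFHost == 0
}

func main() {
	fmt.Println(deliverToHost(endpointFAtHostNS), servedNDInBPF(endpointFAtHostNS))
	fmt.Println(deliverToHost(endpointFHost), servedNDInBPF(endpointFHost))
}
```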

Fixes: #32980

The Ingress endpoint is now included in the lxcmap so that ARP and IPv6 ND work for its IPs.

@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/agent Cilium agent related. area/servicemesh GH issues or PRs regarding servicemesh labels Oct 1, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner October 1, 2024 09:49
@jrajahalme jrajahalme requested a review from squeed October 1, 2024 09:49
@jrajahalme jrajahalme requested review from julianwiedmann and tklauser and removed request for squeed October 1, 2024 09:51
Member

@tklauser tklauser left a comment


LGTM, thanks.

FWIW, failing "Build commits" workflow should be fixed on rebase now that #35141 was merged.

@julianwiedmann
Member

could you explain a bit more what the Ingress IPs are? From my understanding these are somehow used by Envoy, and allocated from the PodCIDR? And there is a special identity (ReservedIdentityIngress) associated with them?

And from a BPF-routing perspective - I'd expect that these IPs reside in Host network namespace, and not in a normal pod (== veth pair) ?

@jrajahalme
Member Author

could you explain a bit more what the Ingress IPs are? From my understanding these are somehow used by Envoy, and allocated from the PodCIDR? And there is a special identity (ReservedIdentityIngress) associated with them?

And from a BPF-routing perspective - I'd expect that these IPs reside in Host network namespace, and not in a normal pod (== veth pair) ?

The simplest way to describe Ingress IPs: IPs (IPv4 and IPv6) allocated from the node's PodCIDR range that are used as the source addresses of traffic forwarded by Cilium Ingress (and GW API) on the node. There is no real/full Cilium endpoint with these addresses, so there is no bpf datapath (bpf_lxc) nor bpf policy maps for them. Cilium agent does compute a policy for the special identity, though, and sends it to Envoy so that Envoy can enforce policy on the traffic forwarded by Cilium Ingress (or GW API).

The ARP table on the destination node would normally be populated by the traffic sent from Cilium Ingress, but if it is flushed for any reason (or, for IPv6, the ND cache entry is missing), the node owning these IPs needs to be able to respond to ARP/ND so the destination node can recover the flushed/missing ARP/ND table entry.

No process in the node is listening on these addresses, so ARP/ND is only relevant for getting reply packets back to the node.

@jrajahalme
Member Author

/test

@julianwiedmann
Member

julianwiedmann commented Oct 1, 2024

No process in the node is listening on these addresses, so ARP/ND is only relevant for getting reply packets back to the node.

So if a local pod receives a connection from such an IngressIP and attempts to send a reply - where should that reply get forwarded to? With the IngressIP now visible to bpf_lxc in the local endpoint map, the relevant code needs to understand whether the reply should be delivered (a) into another pod or (b) to the host network namespace.

@julianwiedmann julianwiedmann added the area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Oct 21, 2024
@jrajahalme
Member Author

No process in the node is listening on these addresses, so ARP/ND is only relevant for getting reply packets back to the node.

So if a local pod would receive a connection by such an IngressIP, and attempts to send a reply - where should that reply get forwarded to? With the IngressIP now visible in the local endpoint map to bpf_lxc, the relevant code needs to understand whether the reply should be delivered (a) into another pod or (b) to the host network namespace.

Are you saying that since the IngressIPs are now in the endpoints map, this lookup now succeeds and finds a non-null ep?:

		/* Lookup IPv4 address, this will return a match if:
		 *  - The destination IP address belongs to a local endpoint
		 *    managed by cilium
		 *  - The destination IP address is an IP address associated with the
		 *    host itself
		 *  - The destination IP address belongs to endpoint itself.
		 */
		ep = __lookup_ip4_endpoint(daddr);

The desired action would be to route to the host. Envoy uses an IP_TRANSPARENT socket to open the connection to the destination, with the IngressIP as the source address. Therefore there is a host networking namespace socket open on the (return) address, and routing the traffic through the host stack would allow it to be delivered to Envoy, due to iptables rules we already have for proxy redirect (reply) traffic.
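For reference, the IP_TRANSPARENT mechanism mentioned here works roughly as below on Linux. This is a minimal, Linux-only sketch, not Envoy's actual code: IP_TRANSPARENT lets a socket bind to a non-local address (such as an IngressIP) as the source of an upstream connection. Setting the option requires CAP_NET_ADMIN, so the sketch tolerates a permission error.

```go
package main

import (
	"fmt"
	"syscall"
)

// tryTransparent creates a TCP socket and attempts to mark it transparent.
// With IP_TRANSPARENT set, the socket may later bind() to an address the
// host does not own locally, which is how a proxy can use an IngressIP as
// the source address of connections it originates.
func tryTransparent() string {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "socket-error"
	}
	defer syscall.Close(fd)

	err = syscall.SetsockoptInt(fd, syscall.SOL_IP, syscall.IP_TRANSPARENT, 1)
	switch err {
	case nil:
		return "set"
	case syscall.EPERM, syscall.EACCES:
		// Expected when running without CAP_NET_ADMIN.
		return "no-privilege"
	}
	return "error"
}

func main() {
	fmt.Println(tryTransparent())
}
```

Reply packets routed into the host stack then match such a transparent socket, which is why delivering IngressIP-destined traffic to the hostns (rather than into a pod) is the correct action here.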

@jrajahalme jrajahalme force-pushed the ingress-ip-endpoint-map branch from e1f3b0d to db5ce1b Compare October 22, 2024 16:30
@jrajahalme
Member Author

/test

@jrajahalme jrajahalme force-pushed the ingress-ip-endpoint-map branch from db5ce1b to 55171aa Compare October 23, 2024 10:56
@julianwiedmann
Member

julianwiedmann commented Oct 23, 2024

The desired action would be to route to host.

Then I think we'll need a bit more work, so the programs that handle a delivery to local endpoint (eg from-container, from-overlay) understand that they should let the packet pass through into hostns, and not attempt a delivery via policy tailcall.

Right now that works by setting the ENDPOINT_F_HOST flag on the endpoint. But I'd actually want us to use two flags - one for "lives in hostns" (to drive the delivery mechanism), a second for "is a node IP" (to decide whether BPF masq is needed).

@jrajahalme
Member Author

/test

@jrajahalme jrajahalme changed the title endpoint: Place endpoints without bpf to endpoints map endpoint: Place IngressIPs to endpoints map Oct 23, 2024
@jrajahalme
Member Author

The desired action would be to route to host.

Then I think we'll need a bit more work, so the programs that handle a delivery to local endpoint (eg from-container, from-overlay) understand that they should let the packet pass through into hostns, and not attempt a delivery via policy tailcall.

Right now that works by setting the ENDPOINT_F_HOST flag on the endpoint. But I'd actually want us to use two flags - one for "lives in hostns" (to drive the delivery mechanism), a second for "is a node IP" (to decide whether BPF masq is needed).

Right, the current ENDPOINT_F_HOST causes the neighbor request to be forwarded to the hostns without sending an ND advertisement, while for this we need the opposite. Currently this seems to work OK for IPv6 ND when the ENDPOINT_F_HOST flag is not set. Are you saying that we'd need another flag that triggers the same behavior as ENDPOINT_F_HOST elsewhere in the bpf datapath (from-container, from-overlay)?

@julianwiedmann
Member

julianwiedmann commented Oct 23, 2024

Right, the current ENDPOINT_F_HOST causes the neighbor request to be forwarded to the hostns without sending the ND advertisement, while for this we need the opposite.

You mean this spot, correct? Makes sense. Note how it currently special-cases the ROUTER_IP (which is also allocated from the PodCIDR and lives in hostns, just like your IngressIP).
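The control flow being discussed can be transcribed into Go for readability (the real code is C in the Cilium bpf sources). Everything here is a sketch: the flag value, the helper name, and the exact branch shape are assumptions; only the behavior described in the thread is modeled, including the ROUTER_IP special case (an address that lives in hostns but is allocated from the PodCIDR, just like the IngressIPs).

```go
package main

import "fmt"

const endpointFHost uint32 = 1 << 0 // illustrative value

type action string

const (
	sendNA     action = "send-NA"      // bpf datapath answers with an ND advertisement
	passToHost action = "pass-to-host" // hand the NS up to the host stack
)

// handleNS sketches neighbor-solicitation handling: targets found in the
// endpoint map are answered by bpf, except the host endpoint, whose
// neighbor handling belongs to the host stack -- with the router IP
// special-cased so bpf still advertises it.
func handleNS(found bool, flags uint32, targetIsRouterIP bool) action {
	if !found {
		// Not a local endpoint IP; let the host stack decide.
		return passToHost
	}
	if flags&endpointFHost != 0 && !targetIsRouterIP {
		return passToHost
	}
	return sendNA
}

func main() {
	fmt.Println(handleNS(true, 0, false))             // IngressIP entry without F_HOST
	fmt.Println(handleNS(true, endpointFHost, false)) // host endpoint
	fmt.Println(handleNS(true, endpointFHost, true))  // router IP
}
```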

Are you saying that we'd need another flag that triggers the same behavior as ENDPOINT_F_HOST elsewhere in the bpf datapath (from-container, from-overlay)?

My suggestion would be

  • s/ENDPOINT_F_HOST/ENDPOINT_F_HOSTNS, and use that whenever the packet needs to be delivered into host netns.
  • add a new ENDPOINT_F_NODE_IP. And use that for the ICMPv6 case you're trying to fix, and the BPF masq case. Potentially more :).

@jrajahalme jrajahalme force-pushed the ingress-ip-endpoint-map branch from 55171aa to ef9d451 Compare October 23, 2024 12:50
@jrajahalme jrajahalme requested a review from a team as a code owner October 23, 2024 12:50
@jrajahalme jrajahalme force-pushed the ingress-ip-endpoint-map branch from ef9d451 to eaef6e4 Compare October 23, 2024 13:24
Member

@julianwiedmann julianwiedmann left a comment


Thank you Jarno, looks good! Some minor remarks below.

I'd suggest to split the change into two patches:

  1. introduce ENDPOINT_F_ATHOSTNS, ENDPOINT_MASK_HOST_DELIVERY etc and motivate why it's needed.
  2. switch the Ingress endpoint to set ENDPOINT_F_ATHOSTNS.

Besides that, could you please update the PR / patch description to reflect the latest changes? Right now it's still very much focused on the IPv6 ND problem ...

Add a new endpoint flag ENDPOINT_F_ATHOSTNS that informs the datapath to
deliver endpoint's traffic to the host stack, but for which ARP and IPv6
ND are still served by the bpf datapath due to the endpoint's IP being
from the node's IP allocation CIDR range.

This can be used for enabling IPv6 ND for Ingress IPs in future.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>

Even endpoints without bpf programs or policy need ARP and IPv6 ND to
work so that traffic can reach them. Place such endpoints in the bpf
endpoints map (aka lxcmap).

This fixes missing IPv6 ND responses for the Ingress IPs.

Fixes: cilium#32980
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the ingress-ip-endpoint-map branch from 1b18e82 to fca8c95 Compare October 25, 2024 19:29
@jrajahalme
Member Author

/test

@jrajahalme jrajahalme enabled auto-merge October 26, 2024 10:02
@jrajahalme jrajahalme added this pull request to the merge queue Oct 26, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 26, 2024
Merged via the queue into cilium:main with commit fd50af6 Oct 26, 2024
64 checks passed
@jrajahalme jrajahalme deleted the ingress-ip-endpoint-map branch October 26, 2024 10:45
@julianwiedmann julianwiedmann added affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 29, 2024
@julianwiedmann
Member

Adding backport label, in particular as this is needed to unblock additional fixes (#35098, #35418).

@joamaki joamaki mentioned this pull request Nov 5, 2024
23 tasks
@joamaki joamaki added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Nov 5, 2024
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Nov 7, 2024
sayboras added a commit that referenced this pull request Dec 23, 2024
This limitation can be removed after #35143.

Relates: #35143
Relates: #24318
Signed-off-by: Tam Mach <tam.mach@cilium.io>
sayboras added a commit that referenced this pull request Jan 9, 2025
This limitation can be removed after #35143.

Relates: #35143
Relates: #24318
Signed-off-by: Tam Mach <tam.mach@cilium.io>
github-merge-queue bot pushed a commit that referenced this pull request Jan 9, 2025
This limitation can be removed after #35143.

Relates: #35143
Relates: #24318
Signed-off-by: Tam Mach <tam.mach@cilium.io>
rastislavs pushed a commit that referenced this pull request Jan 21, 2025
[ upstream commit 900cb76 ]

This limitation can be removed after #35143.

Relates: #35143
Relates: #24318
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2025
[ upstream commit 900cb76 ]

This limitation can be removed after #35143.

Relates: #35143
Relates: #24318
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Successfully merging this pull request may close these issues.

Cilium does not send ICMPv6 NA for reserved:ingress endpoints