
Conversation

dylandreimerink
Member

@dylandreimerink dylandreimerink commented Sep 14, 2023

So far we have been dropping packets meant for services which do not have endpoints/backends. This causes clients to needlessly wait for replies and to retry sending traffic. This PR adds the ability to send back an ICMP or ICMPv6 reply with Destination unreachable (type 3) + Port unreachable (code 3) whenever this happens.

This behavior is controlled via a new --service-no-backend-response flag, which defaults to reject so that we match the expected behavior out of the box. It can also be set to drop to preserve the existing behavior where that is desired.

This new behavior works for both North/South traffic entering a node and East/West traffic responding to a request from a pod within the cluster.
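For context, the reply described here is a standard RFC 792 Destination Unreachable message. A minimal sketch of the ICMPv4 header such a reply carries, plus the Internet checksum computed over it (illustrative plain C with hypothetical names, not the Cilium datapath code):

```c
#include <stdint.h>
#include <stddef.h>

/* The fixed 8-byte ICMPv4 header of a Destination Unreachable reply
 * (RFC 792). The original IP header + first 8 payload bytes follow it. */
struct icmp_dest_unreach {
    uint8_t  type;      /* 3 = Destination Unreachable */
    uint8_t  code;      /* 3 = Port Unreachable */
    uint16_t checksum;  /* Internet checksum over the ICMP message */
    uint32_t unused;    /* must be zero */
};

/* RFC 1071 Internet checksum: one's-complement sum of 16-bit words. */
static uint16_t inet_csum(const void *buf, size_t len)
{
    const uint16_t *p = (const uint16_t *)buf;
    uint32_t sum = 0;

    while (len > 1) {
        sum += *p++;
        len -= 2;
    }
    if (len)
        sum += *(const uint8_t *)p;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A useful property of this checksum: re-summing a message whose checksum field is already filled in yields zero, which is how receivers validate it.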

Fixes: #10002

Respond with ICMP reply for traffic to services without backends

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 14, 2023
@dylandreimerink dylandreimerink added kind/enhancement This would improve or streamline existing functionality. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. labels Sep 14, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 14, 2023
@dylandreimerink
Member Author

/test

@dylandreimerink dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch from 06ff822 to 9ef142b Compare September 14, 2023 10:04
@dylandreimerink
Member Author

/test

@dylandreimerink dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch from 9ef142b to cd1a0bc Compare September 14, 2023 10:51
@dylandreimerink
Member Author

/test

@dylandreimerink dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch from cd1a0bc to 982335a Compare September 14, 2023 11:48
@dylandreimerink
Member Author

/test

@dylandreimerink dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch from 982335a to e333851 Compare September 15, 2023 11:30
@dylandreimerink
Member Author

/ci-l4lb

@dylandreimerink dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch from e333851 to 4dbf77a Compare September 15, 2023 11:55
@dylandreimerink
Member Author

/ci-l4lb

@dylandreimerink dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch 2 times, most recently from 18834c0 to a9a4a6c Compare September 15, 2023 12:22
@dylandreimerink
Member Author

/test

@dylandreimerink dylandreimerink marked this pull request as ready for review September 25, 2023 15:39
@dylandreimerink dylandreimerink requested review from a team as code owners September 25, 2023 15:39
Merged via the queue into main with commit 93f1619 Nov 23, 2023
@dylandreimerink dylandreimerink deleted the pr/dylan/svc-no-backend-response branch November 23, 2023 17:20
@aojea
Contributor

aojea commented Nov 23, 2023

lovely

@aojea
Contributor

aojea commented Nov 23, 2023

This differs slightly from the kube-proxy behavior, which would send back a type 3/code 3 port unreachable. Given it's the whole IP and not a single port, the address unreachable code made more sense.

Amusingly, it is technically possible that within the same Service one port has backends and another does not. This is an artifact of the ability to use named ports for Services, but as Tim says in kubernetes/kubernetes#24875 (comment), that is a pretty esoteric configuration, so I think this is correct.

@sayboras sayboras added affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Nov 24, 2023
@dylandreimerink
Member Author

This differs slightly from the kube-proxy behavior, which would send back a type 3/code 3 port unreachable. Given it's the whole IP and not a single port, the address unreachable code made more sense.

Amusingly, it is technically possible that within the same Service one port has backends and another does not. This is an artifact of the ability to use named ports for Services, but as Tim says in kubernetes/kubernetes#24875 (comment), that is a pretty esoteric configuration, so I think this is correct.

Ah, the description of the PR is out of date. We changed this after review to match exactly what kube-proxy does, to avoid clients having to handle implementation differences in the ICMP code. I will correct the description so it doesn't cause future confusion.

Contributor

@gentoo-root gentoo-root left a comment

I'm too late for the party, but I've got some comments on the TBF implementation 😅

since_last_topup = ktime_get_ns() - value->last_topup;
if (since_last_topup > settings->topup_interval_ns) {
/* Add tokens of every missed interval */
value->tokens += (since_last_topup / settings->topup_interval_ns) *
Contributor

Rounding here could skip intervals. For example, if this function is called at 0 s, 1.5 s and 3 s, it will add 1000 tokens at 0 s, another 1000 tokens at 1.5 s, and yet another 1000 tokens at 3 s, for a total of 3000 tokens. If, however, this function were called at 0 s, 1 s, 2 s and 3 s, it would add 1000 tokens at each call, for a total of 4000 tokens over the same time period.

Member Author

Right. So a better way to keep track would be:

long cur_time = ktime_get_ns();
[...]
long intervals = since_last_topup / settings->topup_interval_ns;
long remainder = since_last_topup % settings->topup_interval_ns;
value->last_topup = cur_time - remainder;

So at 1.5 s, we set last_topup to 1 s instead of 1.5 s. So at the 3 s mark we would add 2000 tokens instead of 1000.

Is that correct?

Contributor

Looks correct to me.
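The agreed-upon fix can be simulated outside the datapath. A sketch in plain C (hypothetical names, a user-space simulation rather than the actual BPF code) that reproduces the 1.5 s / 3 s example from the thread: tokens are credited for every fully elapsed interval, and last_topup advances by whole intervals only, so the fractional remainder keeps accruing instead of being discarded.

```c
#include <stdint.h>

/* Hypothetical simulation of the corrected top-up logic. */
struct bucket {
    uint64_t last_topup; /* ns timestamp of the last whole-interval top-up */
    uint64_t tokens;
};

static void topup(struct bucket *b, uint64_t now_ns,
                  uint64_t interval_ns, uint64_t tokens_per_interval)
{
    uint64_t since = now_ns - b->last_topup;

    if (since < interval_ns)
        return;

    /* Credit one batch of tokens per fully elapsed interval... */
    b->tokens += (since / interval_ns) * tokens_per_interval;

    /* ...and move last_topup forward by whole intervals only, keeping
     * the remainder, so partial intervals are not lost between calls. */
    b->last_topup = now_ns - (since % interval_ns);
}
```

With calls at 1.5 s and 3 s (1 s interval, 1000 tokens per interval), this credits 1000 tokens at the first call and 2000 at the second, matching the 4-call schedule over the same period.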

if (!value) {
new_value.last_topup = ktime_get_ns();
new_value.tokens = settings->tokens_per_topup - 1;
ret = map_update_elem(&RATELIMIT_MAP, key, &new_value, BPF_ANY);
Contributor

This lookup-and-update is racy if called from two CPUs. Do we care?

Member Author

I considered this. It would mean the rate limit isn't 100% accurate, letting more traffic through than the limit. To fix that we would need to use atomics, which are slow; I thought performance was more important in this situation. Perhaps we should note this in the comments in case others wonder.
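For reference, the kind of atomic token accounting the trade-off above alludes to might look like the following sketch (plain C with hypothetical names; BPF programs can use the GCC-style `__sync` builtins for this, though which forms the verifier accepts varies by kernel and architecture):

```c
#include <stdint.h>

/* Atomically reserve one token; returns 0 on success, -1 if the
 * caller is rate-limited. The fetch-and-sub makes concurrent callers
 * on different CPUs serialize on the counter, which is exactly the
 * cost the lookup-and-update approach avoids. */
static int take_token(int64_t *tokens)
{
    int64_t before = __sync_fetch_and_sub(tokens, 1);

    if (before <= 0) {
        /* No token was available; undo our decrement. */
        __sync_fetch_and_add(tokens, 1);
        return -1;
    }
    return 0;
}
```

The non-atomic variant in the PR trades exactness for speed: under contention it may admit slightly more traffic than the configured limit, which is acceptable for a best-effort ICMP rate limiter.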

/* Add tokens of every missed interval */
value->tokens += (since_last_topup / settings->topup_interval_ns) *
settings->tokens_per_topup;
value->last_topup = ktime_get_ns();
Contributor

We should reuse the ktime_get_ns() value fetched above, otherwise it's another source of inexactness, although only a tiny one.

@dylandreimerink
Copy link
Member Author

I'm too late for the party, but I've got some comments on the TBF implementation

I will make a follow-up PR for that, thanks for the feedback!

@joestringer joestringer added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/misc This PR makes changes that have no direct user impact. labels Nov 27, 2023
Member

@julianwiedmann julianwiedmann left a comment

Two comments below - better late than never :).

@@ -597,7 +599,7 @@ enum {
#define DROP_INVALID_EXTHDR -156
#define DROP_FRAG_NOSUPPORT -157
#define DROP_NO_SERVICE -158
#define DROP_UNUSED8 -159 /* unused */
Contributor

@dylandreimerink This drop reason wasn't added in flow.proto and drop.go.

Under normal circumstances, we shouldn't reuse any of these, since renaming a proto field/type causes a backwards-incompatible change. We're lucky that in this case, drop reason 159 is actually missing from the proto as well as from drop.go. 😅

    SERVICE_BACKEND_NOT_FOUND = 158;
    NO_TUNNEL_OR_ENCAPSULATION_ENDPOINT = 160;

In any case, I'm marking all unused ones as deprecated in #29482.

cc @rolinh

Comment on lines +2279 to +2282
# -- Configure what the response should be to traffic for a service without backends.
# "reject" only works on kernels >= 5.10, on lower kernels we fallback to "drop".
# Possible values:
# - reject (default)
Member

@dylandreimerink is the >= 5.10 requirement correct? We later say that the required BPF feature (BPF_ADJ_ROOM_MAC support in bpf_skb_adjust_room()) exists since kernel 5.2.

Member Author

Yes, I believe you are correct. That should be 5.2, I am not sure how 5.10 ended up there.

Member

ack, I glanced at a 5.4 sysdump and it looks all golden there. No error message from the probe.

I'll open a PR to just remove this on v1.16+, given all supported kernels should be providing the option.

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Feb 20, 2025
The LB path sends an ICMP error message when the accessed service has no
backend available (see cilium#28157).

In the case of a nodeport service, validate that this ICMP error message
can exit the node (and traverse the `to-netdev` program) when
BPF masquerading is enabled.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Feb 20, 2025

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Feb 21, 2025

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Feb 26, 2025

github-merge-queue bot pushed a commit that referenced this pull request Feb 26, 2025

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request May 29, 2025
[ upstream commit bbb2a87 ]
Labels
affects/v1.13 This issue affects v1.13 branch
affects/v1.14 This issue affects v1.14 branch
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
area/loadbalancing Impacts load-balancing and Kubernetes service implementations
kind/enhancement This would improve or streamline existing functionality.
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failures in [sig-network] Network should set TCP CLOSE_WAIT timeout and Services should be rejected when no endpoints exist