Respond with ICMP reply for traffic to services without backends #28157
Conversation
lovely
A funny thing is that it is technically possible for one Port in a Service to have backends while another does not, an artifact of Services supporting named ports. But as Tim says in kubernetes/kubernetes#24875 (comment), that is a pretty esoteric configuration, so I think this is correct.
Ah, the description of the PR is out of date; we changed this after review to match exactly what kube-proxy does, to avoid potential differences in how clients handle the ICMP reply. I will correct the description so it doesn't cause future confusion.
I'm too late for the party, but I've got some comments on the TBF implementation 😅
```c
since_last_topup = ktime_get_ns() - value->last_topup;
if (since_last_topup > settings->topup_interval_ns) {
	/* Add tokens of every missed interval */
	value->tokens += (since_last_topup / settings->topup_interval_ns) *
```
Rounding here could skip intervals. For example, if this function is called at 0 s, 1.5 s and 3 s, it will add 1000 tokens at 0 s, another 1000 tokens at 1.5 s, and yet another 1000 tokens at 3 s, for 3000 tokens in total. If, however, this function were called at 0 s, 1 s, 2 s and 3 s, it would add 1000 tokens at each call, for 4000 tokens in total over the same time period.
Right. So a better way to keep track would be:

```c
long cur_time = ktime_get_ns();
[...]
long intervals = since_last_topup / settings->topup_interval_ns;
long remainder = since_last_topup % settings->topup_interval_ns;
value->last_topup = cur_time - remainder;
```

So at 1.5 s, we set `last_topup` to 1 s instead of 1.5 s, and at the 3 s mark we would add 2000 tokens instead of 1000.
Is that correct?
Looks correct to me.
```c
if (!value) {
	new_value.last_topup = ktime_get_ns();
	new_value.tokens = settings->tokens_per_topup - 1;
	ret = map_update_elem(&RATELIMIT_MAP, key, &new_value, BPF_ANY);
```
This lookup-and-update is racy if called from two CPUs. Do we care?
I considered this. It would mean the rate limit isn't 100% accurate, letting more traffic through than the limit. Fixing that would require atomics, which are slow, and I thought performance was more important in this situation. Perhaps we should note this in the comments in case others wonder.
```c
/* Add tokens of every missed interval */
value->tokens += (since_last_topup / settings->topup_interval_ns) *
		 settings->tokens_per_topup;
value->last_topup = ktime_get_ns();
```
We should reuse the ktime_get_ns() value fetched above, otherwise it's another source of inexactness, although only a tiny one.
I will make a follow-up PR for that, thanks for the feedback!
Two comments below - better late than never :).
```c
@@ -597,7 +599,7 @@ enum {
#define DROP_INVALID_EXTHDR -156
#define DROP_FRAG_NOSUPPORT -157
#define DROP_NO_SERVICE -158
#define DROP_UNUSED8 -159 /* unused */
```
@dylandreimerink This drop reason wasn't added in `flow.proto` and `drop.go`.

Under normal circumstances, we shouldn't reuse any of these, since renaming a proto field/type causes a backwards-incompatible change. We're lucky that in this case, drop reason 159 is actually missing from the proto as well as from drop.go. 😅

```proto
SERVICE_BACKEND_NOT_FOUND = 158;
NO_TUNNEL_OR_ENCAPSULATION_ENDPOINT = 160;
```
In any case, I'm marking all unused ones as deprecated in #29482.
cc @rolinh
The desired behavior was implemented in #28157, and this workflow should always be running in an environment where the feature is enabled by default. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
```yaml
# -- Configure what the response should be to traffic for a service without backends.
# "reject" only works on kernels >= 5.10, on lower kernels we fallback to "drop".
# Possible values:
# - reject (default)
```
@dylandreimerink is the >= 5.10 requirement correct? We later say that the required BPF feature (`BPF_ADJ_ROOM_MAC` support in `bpf_skb_adjust_room()`) exists since kernel 5.2.
Yes, I believe you are correct. That should be 5.2, I am not sure how 5.10 ended up there.
ack, I glanced at a 5.4 sysdump and it looks all golden there. No error message from the probe.
I'll open a PR to just remove this on v1.16+, given all supported kernels should be providing the option.
The LB path sends an ICMP error message when the accessed service has no backend available (see cilium#28157). In the case of a nodeport service, validate that this ICMP error message can exit the node (and traverse the `to-netdev` program) when BPF masquerading is enabled. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit bbb2a87 ] The LB path sends an ICMP error message when the accessed service has no backend available (see cilium#28157). In the case of a nodeport service, validate that this ICMP error message can exit the node (and traverse the `to-netdev` program) when BPF masquerading is enabled. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
So far we have been dropping packets meant for services which do not have endpoints/backends. This causes clients to needlessly wait for replies and to retry sending traffic. This PR adds the ability to send back an ICMP or ICMPv6 reply with Destination unreachable (type 3) + Port unreachable (code 3) whenever this happens.
This behavior is controllable via a new `--service-no-backend-response` flag, which defaults to `reject` so we match expected behavior by default. It can also be set to `drop` to preserve the existing behavior in case that is desired.

This new behavior works for both North/South traffic entering a node and East/West traffic responding to a request from a pod within the cluster.
Fixes: #10002