daemon: Start DNS proxy before k8s with restored rules #12964
Conversation
Force-pushed from 9075612 to a129374.
test-me-please
Force-pushed from a129374 to a40e957.
test-me-please
I stuck to my codeowners for this review, but we also reviewed the original submission in detail, so I don't think it's such a big deal.
There's just one note below, which accounts for a change in the master branch since this PR was proposed. We'll need to coordinate on a follow-up for that condition to prevent a regression.
// GetProxyPort returns the port that the installed TPROXY rule for the
// named proxy currently redirects to, or 0 if no such rule exists. This
// allows a restarting agent to reuse the previous run's proxy port.
func (m *IptablesManager) GetProxyPort(name string) uint16 {
	prog := "iptables"
	if !option.Config.EnableIPv4 {
		prog = "ip6tables"
	}

	// List the Cilium pre-mangle chain to find the proxy's TPROXY rule.
	res, err := runProgCombinedOutput(prog, []string{"-t", "mangle", "-n", "-L", ciliumPreMangleChain}, false)
	if err != nil {
		return 0
	}

	// Extract the redirect port from the rule matching the proxy name,
	// e.g. "... TPROXY redirect 0.0.0.0:41211 mark ...".
	re := regexp.MustCompile(name + ".*TPROXY redirect 0.0.0.0:([1-9][0-9]*) mark")
	str := re.FindString(string(res))
	portStr := re.ReplaceAllString(str, "$1")
	portInt, err := strconv.Atoi(portStr)
	if err != nil {
		log.WithError(err).Debugf("Port number cannot be parsed: %s", portStr)
		return 0
	}
	return uint16(portInt)
}
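For illustration, the parsing above can be exercised standalone against a made-up iptables output line; the sample line, proxy name, and port number here are invented, not taken from a real rule:

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

func main() {
	// Invented sample of an `iptables -t mangle -n -L` output line.
	sample := "cilium-dns-egress proxy TPROXY redirect 0.0.0.0:41211 mark 0x200/0xffffffff"
	re := regexp.MustCompile("cilium-dns-egress" + ".*TPROXY redirect 0.0.0.0:([1-9][0-9]*) mark")
	match := re.FindString(sample)              // the whole matching span
	portStr := re.ReplaceAllString(match, "$1") // just the captured port digits
	port, _ := strconv.Atoi(portStr)
	fmt.Println(port) // prints 41211
}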
Hmm, this should be good for v1.8, but for latest with BPF tproxy it will not work.
Probably makes sense to merge the rest of this PR as-is and have a separate commit to address the tproxy case. Not sure whether the latest code will pass with this broken (maybe it'll pass if we don't have an actual runtime test for this condition).
@joestringer Any hints on how to do this with the new BPF tproxy? Is there a header file left over by the previous run that we could parse instead?
AFAIK the port is only placed into BPF policy maps for each endpoint, which isn't the most reliable way to store/restore proxy ports for all protocols.
I think we could consider writing each proxy port to one of the config files, purely for restore purposes.
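The config-file idea above could look roughly like this minimal sketch. The file name, path, and function names are assumptions for illustration, not Cilium's actual implementation:

package main

import (
	"encoding/json"
	"os"
)

// Assumed location for the persisted proxy ports; not an actual Cilium path.
const proxyPortsPath = "/var/run/cilium/state/proxy_ports.json"

// saveProxyPorts persists the current proxy name -> port allocations so a
// restarted agent can restore them without parsing iptables rules.
func saveProxyPorts(ports map[string]uint16) error {
	data, err := json.Marshal(ports)
	if err != nil {
		return err
	}
	return os.WriteFile(proxyPortsPath, data, 0o600)
}

// restoreProxyPorts reads the allocations back; callers should fall back
// to dynamic allocation if the file is missing or unparsable.
func restoreProxyPorts() (map[string]uint16, error) {
	data, err := os.ReadFile(proxyPortsPath)
	if err != nil {
		return nil, err
	}
	ports := make(map[string]uint16)
	if err := json.Unmarshal(data, &ports); err != nil {
		return nil, err
	}
	return ports, nil
}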
I've filed #13208 to follow up.
CI tests were green but I'm not sure whether …
retest-net-next
Is there any way we could test this in the e2e test? E.g., by testing connectivity earlier in "Restart Cilium validate that FQDN is still working"? Not sure how we would detect that the proxy should be ready and that it's okay to test connectivity now.
Start proxy support earlier in the daemon bootstrap, notably before any k8s setup. Fetch old endpoints earlier so that the DNS history is available before k8s is set up, and move DNS proxy initialization earlier in the bootstrap. Reuse the DNS proxy port from the previous run on restart unless overridden by an explicit Cilium agent option. These changes allow the DNS proxy to start serving requests as soon as the toFQDN policy is received from k8s and avoid any service disruption previously possible due to endpoints being regenerated before the DNS proxy was started. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Store current DNS rules with the Endpoint and use them in the DNS proxy during the initial regeneration of the restored endpoints. The DNS proxy starts with restored DNS rules based on allowed IP addresses. These rules are removed for each endpoint as soon as its first regeneration completes. Such restored rules should allow DNS requests to be served, but for new DNS resolutions to be added to the Endpoint's policy, the restored endpoints must still complete their first regeneration. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
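The shape of the restored rules implied here might look like the following sketch; the type and method names are assumptions, not Cilium's actual types:

package main

// DNSRules maps a proxy port to the destination IPs that were allowed
// before the restart.
type DNSRules map[uint16][]string

// Allowed reports whether a DNS request to ip on port may be served under
// the restored rules, before the endpoint's first regeneration completes.
func (r DNSRules) Allowed(port uint16, ip string) bool {
	for _, allowed := range r[port] {
		if allowed == ip {
			return true
		}
	}
	return false
}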
Update DNSRules, if any, before writing headers to capture potentially changed allowed destination IPs. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Remove restored DNS rules after a successful regeneration, and also at endpoint delete to cover endpoints that were never regenerated. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Normally the number of DNS proxy rules should be very small. To guard against pathological cases, limit the number of IPs processed to 1000 per port. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
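As a rough sketch of that cap (the constant and function names are illustrative, not from the actual commit):

package main

// Assumed limit, matching the 1000-IP cap described above.
const maxIPsPerPort = 1000

// capIPs truncates a pathologically large allowed-IP list for one proxy
// port so restored-rule processing stays bounded.
func capIPs(ips []string) []string {
	if len(ips) > maxIPsPerPort {
		return ips[:maxIPsPerPort]
	}
	return ips
}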
Use restored Endpoints during Cilium restart when Endpoints are not yet available. Do not error out if the destination IP cannot be found in the ipcache, but default to the WORLD destination security identity instead. This allows IP-based restored rules to be processed before the ipcache is fully updated. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
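A hedged sketch of that fallback; lookupIPCache is a stand-in for the real ipcache API, and the identity type and constant only mirror Cilium's reserved WORLD identity for illustration:

package main

import "net"

type NumericIdentity uint32

// ReservedIdentityWorld stands in for Cilium's reserved WORLD identity.
const ReservedIdentityWorld NumericIdentity = 2

// lookupIPCache is a placeholder for the real ipcache lookup.
func lookupIPCache(ip net.IP) (NumericIdentity, bool) { return 0, false }

// destinationIdentity resolves the security identity for a destination IP,
// defaulting to WORLD when the ipcache has no entry yet (e.g. during
// restart, before the ipcache is fully repopulated).
func destinationIdentity(ip net.IP) NumericIdentity {
	if id, ok := lookupIPCache(ip); ok {
		return id
	}
	return ReservedIdentityWorld
}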
"Proxy stats not found when updating" warnigns are currently issued if stats updates are received for a proxy redirect that can not be found. There are two common scenarios where this can happen as part of normal operation: 1. A policy change removed a proxy redirect, and stats updates from requests that were redirected to the proxy before the datapath redirect entry was removed are received. 2. DNS proxy issues stats for requests that have been forwarded on the basis of a restored DNS policy while the Endpoint policy has not yet been computed. Demote this log message to debug level to avoid these false warnings. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Force-pushed from a40e957 to 7c41d5c.
removeRestoredRules() requires locking; better reflect that in the function name. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Re-run connectivity test while Cilium is still restarting. This should succeed as the same DNS names were used in a connectivity test before the restart. Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Force-pushed from 266aede to 782cc71.
test-me-please
Note: This is a forward-port of #12718 and #12731, which were developed on the v1.7 branch.
Split the endpoint restoration process into more stages so that the saved endpoint DNS history can be used earlier in the bootstrap to start the DNS proxy. This helps ensure that the first regeneration of an Endpoint does not (temporarily) remove the policy keys needed to keep old traffic flowing.
TPROXY rules for the DNS proxy can only be installed a couple of seconds later in the bootstrap process. To mitigate this and allow the DNS proxy to start serving traffic sooner, it now reads the old TPROXY rule, if available, and tries to reuse the same proxy port number as before the restart. This allows the old datapath config to keep feeding traffic to the new proxy instance even before the datapath config can be rewritten.
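The port selection described here might look roughly like this sketch; the function names are illustrative, with restoredPort standing in for something like the GetProxyPort parsing shown earlier in the diff:

package main

// dnsProxyPort sketches the selection order: an explicit agent option wins,
// then the port parsed from the previous run's TPROXY rule, and only then
// a fresh allocation.
func dnsProxyPort(configured uint16, restoredPort, allocate func() uint16) uint16 {
	if configured != 0 {
		return configured // explicit Cilium agent option wins
	}
	if p := restoredPort(); p != 0 {
		return p // reuse the pre-restart port so old redirects keep working
	}
	return allocate()
}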
The DNS proxy starts with restored DNS rules based on allowed IP addresses. These rules are removed for each endpoint as soon as its first regeneration completes. Such restored rules should allow DNS requests to be served, but for new DNS resolutions to be added to the Endpoint's policy, the restored endpoints must still complete their first regeneration.
Fixes: #11004