
Conversation

@atykhyy (Contributor) commented Jul 3, 2025

When the L7 DNS proxy for nodes is enabled and kube-apiserver is addressed by a FQDN, the Cilium agent deadlocks on restart: k8sClientset's start hook, which checks the connection to kube-apiserver and verifies its version, runs before the daemon bootstraps the DNS proxy with the restored endpoints. This commit adds a daemon command-line option that configures the k8sClientset dialer to bypass the DNS proxy and the host firewall, in the same way the DNS proxy's own requests to remote DNS servers already do. Since Cilium trusts both DNS and kube-apiserver responses, exempting the daemon's connections to kube-apiserver from the host firewall does not degrade security.

Note: both the DNS resolution of kube-apiserver's address and the actual connection to it are exempted. If only the DNS resolution were exempted, the DNS proxy would not create correct IP-based rules for the resolved address, and if kube-apiserver's IP address changed while the Cilium daemon was restarting, kube-apiserver would be unreachable and the daemon would fail on startup.
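For context, "configures the dialer to bypass DNS proxy and host firewall" boils down to stamping the client's sockets with Cilium's proxy egress packet mark, so the datapath treats them like the DNS proxy's own upstream traffic. A minimal Go sketch of that decoration, assuming the SO_MARK approach; the 0x0B00 value matches the fwmark 0xb00/0xf00 routing rule quoted later in this thread, and all names here are illustrative rather than Cilium's actual API:

package k8sbypass

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// Proxy egress mark; matches the "fwmark 0xb00/0xf00" routing rule shown
// further down in this thread. In Cilium the constant lives with the other
// datapath magic marks.
const magicMarkEgress = 0x0B00

// markedDialer returns a net.Dialer whose sockets carry the egress mark,
// so the datapath handles them like the DNS proxy's own upstream
// connections: no DNS proxy redirect, no host firewall policy.
// Setting SO_MARK requires CAP_NET_ADMIN, which the agent normally runs with.
func markedDialer() *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_MARK, magicMarkEgress)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
}

// dialK8s is the shape of function one would wire into the k8s client's
// transport as its DialContext.
func dialK8s(ctx context.Context, network, addr string) (net.Conn, error) {
	return markedDialer().DialContext(ctx, network, addr)
}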

Fixes: #35433

@atykhyy atykhyy requested review from a team as code owners July 3, 2025 13:11
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jul 3, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 3, 2025
@atykhyy atykhyy force-pushed the enable-k8s-host-firewall-bypass branch from 0124bcf to a4d46ed on July 3, 2025 13:56
@asauber (Member) left a comment

The justification for this implementation in #35433 looks reasonable to me. Approving the configuration implementation. One comment about readability.

@atykhyy atykhyy requested a review from a team as a code owner July 4, 2025 05:56
@atykhyy atykhyy requested a review from bimmlerd July 4, 2025 05:56
@bimmlerd (Member) left a comment

I'm OK with this approach, just some implementation nits

cc @joamaki and @joestringer as you've interacted with the issue this PR fixes

@pchaigno pchaigno added release-note/bug This PR fixes an issue in a previous release of Cilium. area/host-firewall Impacts the host firewall or the host endpoint. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 7, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 7, 2025
@atykhyy (Contributor, author) commented Jul 10, 2025

Raising this question to top level to solicit reviewers' opinions:
I made this bypass an opt-in out of caution (i.e. so as not to break things elsewhere in Cilium that I don't understand that well). However, the bypass must be turned on whenever the following three conditions are met:

  1. kube-apiserver address is an FQDN
  2. host firewall is enabled
  3. a Cilium clusterwide network policy targeting nodes has a L7 DNS rule (toy example below)
- egress:
  - toCIDR:
    - 168.63.129.16/32
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: '*'
  nodeSelector: {}

If these conditions are met and the bypass is not enabled, the Cilium agent will break on restart (e.g. when it is upgraded), leaving the whole cluster anywhere from dangerously unstable to unusable. My concern was that the bypass might have unintended bad effects in circumstances I am unaware of, but perhaps that concern is misplaced: the DNS proxy has been decorating its dialer for remote DNS requests in exactly the way this PR decorates the k8s client dialer for years and years, without causing problems. If so, the bypass ought to be enabled whenever the host firewall is enabled, with no option to turn it off. An intermediate variant is to make the bypass opt-out rather than opt-in, so it can be turned off cheaply if it does happen to break something; a sketch of the startup check that implies follows below.
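A Go sketch of that startup guard under an opt-out default; the Config fields and option names are hypothetical stand-ins, not Cilium's actual options. Conditions 1 and 2 are checkable from configuration, while condition 3 only materializes once policies are loaded, hence a warning rather than a hard error:

package k8sbypass

import (
	"log/slog"
	"net"
)

// Config stands in for the relevant daemon options (names hypothetical).
type Config struct {
	K8sAPIServerHost            string // host portion of the kube-apiserver URL
	EnableHostFirewall          bool
	K8sClientBypassHostFirewall bool // the bypass discussed in this PR
}

// isFQDN reports whether host is a name rather than a literal IP address.
func isFQDN(host string) bool {
	return net.ParseIP(host) == nil
}

// warnIfBypassNeeded flags the combination that can deadlock the agent on
// restart: FQDN apiserver + host firewall, with the bypass switched off.
func warnIfBypassNeeded(cfg Config, log *slog.Logger) {
	if cfg.EnableHostFirewall && isFQDN(cfg.K8sAPIServerHost) && !cfg.K8sClientBypassHostFirewall {
		log.Warn("kube-apiserver is a FQDN and the host firewall is enabled; " +
			"a clusterwide policy applying an L7 DNS rule to nodes can deadlock " +
			"the agent on restart unless the k8s client bypass is enabled")
	}
}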

@dylandreimerink (Member) commented:

/test

@atykhyy atykhyy force-pushed the enable-k8s-host-firewall-bypass branch from fafd8be to 21cee05 on July 14, 2025 10:56
@atykhyy (Contributor, author) commented Jul 14, 2025

Rebased the changes on top of main to hopefully fix the conformance tests, and made the bypass option enabled by default.

@joestringer (Member) commented:

/test

@atykhyy atykhyy requested a review from a team as a code owner July 17, 2025 10:20
@atykhyy atykhyy requested a review from squeed July 17, 2025 10:20
@atykhyy atykhyy force-pushed the enable-k8s-host-firewall-bypass branch from b7f92e1 to 373021e on July 17, 2025 11:06
@joestringer (Member) commented:

@atykhyy no worries, I appreciate your patience with the review :) I'll trigger the tests & queue for automerge.

@joestringer (Member) commented:

/test

@joestringer joestringer enabled auto-merge July 18, 2025 19:46
@joestringer joestringer added this pull request to the merge queue Jul 18, 2025
Merged via the queue into cilium:main with commit cf3889a Jul 18, 2025
68 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 18, 2025
@YutaroHayakawa YutaroHayakawa mentioned this pull request Jul 22, 2025
@YutaroHayakawa YutaroHayakawa added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 22, 2025
marseel added a commit that referenced this pull request Jul 24, 2025
As we have seen suspicious connectivity problems with the k8s apiserver
since merging #40346, let's disable it for the time being to see if that
resolves the problems in CI.

Related: #40682

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
@marseel (Contributor) commented Jul 24, 2025

Hi all,
since merging this PR, we have observed various connectivity problems and delays with the k8s apiserver (#40682), which started right after this PR was merged, even in CI runs without hostFirewall etc.

For now, I would like to propose skipping the backport to v1.18 (#40665 (review)), as we also saw connectivity problems to the k8s apiserver during the backport.
Let's also disable it on main (#40691) to confirm whether this PR is the root cause or something else is.
cc @atykhyy @joestringer

@atykhyy (Contributor, author) commented Jul 24, 2025

If the connection to kube-apiserver was getting dropped, as discussed in the CI flake issue, that should be visible in the cilium-agent logs, shouldn't it? Unfortunately I don't have permission to download the log artifacts to check (assuming the logs are there).

@joamaki (Contributor) commented Jul 24, 2025

> If the connection to kube-apiserver was getting dropped, as discussed in the CI flake issue, that should be visible in the cilium-agent logs, shouldn't it? Unfortunately I don't have permission to download the log artifacts to check (assuming the logs are there).

We only log anything in some cases. If the connection fails on Watch calls, nothing is logged (things just get delayed). We have also seen #40687, where an Update failed due to a broken connection. And we've seen "Heartbeat timed out, restarting client connections".

@marseel (Contributor) commented Jul 24, 2025

> Unfortunately I don't have permission to download the log artifacts to check (assuming the logs are there).

I've included one of the sysdumps in the issue; previously I did not notice that it had not been uploaded due to the size limit. You should be able to download it from the issue now.

As Jussi says, there are no clear logs that would indicate a broken watch. This is mostly a guess based on timing: the failures started occurring in CI right when this PR was merged, and the area this PR changes overlaps with the most likely area causing the issues.

For now, I just want to disable it for a couple of days to check whether that resolves the issue, which would confirm that it was caused by this PR. If not, I will revert my PR that disables it.

@atykhyy (Contributor, author) commented Jul 24, 2025

By all means. But why would tagging the k8s-clientset's outgoing packets with MagicMarkEgress make the connection flaky? Looking through that dump, the connection gets established without any delay worth speaking of:

2025-07-23T05:16:38.204463822Z time=2025-07-23T05:16:38.20428954Z level=debug msg="Skipped reading configuration file" error="Config File \"cilium\" Not Found in \"[/root]\""
# snip
2025-07-23T05:16:38.674517739Z time=2025-07-23T05:16:38.674446687Z level=info source=/go/src/github.com/cilium/cilium/pkg/k8s/client/cell.go:332 msg="Establishing connection to apiserver" module=agent.infra.k8s»
2025-07-23T05:16:38.681590372Z time=2025-07-23T05:16:38.681485483Z level=info source=/go/src/github.com/cilium/cilium/pkg/k8s/client/cell.go:359 msg="Connected to apiserver" module=agent.infra.k8s-client

The only thing that looks suspicious to me is that the iptables rules disable conntrack when they see this mark. Could that be it? But then wouldn't the DNS proxy, which uses the same mark, have the same sort of problem (just manifesting in a different place)?
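(For anyone wanting to look at the rules in question on a node, assuming an iptables-mode datapath, something like the following shows both the conntrack bypass and any mark-matching rules:)

# raw-table rules that turn off connection tracking
iptables-save -t raw | grep -iE 'notrack|-j CT'
# any rules matching on the 0xb00 egress mark
iptables-save | grep -i 0xb00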

@joamaki (Contributor) commented Jul 24, 2025

> By all means. But why would tagging the k8s-clientset's outgoing packets with MagicMarkEgress make the connection flaky? Looking through that dump, the connection gets established without any delay worth speaking of:
>
> 2025-07-23T05:16:38.204463822Z time=2025-07-23T05:16:38.20428954Z level=debug msg="Skipped reading configuration file" error="Config File \"cilium\" Not Found in \"[/root]\""
> # snip
> 2025-07-23T05:16:38.674517739Z time=2025-07-23T05:16:38.674446687Z level=info source=/go/src/github.com/cilium/cilium/pkg/k8s/client/cell.go:332 msg="Establishing connection to apiserver" module=agent.infra.k8s»
> 2025-07-23T05:16:38.681590372Z time=2025-07-23T05:16:38.681485483Z level=info source=/go/src/github.com/cilium/cilium/pkg/k8s/client/cell.go:359 msg="Connected to apiserver" module=agent.infra.k8s-client
>
> The only thing that looks suspicious to me is that the iptables rules disable conntrack when they see this mark. Could that be it? But then wouldn't the DNS proxy, which uses the same mark, have the same sort of problem (just manifesting in a different place)?

The initial connection attempt does succeed in all the failed tests, but the connection later breaks (a couple of minutes in). Could it be that installing those iptables rules breaks existing connections carrying that mark?
Would this be easy to test?

The DNS proxy probably wouldn't be affected as it'd mostly be processing requests after the iptables rules have been set up.
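One cheap way to test it, sketched in Go on top of the markedDialer sketch near the top of this thread (add "context", "net", and "time" to its imports; this is an experiment outline, not Cilium code): hold a marked TCP connection to the apiserver open with aggressive TCP keep-alives, re-apply the iptables/routing rules out of band, and watch whether the blocked read errors out.

// probeMarkedConn opens a marked connection and blocks on a read. The
// server sends nothing unsolicited on a raw TCP connection, so the read
// only returns once the flow dies: if the rule change breaks the marked
// connection, unanswered keep-alive probes surface here as an error after
// a few probe intervals (the probe count is an OS default).
func probeMarkedConn(addr string) error {
	conn, err := markedDialer().DialContext(context.Background(), "tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	tcp := conn.(*net.TCPConn)
	tcp.SetKeepAlive(true)
	tcp.SetKeepAlivePeriod(5 * time.Second) // probe much faster than the 30s defaults

	buf := make([]byte, 1)
	_, err = conn.Read(buf) // blocks until the connection breaks (or the server closes it)
	return err
}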

github-merge-queue bot pushed a commit that referenced this pull request Jul 24, 2025
@atykhyy (Contributor, author) commented Jul 24, 2025

Besides those iptables conntrack rules, there are also these routing rules:

10:	from all fwmark 0xb00/0xf00 lookup 2005 proto kernel
# table 2005:
default via 10.244.0.14 dev cilium_host proto kernel mtu 1407 
10.244.0.14 dev cilium_host proto kernel scope link 

I don't know enough to say whether creating these iptables or routing rules would break an existing TCP connection. If the k8s client logged when the connection broke, one could check whether the timing matches what the datapath module is doing. Also, if rule changes caused 'old' packets to be dropped, wouldn't it take the TCP stack some time to detect that the connection is no longer alive and signal an error on the k8s client's socket? k8s-client-connection-keep-alive and k8s-client-connection-timeout are both set to 30s by default (and in that CI run); added together, that is one minute. Perhaps that is the source of the delay causing this near-startup flakiness? Would reducing them, so that a broken connection is detected and recycled sooner, help?
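(If that timing theory holds, tightening the two flags named above should shrink the detection window; the values here are a guess for illustration, not a recommendation:)

--k8s-client-connection-keep-alive=10s
--k8s-client-connection-timeout=10s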

As an aside, @jrife commented in the CI issue that this flakiness is not a security problem. If there is no way to get rid of it, I'm perfectly fine with leaving the option added by this PR as opt-in, only to be used in the specific Cilium mode of operation where it is required (which is not yet documented anyway).

@joestringer joestringer removed the backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. label Jul 24, 2025
@joestringer joestringer added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/bug This PR fixes an issue in a previous release of Cilium. labels Aug 5, 2025
rabelmervin pushed a commit to rabelmervin/cilium that referenced this pull request Aug 18, 2025
Labels
area/host-firewall Impacts the host firewall or the host endpoint. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Successfully merging this pull request may close this issue:

Agent deadlocks on startup when L7 DNS proxy for nodes is enabled and kube-apiserver is a FQDN