Skip to content

Conversation

jrajahalme
Copy link
Member

Proxy port restored from a file upon Cilium restart port must be considered as "configured". Otherwise the redirect will be removed and added again on next policy update, which can lead to Envoy proxy flapping listening on the proxy port.

The broken behavior is seen in the (info level) logs from messages like these within milliseconds of each other:

  msg="Created new proxy instance" ProxyPort=14537 id="1853:egress:TCP:443:"
  msg="Envoy: Deleting listener" listener="cilium-http-egress:14537"
  msg="Envoy: Upserting new listener" listener="cilium-http-egress:14537"
  msg="Created new proxy instance" ProxyPort=14537 id="1853:egress:TCP:443:"

Fixes: #36142

Proxy port restored from a file upon Cilium restart port must be
considered as "configured". Otherwise the redirect will be removed and
added again on next policy update, which can lead to Envoy proxy flapping
listening on the proxy port.

The broken behavior is seen in the (info level) logs from messages like these within milliseconds of each other:

  msg="Created new proxy instance" ProxyPort=14537 id="1853:egress:TCP:443:"
  msg="Envoy: Deleting listener" listener="cilium-http-egress:14537"
  msg="Envoy: Upserting new listener" listener="cilium-http-egress:14537"
  msg="Created new proxy instance" ProxyPort=14537 id="1853:egress:TCP:443:"

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. release-blocker/1.17 This issue will prevent the release of the next version of Cilium. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 12, 2025
@jrajahalme jrajahalme requested a review from a team as a code owner January 12, 2025 10:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 12, 2025
@jrajahalme
Copy link
Member Author

@aanm The broken behavior can be seen in the logs of https://github.com/cilium/cilium/actions/runs/12705832756/job/35417626081, which failed like this:

📋 Test Report [cilium-test-1]
❌ 1/4 tests failed (1/13 actions), 104 tests skipped, 0 scenarios skipped:
[cilium-test-1] 1 tests failed
Test [seq-client-egress-l7-tls-deny-without-headers]:
  ❌ seq-client-egress-l7-tls-deny-without-headers/pod-to-world-with-tls-intercept/https-to-one.one.one.one.-2: cilium-test-1/client3-795488bf5-vzqcj (10.244.1.193) -> one.one.one.one.-https (one.one.one.one.:443)

IMO it is possible that the above CI failure would be caused by the flapping listener, even though cilium connectivity test waits for the new policy to have take effect before running the test, as the endpoint's policy map indicates proxy redirections, but cilium-envoy on that node shows no incoming redirected connections (and the logs cover the time period of the test):

Endpoint ID: 1853
Path: /sys/fs/bpf/tc/globals/cilium_policy_v2_01853

POLICY   DIRECTION   IDENTITY   PORT/PROTO   PROXY PORT   AUTH TYPE   BYTES   PACKETS   PREFIX   
...
Allow    Egress      16777219   443/TCP      14537        disabled    148     2         24       
Allow    Egress      16777220   53/UDP       41409        disabled    0       0         24       
Allow    Egress      16777220   443/TCP      14537        disabled    188     2         24       
Allow    Egress      16777220   53/TCP       41409        disabled    0       0         24       

Indeed the proxy port 14537 is not present in the ss -t output for the node, while the DNS port 41409 is.

It looks like the Envoy listener warming on the affected node never completes:
grep warm on the affected node (cilium-envoy-t9vxt):

[2025-01-10 08:39:15.702] add warming listener: name=cilium-http-egress:14537, hash=18438132730555934532, tag=3, address=127.0.0.1:14537,[::1]:14537

same on a non-affected node (cilium-envoy-26vs8 ):

[2025-01-10 08:39:15.755] add warming listener: name=cilium-http-egress:13450, hash=14523648647585666348, tag=3, address=127.0.0.1:13450,[::1]:13450
[2025-01-10 08:39:15.953] warm complete. updating active listener: name=cilium-http-egress:13450, hash=14523648647585666348, tag=3, address=127.0.0.1:13450,[::1]:13450

Comparing the listener init manager logs between these two nodes, it looks like the affected Envoy instance never receives the TLS Context client secret; this is the likely reason for the listener warming not completing.

Looks like this could be an issue in our Envoy side SDS initialization logic, possibly triggered by the listener reconfig, or then something completely different.

@jrajahalme jrajahalme added the release-note/misc This PR makes changes that have no direct user impact. label Jan 12, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 12, 2025
@jrajahalme
Copy link
Member Author

/test

@aanm
Copy link
Member

aanm commented Jan 12, 2025

@jrajahalme I'm confused, it looks the CI is failing for the same reason, no?
https://github.com/cilium/cilium/actions/runs/12733182262/job/35489235280

@jrajahalme
Copy link
Member Author

@jrajahalme I'm confused, it looks the CI is failing for the same reason, no? https://github.com/cilium/cilium/actions/runs/12733182262/job/35489235280

That is a "test after downgrade" so would not have this fix :-)

@jrajahalme jrajahalme added the area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. label Jan 12, 2025
@aanm aanm enabled auto-merge January 13, 2025 04:17
@aanm aanm added this pull request to the merge queue Jan 13, 2025
Merged via the queue into cilium:main with commit 121a920 Jan 13, 2025
70 checks passed
@gandro
Copy link
Member

gandro commented Jan 13, 2025

@jrajahalme Doesn't this not also need a v1.16 backport? The referenced PR was backported to v1.16

@gandro gandro added needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch labels Jan 13, 2025
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
@rastislavs rastislavs mentioned this pull request Jan 22, 2025
19 tasks
@rastislavs rastislavs added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jan 22, 2025
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-blocker/1.17 This issue will prevent the release of the next version of Cilium. release-note/misc This PR makes changes that have no direct user impact.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants