Skip to content

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Nov 25, 2024

Endpoint regeneration can fail for many reasons, only one of which is Envoy NACKing the listener configuration. Previously it was possible for a proxy port (e.g., DNS proxy port) to be released (and subsequently reallocated) if endpoint regeneration adding a new redirect (such as endpoint restore on agent restart) happened to fail for any reason. Handle proxy port ACK/NACK directly with an AddListener callback and only reset the proxy port if there are no redirects on the port and a NACK was received. Since envoy.AddListener is not used for DNS proxy redirects, it follows that DNS proxy port is now never reset. The AddListener callback allows to keep even newly configured redirects if the regeneration failure reason is not a NACK from Envoy. This helps reduce churn on Envoy listener configuration and the associated datapath proxy port redirection when endpoint regeneration fails for other reasons than Envoy Listener NACK.

To make the AddListener callback work we had to add a separate mutex for the proxy ports, so that a proxy mutex can be held without holding the proxy ports mutex while calling AddListener with a callback that then needs to take the proxy ports mutex. This also happens to solve a latent deadlock bug in CEC parser that also needs to take the mutex for acknowledging a proxy port. In this case we had different locking order for the Proxy.mutex and xdsServer.mutex in two different code paths:

  • CEC parser via upsertListener()
    1. takes xdsServer.mutex
    2. takes Proxy.mutex in the callback via AckProxyPort()
  • Endpoint redirect creation:
    1. takes Proxy.mutex to create a new redirect
    2. calls AddListener() which locks xdsServer.mutex

With the new Proxy.proxyPorts.mutex the above is modified so that AckProxyPort() no longer needs to take Proxy.mutex (but Proxy.proxyPorts.mutex instead), and on the latter path AddListener() is called without holding the new Proxy.proxyPorts.mutex which can then be taken in the callback function as needed.

Fixes: #36063

DNS proxy port is no longer released when endpoint with a DNS policy fails to regenerate successfully.
A potential deadlock between CEC/CCEC parser and endpoint policy update is removed.

@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. area/fqdn Affects the FQDN policies feature needs-backport/1.14 needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Nov 25, 2024
@jrajahalme jrajahalme requested review from a team as code owners November 25, 2024 08:35
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 25, 2024
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme added release-note/bug This PR fixes an issue in a previous release of Cilium. and removed kind/bug This is a bug in the Cilium logic. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 25, 2024
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme force-pushed the proxy-ack-ports-from-add-listener-callback branch from 3658d6e to 2f4bc6d Compare November 26, 2024 06:03
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme added dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. and removed needs-backport/1.14 labels Nov 26, 2024
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme force-pushed the proxy-ack-ports-from-add-listener-callback branch from 93dddbb to b54f3f8 Compare November 26, 2024 13:30
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme added the release-blocker/1.16 This issue will prevent the release of the next version of Cilium. label Dec 1, 2024
Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for @cilium/cli

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 4, 2024
@jrajahalme jrajahalme added this pull request to the merge queue Dec 4, 2024
@jrajahalme jrajahalme added the backport/author The backport will be carried out by the author of the PR. label Dec 4, 2024
Merged via the queue into cilium:main with commit 161266a Dec 4, 2024
64 checks passed
@jrajahalme jrajahalme deleted the proxy-ack-ports-from-add-listener-callback branch December 4, 2024 10:30
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 7, 2024
Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgement is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgement status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 7, 2024
Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 7, 2024
Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 7, 2024
Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Dec 9, 2024
Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: #36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 9, 2024
[ upstream commit d79ab31 ]

Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme mentioned this pull request Dec 9, 2024
5 tasks
@jrajahalme jrajahalme added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Dec 9, 2024
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 9, 2024
[ upstream commit d79ab31 ]

Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Dec 9, 2024
[ upstream commit d79ab31 ]

Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: cilium#36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme added affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch labels Dec 10, 2024
github-merge-queue bot pushed a commit that referenced this pull request Dec 10, 2024
[ upstream commit d79ab31 ]

Take a proxy port reference for new redirects as soon as they are
created, rather than when an acknowledgment is received. This fixes the
case where a delayed release of the proxy port takes effect after a proxy
redirect has been created but not acknowledged yet, signified by these
error logs seen in the CI:

  msg="Created new proxy instance" ProxyPort=19659 id="2424:egress:TCP:64:" l7parser=http ...
  msg="Delayed release of proxy port 19659" id=cilium-http-egress ...
  msg="Datapath proxy redirection cannot be enabled, L7 proxy may be bypassed" error="ackProxyPort: zero port on cilium-http-egress not allowed" id="2424:egress:TCP:4096:" l7parser=http ...

This change makes it necessary to manage the acknowledgment status of a
proxy port separately from the 'nRedirects' field. Add a new
'acknowledged' field for this purpose.

Fixes: #36142

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Dec 10, 2024
Comment on lines +130 to +133
os.Mkdir(bootstrapDir, 0777)

// Make sure sockets dir exists
os.Mkdir(GetSocketDir(config.runDir), 0777)
Copy link
Member

@joestringer joestringer Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrajahalme should this use defaults.RuntimePathRights instead (0775, ie avoid write access for every user on the node, given that this code runs in production environments)?

I briefly looked at proposing the change myself, but the commit seems to suggest that some magic is necessary in order to run the corresponding tests. Not sure how I would validate that this doesn't immediately break the test again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/fqdn Affects the FQDN policies feature backport/author The backport will be carried out by the author of the PR. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.16 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

TPROXY iptables-rules for cilium-dns-egress redirect to the wrong port
5 participants