Skip to content

Conversation

gentoo-root
Copy link
Contributor

snat_v4_new_mapping creates a RevSNAT entry, then a SNAT entry for new connections. However, we also have a fallback mechanism in snat_v4_rev_nat_handle_mapping, which restores the SNAT entry if it was evicted by LRU, but the original RevSNAT entry still exists.

There is a hypothesis that the above two functions may race with each other, and the following is possible:

  1. snat_v4_new_mapping creates a RevSNAT entry.
  2. snat_v4_rev_nat_handle_mapping finds the RevSNAT entry and "restores" the SNAT entry (which has never existed yet).
  3. snat_v4_new_mapping goes on to create the SNAT entry, corresponding to the RevSNAT entry, created previously.
  4. snat_v4_new_mapping fails because the SNAT entry already exists, created by snat_v4_rev_nat_handle_mapping.
  5. snat_v4_new_mapping then removes the RevSNAT entry, trying to revert everything.

The right outcome in this case would be to keep the SNAT entry, whoever created it first, so it makes sense not to fail on -EEXIST.

It also makes the fallback mechanism more robust, because there is a time gap between __snat_lookup and __snat_create.

Ref: #35304
Ref: #37747
Fixes: #38466

Fix "No mapping for NAT masquerade" flakes in the CI, make NAT LRU fallbacks more robust.

@gentoo-root gentoo-root added release-note/bug This PR fixes an issue in a previous release of Cilium. release-note/ci This PR makes changes to the CI. labels Aug 6, 2025
@gentoo-root
Copy link
Contributor Author

/ci-e2e-upgrade

@gentoo-root
Copy link
Contributor Author

/test

@gentoo-root gentoo-root force-pushed the pr/max/fix-nat-mapping branch from fbea40c to 143b7d7 Compare August 7, 2025 10:09
snat_v4_new_mapping creates a RevSNAT entry, then a SNAT entry for new
connections. However, we also have a fallback mechanism in
snat_v4_rev_nat_handle_mapping, which restores the SNAT entry if it was
evicted by LRU, but the original RevSNAT entry still exists.

There is a hypothesis that the above two functions may race with each
other, and the following is possible:

1. snat_v4_new_mapping creates a RevSNAT entry.
2. snat_v4_rev_nat_handle_mapping finds the RevSNAT entry and "restores"
   the SNAT entry (which has never existed yet).
3. snat_v4_new_mapping goes on to create the SNAT entry, corresponding
   to the RevSNAT entry, created previously.
4. snat_v4_new_mapping fails because the SNAT entry already exists,
   created by snat_v4_rev_nat_handle_mapping.
5. snat_v4_new_mapping then removes the RevSNAT entry, trying to revert
   everything.

The right outcome in this case would be to keep the SNAT entry, whoever
created it first, so it makes sense not to fail on -EEXIST.

It also makes the fallback mechanism more robust, because there is a
time gap between __snat_lookup and __snat_create.

Ref: #35304
Ref: #37747
Fixes: #38466
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Fixes: ac5198f ("bpf: Add unit tests for SNAT port allocation")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
@gentoo-root gentoo-root force-pushed the pr/max/fix-nat-mapping branch from 143b7d7 to c19f24f Compare August 7, 2025 10:12
@gentoo-root
Copy link
Contributor Author

/test

@gentoo-root gentoo-root marked this pull request as ready for review August 7, 2025 17:08
@gentoo-root gentoo-root requested a review from a team as a code owner August 7, 2025 17:08
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 12, 2025
@aanm aanm added this pull request to the merge queue Aug 13, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2025
@gentoo-root
Copy link
Contributor Author

@aanm, could you re-add it to the merge queue please? Looks like there was a flake when building images in the merge queue CI.

@aanm aanm added this pull request to the merge queue Aug 14, 2025
Merged via the queue into main with commit 73a0d5a Aug 14, 2025
359 of 360 checks passed
@aanm aanm deleted the pr/max/fix-nat-mapping branch August 14, 2025 05:38
@gentoo-root gentoo-root added needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 20, 2025
@gyutaeb
Copy link
Contributor

gyutaeb commented Aug 21, 2025

@gentoo-root Thanks for the follow-up on #37747. This makes the SNAT LRU fallback path more robust and addresses the race around __snat_create. Much appreciated!

@pippolo84 pippolo84 mentioned this pull request Aug 25, 2025
17 tasks
@pippolo84 pippolo84 added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 25, 2025
@pippolo84 pippolo84 mentioned this pull request Aug 25, 2025
7 tasks
@pippolo84 pippolo84 added the backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. label Aug 25, 2025
@pippolo84 pippolo84 removed the needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch label Aug 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Cilium E2E Upgrade (ci-e2e-upgrade): Found unexpected packet drops: "No mapping for NAT masquerade"
5 participants