Skip to content

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Jul 30, 2024

Currently, temporary connection disruption can occur on agent restart when Cilium is configured in native routing mode, and WireGuard encryption is enabled, because the list of AllowedIPs gets recreated from scratch upon the reception of the node event for each given remote node, possibly removing entries for valid endpoints that have not yet been discovered at that point through the CiliumEndpoint CRD or the corresponding kvstore representation. This issue, instead, does not affect the current implementation in tunnel mode, as in that case we encrypt encapsulated traffic, which always has source and destination addresses corresponding to Node Internal IPs, that are immediately added as Allowed IPs.

Let's prevent this issue restoring the list of Allowed IPs for each peer from the WireGuard state after agent restart and preserving them until ipcache synchronization has been completed. At that point, we do a final GC pass to clean up possible stale entries for pods that got removed while the given agent was down. Special logic is introduced
to prevent a possible flipping behavior in case upon restore an allowed IP is associated with a given peer, but gets associated with a different one later on. Indeed, WireGuard enforces that any allowed IP is associated to at most a single peer.

Fixes: #31979

Fix possible connection disruption on agent restart with WireGuard + native routing

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch feature/wireguard Relates to Cilium's Wireguard feature affects/v1.15 This issue affects v1.15 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch affects/v1.16 This issue affects v1.16 branch labels Jul 30, 2024
@giorio94
Copy link
Member Author

/test

@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review August 21, 2024 07:57
@giorio94 giorio94 requested review from a team as code owners August 21, 2024 07:57
@giorio94 giorio94 requested review from brb, asauber and doniacld August 21, 2024 07:57
@giorio94 giorio94 requested a review from asauber August 22, 2024 07:08
@giorio94 giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 8736161 to 85e5114 Compare August 22, 2024 16:41
@giorio94
Copy link
Member Author

Rebased and fix a conflict with #34373

@giorio94
Copy link
Member Author

/test

2 similar comments
@auriaave
Copy link
Contributor

/test

@giorio94
Copy link
Member Author

/test

Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing stands out to me here that would hold this back

Copy link
Contributor

@doniacld doniacld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for sig-agent

@brb brb requested a review from gandro August 28, 2024 13:17
@brb brb removed their request for review August 28, 2024 13:17
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find and great fix, thanks so much!

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Sep 3, 2024
@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 3, 2024
@giorio94 giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 85e5114 to 73df40a Compare September 3, 2024 11:47
Currently, temporary connection disruption can occur on agent restart
when Cilium is configured in native routing mode, and WireGuard
encryption is enabled, because the list of AllowedIPs gets recreated
from scratch upon the reception of the node event for each given remote
node, possibly removing entries for valid endpoints that have not yet
been discovered at that point through the CiliumEndpoint CRD or the
corresponding kvstore representation. This issue, instead, does not
affect the current implementation in tunnel mode, as in that case we
encrypt encapsulated traffic, which always has source and destination
addresses corresponding to Node Internal IPs, that are immediately
added as Allowed IPs.

Let's prevent this issue restoring the list of Allowed IPs for each peer
from the WireGuard state after agent restart and preserving them until
ipcache synchronization has been completed. At that point, we do a
final GC pass to clean up possible stale entries for pods that got
removed while the given agent was down. Special logic is introduced
to prevent a possible flipping behavior in case upon restore an allowed
IP is associated with a given peer, but gets associated with a different
one later on. Indeed, WireGuard enforces that any allowed IP is
associated to at most a single peer.

Fixes: 31979
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 73df40a to 5ba874d Compare September 3, 2024 11:48
@giorio94
Copy link
Member Author

giorio94 commented Sep 3, 2024

/test

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 3, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue Sep 4, 2024
Merged via the queue into main with commit 5e5be1c Sep 4, 2024
289 checks passed
@julianwiedmann julianwiedmann deleted the mio/wireguard-restore-allowed-ips branch September 4, 2024 14:40
jrife added a commit to jrife/cilium that referenced this pull request Sep 5, 2024
Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see ciliumGH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

In order to preserve the behavior introduced by ciliumGH-34095 ("wireguard:
restore allowedIPs upon agent restart") this commit adds
syncPeersWithDevice which removes any stale peers or allowed IPs on
initialization. This commit mostly sidesteps the issue that ciliumGH-34095 set
out to solve since we only ever explitly remove allowed IPs, meaning we
can do away with a lot of the accounting previously done in
restoredPeers and restoredAllowedIPs.

Fixes: cilium#33159

Signed-off-by: Jordan Rife <jrife@google.com>
jrife added a commit to jrife/cilium that referenced this pull request Sep 5, 2024
Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see ciliumGH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

This commit mostly sidesteps the issue that ciliumGH-34095 ("wireguard:
restore allowedIPs upong agent restart") set out to solve since we only
ever explicitly remove allowed IPs, meaning we can do away with a lot of
the accounting previously done by restoredPeers and restoredAllowedIPs.
However, we retain the behavior in RestoreFinished() that removes stale
peers and allowed IPs.

Fixes: cilium#33159

Signed-off-by: Jordan Rife <jrife@google.com>
jrife added a commit to jrife/cilium that referenced this pull request Sep 5, 2024
Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see ciliumGH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

This commit mostly sidesteps the issue that ciliumGH-34095 ("wireguard:
restore allowedIPs upong agent restart") set out to solve since we only
ever explicitly remove allowed IPs, meaning we can do away with a lot of
the accounting previously done by restoredPeers and restoredAllowedIPs.
However, we retain the behavior in RestoreFinished() that removes stale
peers and allowed IPs.

Fixes: cilium#33159

Signed-off-by: Jordan Rife <jrife@google.com>
@nbusseneau nbusseneau mentioned this pull request Sep 11, 2024
13 tasks
@nbusseneau nbusseneau added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Sep 11, 2024
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Sep 13, 2024
jrife added a commit to jrife/cilium that referenced this pull request Sep 19, 2024
Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see ciliumGH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

This commit mostly sidesteps the issue that ciliumGH-34095 ("wireguard:
restore allowedIPs upong agent restart") set out to solve since we only
ever explicitly remove allowed IPs, meaning we can do away with a lot of
the accounting previously done by restoredPeers and restoredAllowedIPs.
However, we retain the behavior in RestoreFinished() that removes stale
peers and allowed IPs.

Fixes: cilium#33159

Signed-off-by: Jordan Rife <jrife@google.com>
github-merge-queue bot pushed a commit that referenced this pull request Sep 26, 2024
Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see GH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

This commit mostly sidesteps the issue that GH-34095 ("wireguard:
restore allowedIPs upong agent restart") set out to solve since we only
ever explicitly remove allowed IPs, meaning we can do away with a lot of
the accounting previously done by restoredPeers and restoredAllowedIPs.
However, we retain the behavior in RestoreFinished() that removes stale
peers and allowed IPs.

Fixes: #33159

Signed-off-by: Jordan Rife <jrife@google.com>
jibi pushed a commit that referenced this pull request Sep 30, 2024
[ upstream commit 3ebfa4d ]

[ backporter's note: minor conflict in agent_test.go due to missing iter
 packeage in go 1.22 ]

Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see GH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

This commit mostly sidesteps the issue that GH-34095 ("wireguard:
restore allowedIPs upong agent restart") set out to solve since we only
ever explicitly remove allowed IPs, meaning we can do away with a lot of
the accounting previously done by restoredPeers and restoredAllowedIPs.
However, we retain the behavior in RestoreFinished() that removes stale
peers and allowed IPs.

Fixes: #33159

Signed-off-by: Jordan Rife <jrife@google.com>
github-merge-queue bot pushed a commit that referenced this pull request Oct 1, 2024
[ upstream commit 3ebfa4d ]

[ backporter's note: minor conflict in agent_test.go due to missing iter
 packeage in go 1.22 ]

Recent flakes in the "Network performance GKE" tests, in particular
those that use wireguard, uncovered an issue where updates to a peer's
allowed IPs can cause connectivity issues. In the best case this may
just result in dropped packets while in the worst case may lead to
failed calls to sendto/sendmsg (see GH-33159 for more detail on some of
the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the
wireguard driver to completely clear the set of allowed IPs for a peer
before the set is rebuilt. This leaves a gap where packets sent or
received at that time will appear to not have a peer associated with
them. The current implementation sets ReplaceAllowedIPs every time
updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a
peer rather than using ReplaceAllowedIPs. We use peerConfig to track any
pending IP inserts and removals and split these into two different sets
of updates in updatePeerByConfig. The first call to ConfigureDevice()
adds any IPs that are pending insertion. Next, any IPs that are pending
removal are moved to the "dummy peer" before removing the dummy peer
altogether, the net effect being that the IP is removed from the
original peer's list of allowed IPs.

This commit mostly sidesteps the issue that GH-34095 ("wireguard:
restore allowedIPs upong agent restart") set out to solve since we only
ever explicitly remove allowed IPs, meaning we can do away with a lot of
the accounting previously done by restoredPeers and restoredAllowedIPs.
However, we retain the behavior in RestoreFinished() that removes stale
peers and allowed IPs.

Fixes: #33159

Signed-off-by: Jordan Rife <jrife@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch affects/v1.16 This issue affects v1.16 branch backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. feature/wireguard Relates to Cilium's Wireguard feature kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible connectivity disruption on agent restart with WireGuard + native routing
7 participants