wireguard: restore allowedIPs upon agent restart #34095

giorio94 · 2024-07-30T16:43:24Z

Currently, a temporary connection disruption can occur on agent restart when Cilium is configured in native routing mode with WireGuard encryption enabled. The list of AllowedIPs is recreated from scratch upon reception of the node event for each remote node, which may remove entries for valid endpoints that have not yet been discovered through the CiliumEndpoint CRD or the corresponding kvstore representation. This issue does not affect the current implementation in tunnel mode: there we encrypt encapsulated traffic, whose source and destination addresses always correspond to Node Internal IPs, and those are immediately added as Allowed IPs.

Let's prevent this issue by restoring the list of Allowed IPs for each peer from the WireGuard state after agent restart, and preserving them until ipcache synchronization has completed. At that point, we do a final GC pass to clean up possible stale entries for pods that were removed while the agent was down. Special logic is introduced to prevent possible flipping behavior when an allowed IP is associated with a given peer upon restore, but later gets associated with a different one. Indeed, WireGuard enforces that any allowed IP is associated with at most a single peer.

Fixes: #31979

Fix possible connection disruption on agent restart with WireGuard + native routing

pkg/wireguard/agent/agent.go

giorio94 · 2024-07-30T17:11:11Z

/test

giorio94 · 2024-08-20T16:41:29Z

/test

giorio94 · 2024-08-22T16:42:36Z

Rebased and fixed a conflict with #34373

giorio94 · 2024-08-22T16:44:13Z

/test

auriaave · 2024-08-22T17:04:01Z

/test

giorio94 · 2024-08-22T17:13:37Z

/test

asauber

nothing stands out to me here that would hold this back

doniacld

LGTM for sig-agent

gandro

Great find and great fix, thanks so much!

Currently, a temporary connection disruption can occur on agent restart when Cilium is configured in native routing mode with WireGuard encryption enabled. The list of AllowedIPs is recreated from scratch upon reception of the node event for each remote node, which may remove entries for valid endpoints that have not yet been discovered through the CiliumEndpoint CRD or the corresponding kvstore representation. This issue does not affect the current implementation in tunnel mode: there we encrypt encapsulated traffic, whose source and destination addresses always correspond to Node Internal IPs, and those are immediately added as Allowed IPs.

Let's prevent this issue by restoring the list of Allowed IPs for each peer from the WireGuard state after agent restart, and preserving them until ipcache synchronization has completed. At that point, we do a final GC pass to clean up possible stale entries for pods that were removed while the agent was down. Special logic is introduced to prevent possible flipping behavior when an allowed IP is associated with a given peer upon restore, but later gets associated with a different one. Indeed, WireGuard enforces that any allowed IP is associated with at most a single peer.

Fixes: #31979

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2024-09-03T11:54:36Z

/test

Recent flakes in the "Network performance GKE" tests, in particular those that use wireguard, uncovered an issue where updates to a peer's allowed IPs can cause connectivity issues. In the best case this may just result in dropped packets, while in the worst case it may lead to failed calls to sendto/sendmsg (see ciliumGH-33159 for more detail on some of the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the wireguard driver to completely clear the set of allowed IPs for a peer before the set is rebuilt. This leaves a gap where packets sent or received at that time will appear to not have a peer associated with them. The current implementation sets ReplaceAllowedIPs every time updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a peer rather than using ReplaceAllowedIPs. We use peerConfig to track any pending IP inserts and removals and split these into two different sets of updates in updatePeerByConfig. The first call to ConfigureDevice() adds any IPs that are pending insertion. Next, any IPs that are pending removal are moved to the "dummy peer" before removing the dummy peer altogether, the net effect being that the IP is removed from the original peer's list of allowed IPs.

In order to preserve the behavior introduced by ciliumGH-34095 ("wireguard: restore allowedIPs upon agent restart"), this commit adds syncPeersWithDevice, which removes any stale peers or allowed IPs on initialization. This commit mostly sidesteps the issue that ciliumGH-34095 set out to solve since we only ever explicitly remove allowed IPs, meaning we can do away with a lot of the accounting previously done in restoredPeers and restoredAllowedIPs.

Fixes: cilium#33159

Signed-off-by: Jordan Rife <jrife@google.com>

Recent flakes in the "Network performance GKE" tests, in particular those that use wireguard, uncovered an issue where updates to a peer's allowed IPs can cause connectivity issues. In the best case this may just result in dropped packets, while in the worst case it may lead to failed calls to sendto/sendmsg (see ciliumGH-33159 for more detail on some of the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the wireguard driver to completely clear the set of allowed IPs for a peer before the set is rebuilt. This leaves a gap where packets sent or received at that time will appear to not have a peer associated with them. The current implementation sets ReplaceAllowedIPs every time updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a peer rather than using ReplaceAllowedIPs. We use peerConfig to track any pending IP inserts and removals and split these into two different sets of updates in updatePeerByConfig. The first call to ConfigureDevice() adds any IPs that are pending insertion. Next, any IPs that are pending removal are moved to the "dummy peer" before removing the dummy peer altogether, the net effect being that the IP is removed from the original peer's list of allowed IPs.

This commit mostly sidesteps the issue that ciliumGH-34095 ("wireguard: restore allowedIPs upon agent restart") set out to solve since we only ever explicitly remove allowed IPs, meaning we can do away with a lot of the accounting previously done by restoredPeers and restoredAllowedIPs. However, we retain the behavior in RestoreFinished() that removes stale peers and allowed IPs.

Fixes: cilium#33159

Signed-off-by: Jordan Rife <jrife@google.com>

[ upstream commit 3ebfa4d ]

[ backporter's note: minor conflict in agent_test.go due to missing iter package in go 1.22 ]

Recent flakes in the "Network performance GKE" tests, in particular those that use wireguard, uncovered an issue where updates to a peer's allowed IPs can cause connectivity issues. In the best case this may just result in dropped packets, while in the worst case it may lead to failed calls to sendto/sendmsg (see GH-33159 for more detail on some of the symptoms observed while running netperf).

This issue happens when ReplaceAllowedIPs is set, as it causes the wireguard driver to completely clear the set of allowed IPs for a peer before the set is rebuilt. This leaves a gap where packets sent or received at that time will appear to not have a peer associated with them. The current implementation sets ReplaceAllowedIPs every time updatePeerByConfig is called, but we can avoid ever needing to use it.

This commit uses a "dummy peer" as a way to remove allowed IPs from a peer rather than using ReplaceAllowedIPs. We use peerConfig to track any pending IP inserts and removals and split these into two different sets of updates in updatePeerByConfig. The first call to ConfigureDevice() adds any IPs that are pending insertion. Next, any IPs that are pending removal are moved to the "dummy peer" before removing the dummy peer altogether, the net effect being that the IP is removed from the original peer's list of allowed IPs.

This commit mostly sidesteps the issue that GH-34095 ("wireguard: restore allowedIPs upon agent restart") set out to solve since we only ever explicitly remove allowed IPs, meaning we can do away with a lot of the accounting previously done by restoredPeers and restoredAllowedIPs. However, we retain the behavior in RestoreFinished() that removes stale peers and allowed IPs.

Fixes: #33159

Signed-off-by: Jordan Rife <jrife@google.com>

giorio94 commented Jul 30, 2024

julianwiedmann mentioned this pull request Aug 7, 2024

Possible connectivity disruption on agent restart with WireGuard + native routing #31979

Closed

2 tasks

giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 4e8ac12 to 8736161 Compare August 20, 2024 16:34

giorio94 marked this pull request as ready for review August 21, 2024 07:57

giorio94 requested review from a team as code owners August 21, 2024 07:57

giorio94 requested review from brb, asauber and doniacld August 21, 2024 07:57

asauber reviewed Aug 21, 2024

giorio94 requested a review from asauber August 22, 2024 07:08

giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 8736161 to 85e5114 Compare August 22, 2024 16:41

asauber approved these changes Aug 22, 2024

doniacld approved these changes Aug 27, 2024

brb requested a review from gandro August 28, 2024 13:17

brb removed their request for review August 28, 2024 13:17

gandro approved these changes Sep 3, 2024

maintainer-s-little-helper bot added the ready-to-merge label Sep 3, 2024

gandro removed the ready-to-merge label Sep 3, 2024

giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 85e5114 to 73df40a Compare September 3, 2024 11:47

giorio94 force-pushed the mio/wireguard-restore-allowed-ips branch from 73df40a to 5ba874d Compare September 3, 2024 11:48

giorio94 added the ready-to-merge label Sep 3, 2024

julianwiedmann enabled auto-merge September 4, 2024 14:31

julianwiedmann added this pull request to the merge queue Sep 4, 2024

Merged via the queue into main with commit 5e5be1c Sep 4, 2024
289 checks passed

julianwiedmann deleted the mio/wireguard-restore-allowed-ips branch September 4, 2024 14:40

jrife mentioned this pull request Sep 5, 2024

wireguard: Use dummy peer for allowed IP removal #34612

Merged

nbusseneau mentioned this pull request Sep 11, 2024

v1.16 Backports 2024-09-11 #34831

Merged

13 tasks

nbusseneau added the backport-pending/1.16 label and removed the needs-backport/1.16 label Sep 11, 2024

github-actions bot added the backport-done/1.16 label and removed the backport-pending/1.16 label Sep 13, 2024

cilium-release-bot bot mentioned this pull request Sep 20, 2024

Prepare for release v1.16.2 #34973

Merged
