doc:ipsec:kvstore: explicit limitations that could lead to staling XFRM states and no connectivity #39719
59cc003 to cbeb3eb
/test
I provided a few general tips to follow the docs-structure style that we are aiming for in the Cilium docs. I will defer on the technical content to the relevant code owner. See the comments below for more details.
cbeb3eb to 2e15efa
Thanks @joestringer. I have updated the PR according to your suggestions; let me know if I can improve it further 😃
Looks good to me for @cilium/docs-structure .
/test
I think this needs to be reworded a bit. As it stands, it seems likely to worry unaffected users and to be unactionable for affected users.
@pchaigno if you are requesting changes, please consider using the "request changes" response in GitHub. This helps to make the intent very clear both to the contributor and also to the bot that sets the ready-to-merge label.
2e15efa to 52be236
/test
That's a lot better IMO! One small suggestion below.
IPsec relies on XFRM states and policies to ensure encrypted pod-to-pod connectivity. In the XFRM state:

* seq is the incoming sequence number counter. It is used by the kernel to track and validate received packets. When a packet arrives, its sequence number is compared to the expected range to detect replays or out-of-order packets. Together with the oseq value, it helps implement the anti-replay window to prevent attackers from resending previously captured packets.
* oseq is the outgoing sequence number counter. It is used when sending packets protected by IPsec. Each outbound IPsec packet is assigned an incrementing oseq value. The oseq ensures unique sequence numbers for each packet, which the receiver uses to validate the order and detect replays.

In an SA between two nodes A and B, the seq/oseq values in node A's XFRM state must match the oseq/seq values in node B's XFRM state, and vice versa. If that is not the case, users would experience the `XfrmInStateProtoError` error, with no IPsec connectivity between the two nodes.

We noticed that a Cilium user might end up in this situation in any of the following cases, as stated in the doc changes:

1. KVStore Mode (e.g., etcd): if a Cilium agent connects too late to the newly created KVStore, it may miss the node delete and create events for entries that were restored or reinitialized. This results in stale XFRM states, causing permanent network disruption.
2. KVStore Mode: if a Cilium agent is down for a prolonged time, the corresponding node entry in the kvstore will be deleted due to lease expiration (15m), resulting in stale XFRM states.
3. CRD Mode: a similar issue may occur when a CiliumNode resource is deleted and the Cilium agent DaemonSet is restarted. While other agents will recreate fresh XFRM states for the new CiliumNode, the restarted agent may continue to hold obsolete XFRM states referencing all peer nodes.

The identified mitigation strategy for these scenarios is an IPsec key rotation, which would cause all the states to be consistently recreated in all Cilium agents.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
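The seq/oseq relationship described in the commit message above can be illustrated with a small model. The following is only a sketch under stated assumptions: `Sender`, `ReplayWindow`, and `stale_state_demo` are hypothetical names, the window size of 32 is arbitrary, and the code models a generic sliding anti-replay window, not the kernel's XFRM implementation or the error counter it increments.

```python
class Sender:
    """Hypothetical model of the sending side: oseq grows by one per packet."""

    def __init__(self):
        self.oseq = 0

    def next_seq(self):
        self.oseq += 1
        return self.oseq


class ReplayWindow:
    """Hypothetical model of a receiver-side sliding anti-replay window."""

    def __init__(self, size=32):
        self.size = size    # how many recent sequence numbers are tracked
        self.highest = 0    # highest sequence number accepted so far ("seq")
        self.bitmap = 0     # bit i set => (highest - i) was already accepted

    def accept(self, seq):
        """Return True and record seq if it is fresh, otherwise False."""
        if seq == 0:
            return False
        if seq > self.highest:
            # Slide the window forward and mark the new highest as seen.
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False    # older than the window: rejected
        if self.bitmap & (1 << offset):
            return False    # already seen: replay
        self.bitmap |= 1 << offset
        return True


def stale_state_demo():
    """Count accepted packets before and after one side's state is recreated."""
    sender, receiver = Sender(), ReplayWindow()
    before = sum(receiver.accept(sender.next_seq()) for _ in range(100))

    # Only the sender's state is recreated; the receiver keeps its old one.
    sender = Sender()
    stale = sum(receiver.accept(sender.next_seq()) for _ in range(5))

    # Both sides recreate their state, as a key rotation would cause.
    sender, receiver = Sender(), ReplayWindow()
    fresh = sum(receiver.accept(sender.next_seq()) for _ in range(5))
    return before, stale, fresh


if __name__ == "__main__":
    print(stale_state_demo())
```

In this model, packets from the recreated sender fall behind the receiver's retained window and are rejected until both sides start from fresh state, which mirrors why recreating the states on all agents restores connectivity.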
52be236 to 65a115d
/test
IPsec relies on XFRM states and policies to ensure encrypted pod-to-pod connectivity. In the XFRM state:

* seq is the incoming sequence number counter. It is used by the kernel to track and validate received packets. When a packet arrives, its sequence number is compared to the expected range to detect replays or out-of-order packets. Together with the oseq value, it helps implement the anti-replay window to prevent attackers from resending previously captured packets.
* oseq is the outgoing sequence number counter. It is used when sending packets protected by IPsec. Each outbound IPsec packet is assigned an incrementing oseq value. The oseq ensures unique sequence numbers for each packet, which the receiver uses to validate the order and detect replays.

In an SA between two nodes A and B, the seq/oseq values in node A's XFRM state must match the oseq/seq values in node B's XFRM state, and vice versa. If that is not the case, users would experience the `XfrmInStateProtoError` error, with no IPsec connectivity between the two nodes.

We noticed that a Cilium user might end up in this situation in both of the following cases, as stated in the doc changes:

1. KVStore Mode (e.g., etcd): if a Cilium agent connects too late to the newly created KVStore, it may miss the node delete and create events for entries that were restored or reinitialized. This results in stale XFRM states, causing permanent network disruption. Please note that operations such as etcd removal and recreation are not supported.
2. CRD Mode: a similar issue may occur when a CiliumNode resource is deleted and the Cilium agent DaemonSet is restarted. While other agents will recreate fresh XFRM states for the new CiliumNode, the restarted agent may continue to hold obsolete XFRM states referencing all peer nodes.

The identified mitigation strategy for these scenarios is a proper IPsec key rotation, which would cause all the states to be consistently recreated in all Cilium agents.
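One way to check for the symptom named above is to watch the `XfrmInStateProtoError` counter, which Linux exposes in `/proc/net/xfrm_stat` when the kernel is built with XFRM statistics. The following is a minimal sketch, assuming it runs in the node's host network namespace; `parse_xfrm_stat` and `read_proto_errors` are hypothetical helper names, not part of Cilium.

```python
def parse_xfrm_stat(text):
    """Parse 'Name <whitespace> Value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a name followed by a non-negative integer.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_proto_errors(path="/proc/net/xfrm_stat"):
    """Return the current XfrmInStateProtoError count, or 0 if it is absent."""
    with open(path) as stat_file:
        counters = parse_xfrm_stat(stat_file.read())
    return counters.get("XfrmInStateProtoError", 0)


if __name__ == "__main__":
    # Sample text in the same layout as /proc/net/xfrm_stat (values made up).
    sample = "XfrmInError             \t0\nXfrmInStateProtoError   \t42\n"
    print(parse_xfrm_stat(sample)["XfrmInStateProtoError"])
```

Reading the counter twice, a few seconds apart, and comparing the values shows whether the error is still occurring; a counter that keeps growing while pod-to-pod traffic between two nodes fails is consistent with the scenarios listed above.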