
ipam: Fix inconsistent update of CiliumNodes #19923


Merged
merged 2 commits into cilium:master from fix-inconsistent-ciliumnode-state on Jun 7, 2022

Conversation


@pchaigno pchaigno commented May 23, 2022

Currently, when the cilium-operator attaches new ENIs to a node, we update the corresponding CiliumNode in two steps: first the .Status, then the .Spec. That can result in an inconsistent state, where the CiliumNode .Spec.IPAM.Pool contains new IP addresses associated with the new ENI, while .Status.ENI.ENIs is still missing the ENI. This inconsistency manifests as a fatal:

level=fatal msg="Error while creating daemon" error="Unable to allocate router IP for family ipv4: failed to associate IP 10.12.14.5 inside CiliumNode: unable to find ENI eni-9ab538c64feb9f59e" subsys=daemon

This inconsistency occurs because the following can happen:

  1. cilium-operator attaches a new ENI to the CiliumNode.
  2. Still at cilium-operator, .Spec is synced with kube-apiserver. The IP pool is updated with a new set of IP addresses and the new ENI.
  3. The agent receives this half-updated CiliumNode.
  4. It allocates an IP address for the router from the pool of IPs attached to the new ENI, using .Spec.IPAM.Pool.
  5. It fails because the new ENI is not listed in the .Status.ENI.ENIs of the CiliumNode object.
  6. At cilium-operator, .Status is updated with the new ENI.

But wait, you said .Status is updated before .Spec in the function you linked? Yes, but we read the state used to populate the CiliumNode from two separate places (n.ops.manager.instances and n.available) in the syncToAPIServer function, and nothing prevents a half-updated state (one place updated, the other not) in the middle of that function. We lock twice, once for each place, instead of once for the whole CiliumNode update. So observing a half-updated state in the middle of the function is effectively the same as updating .Spec first and .Status second.

We can fix this by first creating a snapshot of the pool, then writing the .Status metadata (which may be more recent than the pool snapshot; that is safe, see the comment in the source code of this patch), and then writing the pool snapshot to .Spec. This ensures that .Status is always updated before .Spec and, at the same time, that .Status is never older than the data written to .Spec.
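
To make the ordering concrete, here is a minimal Go sketch of the idea. The types and helper names (node, writeStatus, writeSpec, etc.) are illustrative stand-ins, not the actual code in pkg/ipam/node.go:

```go
package ipamsketch

import "sync"

// Simplified, hypothetical stand-ins for the real IPAM state; the actual
// logic lives in (*Node).syncToAPIServer in pkg/ipam/node.go.
type eniInfo struct{ ID string }            // published to .Status.ENI.ENIs
type allocationIP struct{ Resource string } // ENI an IP belongs to, in .Spec.IPAM.Pool

type node struct {
	mutex     sync.RWMutex
	available map[string]allocationIP // feeds .Spec.IPAM.Pool
	enis      map[string]eniInfo      // feeds .Status.ENI.ENIs (read under its own lock in the real code)
}

// syncSketch shows only the ordering the patch establishes: snapshot the
// pool, write .Status, then write the snapshot to .Spec.
func (n *node) syncSketch(writeStatus func(map[string]eniInfo) error, writeSpec func(map[string]allocationIP) error) error {
	// 1. Snapshot the pool first. Everything written to .Spec below comes
	//    from this snapshot, never from a later, fresher read.
	n.mutex.RLock()
	pool := make(map[string]allocationIP, len(n.available))
	for ip, alloc := range n.available {
		pool[ip] = alloc
	}
	n.mutex.RUnlock()

	// 2. Write .Status (the ENI list). This state may be newer than the pool
	//    snapshot, which is safe: an extra ENI in .Status is harmless, a
	//    missing one is what caused the fatal above.
	if err := writeStatus(n.enis); err != nil {
		return err
	}

	// 3. Only now write the (possibly slightly stale) pool snapshot to
	//    .Spec.IPAM.Pool: every ENI the snapshot references is guaranteed
	//    to already be present in .Status.ENI.ENIs.
	return writeSpec(pool)
}
```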

Fix race condition leading to inconsistent CiliumNode that can cause the agent to fatal.

@pchaigno pchaigno added kind/bug (This is a bug in the Cilium logic.), area/eni (Impacts ENI based IPAM.), and sig/ipam labels on May 23, 2022
@maintainer-s-little-helper bot added the dont-merge/needs-release-note-label (The author needs to describe the release impact of these changes.) label on May 23, 2022
@pchaigno pchaigno added the release-note/bug (This PR fixes an issue in a previous release of Cilium.) label on May 23, 2022
@maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label (The author needs to describe the release impact of these changes.) label on May 23, 2022
@pchaigno pchaigno force-pushed the fix-inconsistent-ciliumnode-state branch 2 times, most recently from 988f255 to efcd365 on May 24, 2022 16:02
pchaigno and others added 2 commits June 1, 2022 16:11
Currently, when the cilium-operator attaches new ENIs to a node, we
update the corresponding CiliumNode in two steps: first the .Status,
then the .Spec [1]. That can result in an inconsistent state, where the
CiliumNode .Spec.IPAM.Pool contains new IP addresses associated with the
new ENI, while .Status.ENI.ENIs is still missing the ENI. This
inconsistency manifests as a fatal:

    level=fatal msg="Error while creating daemon" error="Unable to allocate router IP for family ipv4: failed to associate IP 10.12.14.5 inside CiliumNode: unable to find ENI eni-9ab538c64feb9f59e" subsys=daemon

This inconsistency occurs because the following can happen:
1. cilium-operator attaches a new ENI to the CiliumNode.
2. Still at cilium-operator, .Spec is synced with kube-apiserver. The IP
   pool is updated with a new set of IP addresses and the new ENI.
3. The agent receives this half-updated CiliumNode.
4. It allocates an IP address for the router from the pool of IPs
   attached to the new ENI, using .Spec.IPAM.Pool.
5. It fails because the new ENI is not listed in the .Status.ENI.ENIs of
   the CiliumNode object.
6. At cilium-operator, .Status is updated with the new ENI.

But wait, you said .Status is updated before .Spec in the function you
linked? Yes, but we read the state to populate CiliumNode from two
separate places (n.ops.manager.instances and n.available) in the
syncToAPIServer function and we don't have anything to prevent having a
half updated (one place only) state in the middle of the update function.
We lock twice, once for each place, instead of once for the whole
CiliumNode update. So having a half updated state in the middle of the
function would technically be the same as updating .Spec first and
.Status second.

We can fix this by first creating a snapshot of the pool, then writing
the .Status metadata (which may be more recent than the pool snapshot,
which is safe, see comment in the source code of this patch), and then
writing the pool to .Spec. This ensures that .Status is always updated
before .Spec, while at the same time ensuring that .Status is never
older than .Spec.

1 - https://github.com/cilium/cilium/blob/v1.12.0-rc2/pkg/ipam/node.go#L966-L1012

Co-authored-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The `node.Spec.IPAM.Pool` value is always overwritten after the removed
`if` statement, so there is no need to initialize it.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
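
As a rough illustration of this cleanup (hypothetical, simplified types, not the exact diff):

```go
// Hypothetical, simplified shape of the removed dead store: the `if` branch
// initialized the pool field, but the field is unconditionally overwritten
// right after, so the initialization had no effect.
type specSketch struct {
	Pool map[string]string // stands in for node.Spec.IPAM.Pool
}

func fillPool(spec *specSketch, snapshot map[string]string) {
	if spec.Pool == nil {
		spec.Pool = map[string]string{} // dead store: removed by this commit
	}
	spec.Pool = snapshot // always overwrites the value assigned above
}
```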
@gandro gandro force-pushed the fix-inconsistent-ciliumnode-state branch from efcd365 to 559251a on June 1, 2022 14:59

gandro commented Jun 1, 2022

I'm taking over the PR from Paul. Instead of trying to move around the mutexes, I decided to follow a slightly different approach where we take a snapshot of the IPAM Pool first. @christarazi would be great if you can take a look.

Comment on lines +973 to +976
// When an IP is removed, this is also safe. IP release is done via
// handshake, where the agent will never use any IP where it has
// acknowledged the release handshake. Therefore, having an already
// released IP in the pool is fine, as the agent will ignore it.
@hemanthmalla Since you are more familiar with the release logic, could you confirm that this argument is sound?

Yes, the agent will ignore IPs that are in the middle of a release handshake when allocating, so it's safe. See https://github.com/DataDog/cilium/blob/b5fd71b9f873942764e90e41b39821ea48152832/pkg/ipam/crd.go#L564-L566
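
For illustration, a minimal sketch of that guard with hypothetical names (not the actual crd.go code): an IP whose release handshake has been acknowledged is simply skipped when picking an address from the pool.

```go
// Hypothetical sketch: IPs in the middle of a release handshake are never
// handed out, so a stale entry in .Spec.IPAM.Pool is ignored by the agent.
func pickIP(pool map[string]string, releaseAcked map[string]bool) (string, bool) {
	for ip := range pool {
		if releaseAcked[ip] {
			continue // release handshake acknowledged, skip this IP
		}
		return ip, true
	}
	return "", false
}
```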

@gandro gandro marked this pull request as ready for review June 1, 2022 15:27
@gandro gandro requested review from a team, twpayne and christarazi June 1, 2022 15:27

@christarazi christarazi left a comment


Makes sense to me. Is this fixing #18366?


@pchaigno pchaigno left a comment


Looks good to me and cleaner than the first approach!


pchaigno commented Jun 2, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:


Test Name: K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications

Failure Output: FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.


gandro commented Jun 2, 2022

Makes sense to me. Is this fixing #18366?

Not really. It addresses the second part ("there is a short time period with inconsistencies between status and spec, as these are not updated in an atomic fashion"), but #18366 also concerns itself with potential inconsistencies if one of the two updates fails, which could still happen.

This PR here only fixes the issue where .Spec and .Status are inconsistent even if the update to the custom resource succeeded. #18366 is broader, as it is also about dealing with cases where the update to the custom resource itself fails.

It should also be noted that this PR only works because the metadata in .Status is currently additive only. If we ever remove information from there without a handshake protocol (as we do for the IP release), there will be inconsistencies again: information could be removed from .Status while the older snapshot of .Spec (which we create here) still references it. But since this PR is also intended to be backported, I want to keep its impact as minimal as possible.

@aanm aanm merged commit eac0dee into cilium:master Jun 7, 2022
@pchaigno pchaigno deleted the fix-inconsistent-ciliumnode-state branch June 7, 2022 15:46
Labels
area/eni (Impacts ENI based IPAM.), backport-done/1.11 (The backport for Cilium 1.11.x for this PR is done.), kind/bug (This is a bug in the Cilium logic.), release-note/bug (This PR fixes an issue in a previous release of Cilium.)
7 participants