Skip to content

ipcache: move CIDR restoration to asynchronous APIs #28673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 9, 2023

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Oct 18, 2023

This PR contains a key insight: we don't need to insert restored CIDRs directly in to the ipcache. Rather, we just need to ensure their identities are not used by any other prefixes.

This PR accomplishes this by adding a new identity reservation mechanism in to the local allocator. It then plumbs up the ability to request a reserved (i.e. restored) identity when inserting in to the ipcache. The primary benefit of this approach is that we don't have to try and back-fill labels based on a dump of the ipcache. Rather, we just reserve already-existing numeric identities and let all the usual ipcache writers do their thing. When the initial k8s sync is finished and ipcache writing is allowed, then all the restored identities should have the exact same labels they had before agent restart - and all without any fancy logic!

This PR has one significant immediate benefit: the reserved:kube-apiserver label no longer flaps on restart 🎉 .

The previous flow:

  1. Dump the ipcache, listing (prefix -> id) pairs.
  2. Upsert that prefix -> id directly in the ipcache, guessing at the set of labels (cidr:xxxx, ... reserved:world). At this point, a numeric ID that previously had the kube-apiserver label has now lost it.
  3. Perform a k8s sync, adding the reserved:kube-apiserver label back to the IP.
  4. Do an ipcache.UpsertLabels(), which causes the kube-apiserver cidr to allocate a new numeric ID. This is because the set of labels allocated for the cidr didn't include the kube-apiserver label.

The new flow:

  1. Dump the ipcache, reserving all IDs from allocation
  2. Upsert the prefix in to the ipcache metadata layer, tagging it as having a restored identity.
  3. Perform a k8s sync, adding in all relevant metadata / labels
  4. Do an ipcache.UpsertLabels(), which will allocate an identity with the same set of labels as before, and use the previous numeric ID for that prefix.

As a bonus, we can remove the special hack for re-creating node-cidr labels. That was added in resolveIdentity() but it really didn't belong there.

If we decide this is the right way forward, outstanding items are:

  • Add test that ensures identities remain constant after restarts
  • Clean up code comments
  • Test extensively with FQDN policies

@squeed squeed requested a review from joestringer October 18, 2023 10:29
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 18, 2023
@squeed
Copy link
Contributor Author

squeed commented Oct 18, 2023

I had an idea for how we might do this, and it only took a few hours to sketch out and implement. Looking for feedback, especially from @joestringer @jrajahalme

@squeed
Copy link
Contributor Author

squeed commented Oct 18, 2023

/test

@squeed squeed requested a review from jrajahalme October 18, 2023 12:47
@squeed
Copy link
Contributor Author

squeed commented Oct 19, 2023

Not that this is ready to go, but CI is green :-)

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable approach in general, though I suspect in the details the current draft isn't quite working as intended due to the lack of unreserve before allocation, see feedback below.

I might suggest splitting the identity allocation logic & tests into a dedicated commit for easier subsequent review & to keep small commits, but that's as much personal preference as anything.

@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Oct 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 24, 2023
@squeed squeed force-pushed the cidr-restore-refactor branch from f8c0dcd to 9c71941 Compare October 26, 2023 12:45
@squeed
Copy link
Contributor Author

squeed commented Oct 26, 2023

@joestringer I updated the PR based on your feedback. No big changes, just some small cleanups. The biggest change is explicitly managing reservation lifetime.

I also renamed the metadata-bits from RestoredIdentity to RequestedIdentity to better match semantics.

@squeed
Copy link
Contributor Author

squeed commented Oct 26, 2023

/test

@squeed squeed force-pushed the cidr-restore-refactor branch from 9c71941 to 8c0c7e8 Compare October 30, 2023 11:41
@squeed
Copy link
Contributor Author

squeed commented Oct 30, 2023

/test

@squeed squeed self-assigned this Oct 31, 2023
@squeed squeed force-pushed the cidr-restore-refactor branch from 8c0c7e8 to 5a0e3b8 Compare October 31, 2023 20:29
@squeed squeed marked this pull request as ready for review October 31, 2023 20:29
@squeed squeed requested review from a team as code owners October 31, 2023 20:29
@squeed squeed requested a review from christarazi October 31, 2023 20:29
@squeed squeed force-pushed the cidr-restore-refactor branch from 5a0e3b8 to 63f584a Compare November 1, 2023 08:54
@squeed
Copy link
Contributor Author

squeed commented Nov 1, 2023

/test

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back through, logically it seems sound to me. I spaced out a little bit going through the main restoreIPCache() logic patch due to the combination of refactor + logical change, so that could do with a revisit. I did notice some discrepancies there that need to be addressed, and ideally that should be split into two patches. Other than that it's mostly minor nits around function comments and logs, and one unnecessary goroutine. The corner case handling of identity allocation from the reserved range when under identity pressure wasn't entirely clear to me either. We can discuss further in the threads below.

@squeed squeed force-pushed the cidr-restore-refactor branch from 7136bcd to 1720bbf Compare November 6, 2023 15:49
@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2023

/test

@squeed squeed removed the request for review from jrajahalme November 6, 2023 16:15
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was stress-testing this PR with FQDN proxy and found a nasty bug where prefixes were incorrectly removed from the BPF ipcache while still in use with FQDN. The problem fqdn proxy does not do an UpsertGeneratedIdentities() if an identity is not new, and thus the source of a prefix is not "upgraded".

Luckily, it was an easy fix: since the set of all referenced identities is passed to UpsertGeneratedIdentities(), we can overwrite createdFromMetadata retroactively and prevent deletion.

I've added a new commit and test case for this scenario.

Nice find. A couple of questions:

  1. What does the stress testing look like? I think it would be valuable to upstream it at some point so when we have an infrastructure for stress testing in the future, we have the FQDN proxy covered.
  2. Does this bug exist in main or was it something that arose from what this PR is solving? If it's the former, then I would suggest a separate PR so that it can be backported separately from this PR. Additionally, do we have an idea on how we missed it before?

@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2023

  1. What does the stress testing look like? I think it would be valuable to upstream it at some point so when we have an infrastructure for stress testing in the future, we have the FQDN proxy covered.

Nothing exciting, just ensuring that fqdn policies still work while spamming restarts. We have a test case that covers this already, it just needs to be updated to wait for the restore grace period to stop.

  1. Does this bug exist in main or was it something that arose from what this PR is solving? If it's the former, then I would suggest a separate PR so that it can be backported separately from this PR. Additionally, do we have an idea on how we missed it before?

I don't think so. The specific interaction that's problematic is

  1. UpsertMetadata
  2. AllocateCIDRs & UpsertGeneratedIdentities
  3. RemoveMetadata

Moving CIDR restoration over to the new API is what triggered this. That said, I'm sure there's some permutation that I'm not thinking of, so it can't hurt to split out.

@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2023

Test failures look interesting. One of them is a bug in my new test, the other is ensuring fqdn works after restart. The latter is failing, which is concerning. Works On My Machine, of course.

@squeed squeed force-pushed the cidr-restore-refactor branch from 1720bbf to 15e2d35 Compare November 6, 2023 20:15
@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2023

Found another missing case where we were stepping on toes, this time when allocating identities after a restart. Phew. That should fix it.

@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2023

/test

@squeed
Copy link
Contributor Author

squeed commented Nov 8, 2023

CI was green except a broken Travis. Rebased on main.

@squeed
Copy link
Contributor Author

squeed commented Nov 8, 2023

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 8, 2023
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit to fix up.

@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. area/daemon Impacts operation of the Cilium daemon. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Nov 8, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 8, 2023
They do the same thing as their scalar equivalents, but can perform
updates to multiple prefixes with a single lock, which saves contention.

Also, batch up calls to enqueuePrefixUpdates() where relevant.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This allows local numeric identities to be held out of the pool for
allocation. Instead, they can only be used when explicitly requested,
e.g. when supplied via the oldNID parameter to Allocate()

The purpose of this is to allow for a stable prefix -> nID mapping, even
on agent restarts. As part of the startup process, existing identities
will be reserved so that existing prefixes can "claim" them (in a
subsequent commit).

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This adds an additional ipcache metadata field: the preferred numeric
identity for a prefix. The local allocators already allow for requesting
a given numeric ID for an identity (set of labels). This is useful when
restoring prefix -> identity mappings on daemon restart, where we would
like to end up with the same ipcache state as before the restart.

When allocating an identity for a prefix with RestoredIdentity, the
ipcache will now request that identity. The allocators already handle
the case where a requested numeric ID is taken -- the request is merely
ignored. So this is always safe to do. Likewise, if a set of labels is
already allocated, the requested ID is ignored.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Now that we can withhold and request specific numeric identities, we can
transition the ipcache restoration logic over. This change makes it so
that prefixes request whatever previous numeric identity they had on
restart. Once the watchers are synchronized, the ipcache will proceed
with label injection, and the state should exactly match that what was
before.

Additionally, update some tests that mimicked the now-removed restoration
logic.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This parameter is no longer needed, so let's reduce the confusion
surface.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This adds an additional check to an existing test that ensures local
(cidr) identities are stable after agent restart.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed force-pushed the cidr-restore-refactor branch from 97f18f9 to 9f2927a Compare November 8, 2023 19:05
@squeed
Copy link
Contributor Author

squeed commented Nov 8, 2023

/test

@squeed
Copy link
Contributor Author

squeed commented Nov 9, 2023

All approvals in, only one non-required CI job failing, MLH has added ready-to-merge. Merging.

@squeed squeed merged commit ee810d7 into cilium:main Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants