ipcache: move CIDR restoration to asynchronous APIs #28673
Conversation
I had an idea for how we might do this, and it only took a few hours to sketch out and implement. Looking for feedback, especially from @joestringer @jrajahalme
/test
Not that this is ready to go, but CI is green :-)
Seems like a reasonable approach in general, though I suspect that in the details the current draft isn't quite working as intended due to the lack of an unreserve before allocation; see feedback below.
I might suggest splitting the identity allocation logic & tests into a dedicated commit for easier subsequent review & to keep small commits, but that's as much personal preference as anything.
Force-pushed from f8c0dcd to 9c71941.
@joestringer I updated the PR based on your feedback. No big changes, just some small cleanups. The biggest change is explicitly managing the reservation lifetime. I also renamed the metadata bit from RestoredIdentity to RequestedIdentity to better match its semantics.
/test
Force-pushed from 9c71941 to 8c0c7e8.
/test
Force-pushed from 8c0c7e8 to 5a0e3b8.
Force-pushed from 5a0e3b8 to 63f584a.
/test
I went back through; logically it seems sound to me. I spaced out a little bit going through the main restoreIPCache() logic patch due to the combination of refactor + logical change, so that could do with a revisit. I did notice some discrepancies there that need to be addressed, and ideally that should be split into two patches. Other than that it's mostly minor nits around function comments and logs, and one unnecessary goroutine. The corner case handling of identity allocation from the reserved range when under identity pressure wasn't entirely clear to me either. We can discuss further in the threads below.
Force-pushed from 7136bcd to 1720bbf.
/test
I was stress-testing this PR with the FQDN proxy and found a nasty bug where prefixes were incorrectly removed from the BPF ipcache while still in use by FQDN. The problem is that the FQDN proxy does not call UpsertGeneratedIdentities() if an identity is not new, and thus the source of a prefix is not "upgraded".
Luckily, it was an easy fix: since the set of all referenced identities is passed to UpsertGeneratedIdentities(), we can overwrite createdFromMetadata retroactively and prevent deletion.
I've added a new commit and test case for this scenario.
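Roughly, the shape of the fix looks like the sketch below (hypothetical, simplified types rather than Cilium's actual ipcache structures): when the full set of still-referenced identities is passed in, entries that already exist are retroactively marked as metadata-owned, so the restoration cleanup pass leaves them alone.

```go
// Toy model of the retroactive createdFromMetadata "upgrade" described above.
package main

import "fmt"

type entry struct {
	identity            uint32
	createdFromMetadata bool
}

type ipcache struct {
	entries map[string]*entry // prefix -> entry
}

// upsertGeneratedIdentities mimics the fix: even when an identity already
// exists and no new upsert is needed, flip createdFromMetadata so the prefix
// is treated as owned by the metadata layer and survives cleanup.
func (c *ipcache) upsertGeneratedIdentities(used map[string]uint32) {
	for prefix, id := range used {
		if e, ok := c.entries[prefix]; ok {
			e.createdFromMetadata = true // retroactive "upgrade"
			continue
		}
		c.entries[prefix] = &entry{identity: id, createdFromMetadata: true}
	}
}

// cleanupRestored deletes only entries that nothing has claimed via metadata.
func (c *ipcache) cleanupRestored() {
	for prefix, e := range c.entries {
		if !e.createdFromMetadata {
			delete(c.entries, prefix)
		}
	}
}

func main() {
	c := &ipcache{entries: map[string]*entry{
		"10.0.0.0/24": {identity: 16777216}, // restored, not yet claimed
	}}
	c.upsertGeneratedIdentities(map[string]uint32{"10.0.0.0/24": 16777216})
	c.cleanupRestored()
	fmt.Println(len(c.entries)) // 1: kept, because FQDN still references it
}
```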
Nice find. A couple of questions:
- What does the stress testing look like? I think it would be valuable to upstream it at some point so when we have an infrastructure for stress testing in the future, we have the FQDN proxy covered.
- Does this bug exist in main, or was it something that arose from what this PR is solving? If it's the former, then I would suggest a separate PR so that it can be backported separately from this one. Additionally, do we have an idea of how we missed it before?
Nothing exciting, just ensuring that FQDN policies still work while spamming restarts. We have a test case that covers this already; it just needs to be updated to wait for the restore grace period to end.
I don't think so. The specific interaction that's problematic is
Moving CIDR restoration over to the new API is what triggered this. That said, I'm sure there's some permutation that I'm not thinking of, so it can't hurt to split it out.
The test failures look interesting. One of them is a bug in my new test; the other is the check that FQDN works after restart. The latter is failing, which is concerning. Works On My Machine, of course.
Force-pushed from 1720bbf to 15e2d35.
Found another missing case where we were stepping on toes, this time when allocating identities after a restart. Phew. That should fix it.
/test
Force-pushed from 15e2d35 to 97f18f9.
CI was green except for a broken Travis job. Rebased on main.
/test
Just one nit to fix up.
They do the same thing as their scalar equivalents, but can apply updates to multiple prefixes under a single lock, which reduces contention. Also, batch up calls to enqueuePrefixUpdates() where relevant. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
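For illustration, a minimal sketch of why a batched variant helps (a hypothetical store and method names, not the real ipcache API): one lock acquisition covers the whole batch instead of one lock/unlock cycle per prefix.

```go
// Toy metadata store contrasting per-prefix updates with a batched update.
package main

import "sync"

type metadataStore struct {
	mu   sync.Mutex
	meta map[string][]string // prefix -> labels
}

// UpsertMetadata updates a single prefix; callers pay the lock cost per call.
func (s *metadataStore) UpsertMetadata(prefix string, labels ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.meta[prefix] = append(s.meta[prefix], labels...)
}

// UpsertMetadataBatch does the same work for many prefixes under one lock,
// reducing contention when a caller already has a large set of updates ready.
func (s *metadataStore) UpsertMetadataBatch(updates map[string][]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for prefix, labels := range updates {
		s.meta[prefix] = append(s.meta[prefix], labels...)
	}
}

func main() {
	s := &metadataStore{meta: map[string][]string{}}
	s.UpsertMetadata("10.0.2.0/24", "cidr:10.0.2.0/24", "reserved:world")
	s.UpsertMetadataBatch(map[string][]string{
		"10.0.0.0/24": {"cidr:10.0.0.0/24", "reserved:world"},
		"10.0.1.0/24": {"cidr:10.0.1.0/24", "reserved:world"},
	})
}
```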
This allows local numeric identities to be withheld from the allocation pool. Instead, they can only be used when explicitly requested, e.g. when supplied via the oldNID parameter to Allocate(). The purpose of this is to allow for a stable prefix -> nID mapping, even on agent restarts. As part of the startup process, existing identities will be reserved so that existing prefixes can "claim" them (in a subsequent commit). Signed-off-by: Casey Callendrello <cdc@isovalent.com>
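Roughly, the mechanism could look like the following sketch (a toy allocator with invented names, not Cilium's actual identity allocator): withheld IDs are skipped by the normal free-ID scan, but are handed out when explicitly requested via oldNID.

```go
// Toy local allocator demonstrating withheld numeric identities.
package main

import (
	"errors"
	"fmt"
)

type localAllocator struct {
	first, last uint32
	inUse       map[uint32]bool
	withheld    map[uint32]bool
}

func newLocalAllocator(first, last uint32) *localAllocator {
	return &localAllocator{
		first: first, last: last,
		inUse:    map[uint32]bool{},
		withheld: map[uint32]bool{},
	}
}

// Withhold keeps IDs out of the general pool until explicitly requested.
func (a *localAllocator) Withhold(ids []uint32) {
	for _, id := range ids {
		a.withheld[id] = true
	}
}

// Allocate prefers oldNID when it is still free (even if withheld); otherwise
// it scans for a free, non-withheld ID.
func (a *localAllocator) Allocate(oldNID uint32) (uint32, error) {
	if oldNID != 0 && !a.inUse[oldNID] {
		a.inUse[oldNID] = true
		delete(a.withheld, oldNID)
		return oldNID, nil
	}
	for id := a.first; id <= a.last; id++ {
		if !a.inUse[id] && !a.withheld[id] {
			a.inUse[id] = true
			return id, nil
		}
	}
	return 0, errors.New("pool exhausted")
}

func main() {
	a := newLocalAllocator(1, 10)
	a.Withhold([]uint32{1, 2}) // restored IDs, reserved for their old prefixes
	fmt.Println(a.Allocate(0)) // 3: the scan skips withheld IDs
	fmt.Println(a.Allocate(2)) // 2: explicitly requested, so it is handed back
}
```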
This adds an additional ipcache metadata field: the preferred numeric identity for a prefix. The local allocators already allow for requesting a given numeric ID for an identity (set of labels). This is useful when restoring prefix -> identity mappings on daemon restart, where we would like to end up with the same ipcache state as before the restart. When allocating an identity for a prefix with RestoredIdentity, the ipcache will now request that identity. The allocators already handle the case where a requested numeric ID is taken -- the request is merely ignored. So this is always safe to do. Likewise, if a set of labels is already allocated, the requested ID is ignored. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
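As a sketch of the metadata side (hypothetical types and helper; the real plumbing lives in the ipcache metadata and injection code), the prefix's requested numeric identity is passed along as a hint at allocation time and silently ignored if it is unavailable:

```go
// Toy model of a per-prefix requested identity used as an allocation hint.
package main

import "fmt"

type prefixInfo struct {
	labels            []string
	requestedIdentity uint32 // 0 means "no preference"
}

// allocateFor resolves a prefix to a numeric identity; 'taken' stands in for
// the allocator's view of already-used IDs, 'fallback' for normal allocation.
func allocateFor(info prefixInfo, taken map[uint32]bool, fallback uint32) uint32 {
	if nid := info.requestedIdentity; nid != 0 && !taken[nid] {
		return nid // honor the restored prefix -> identity mapping
	}
	return fallback // the request is simply ignored when unavailable
}

func main() {
	taken := map[uint32]bool{16777217: true}
	a := prefixInfo{labels: []string{"cidr:10.0.0.0/24"}, requestedIdentity: 16777216}
	b := prefixInfo{labels: []string{"cidr:10.0.1.0/24"}, requestedIdentity: 16777217}
	fmt.Println(allocateFor(a, taken, 16777300)) // 16777216: hint honored
	fmt.Println(allocateFor(b, taken, 16777301)) // 16777301: hint taken, ignored
}
```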
Now that we can withhold and request specific numeric identities, we can transition the ipcache restoration logic over. This change makes it so that prefixes request whatever numeric identity they previously had on restart. Once the watchers are synchronized, the ipcache will proceed with label injection, and the state should exactly match what it was before. Additionally, update some tests that mimicked the now-removed restoration logic. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
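A condensed, hypothetical sketch of that restore sequence (invented names, not the actual restoreIPCache() implementation): dump the old prefix-to-ID mappings, withhold those IDs and record them as requested identities, and once the initial sync completes, release any reservations whose prefixes never reappeared.

```go
// Toy restore flow: reserve old numeric IDs, then release unclaimed ones.
package main

import "fmt"

type restorer struct {
	withheld  map[uint32]bool
	requested map[string]uint32 // prefix -> requested numeric ID
}

// restore records the pre-restart state: keep each old ID out of the general
// pool and remember which prefix should get it back.
func (r *restorer) restore(dump map[string]uint32) {
	for prefix, nid := range dump {
		r.withheld[nid] = true
		r.requested[prefix] = nid
	}
}

// releaseUnused drops reservations for prefixes that never came back after the
// initial synchronization, so their IDs return to the pool.
func (r *restorer) releaseUnused(stillPresent map[string]bool) {
	for prefix, nid := range r.requested {
		if !stillPresent[prefix] {
			delete(r.withheld, nid)
			delete(r.requested, prefix)
		}
	}
}

func main() {
	r := &restorer{withheld: map[uint32]bool{}, requested: map[string]uint32{}}
	r.restore(map[string]uint32{
		"10.0.0.0/24": 16777216,
		"10.0.9.0/24": 16777299,
	})
	r.releaseUnused(map[string]bool{"10.0.0.0/24": true})
	fmt.Println(len(r.withheld), len(r.requested)) // 1 1
}
```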
This parameter is no longer needed, so let's reduce the confusion surface. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This adds an additional check to an existing test that ensures local (cidr) identities are stable after agent restart. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Force-pushed from 97f18f9 to 9f2927a.
/test
All approvals are in; only one non-required CI job is failing, and MLH has added ready-to-merge. Merging.
This PR contains a key insight: we don't need to insert restored CIDRs directly into the ipcache. Rather, we just need to ensure their identities are not used by any other prefixes.
This PR accomplishes this by adding a new identity reservation mechanism into the local allocator. It then plumbs up the ability to request a reserved (i.e. restored) identity when inserting into the ipcache. The primary benefit of this approach is that we don't have to try to back-fill labels based on a dump of the ipcache. Rather, we just reserve the already-existing numeric identities and let all the usual ipcache writers do their thing. When the initial k8s sync is finished and ipcache writing is allowed, all the restored identities should have the exact same labels they had before the agent restart - and all without any fancy logic!
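To make the benefit concrete, here is a toy illustration (hypothetical, heavily simplified, and not Cilium code) of the kube-apiserver case described below: identities are keyed by their label set, so back-filling a reduced label set at restore time forces a second allocation once the missing label arrives, while reserving the old ID and waiting for the full label set does not.

```go
// Toy label-keyed identity allocation, comparing back-fill vs. reservation.
package main

import "fmt"

// allocate returns the existing ID for a label-set key, or assigns the hint
// (when non-zero) or the next free ID. It stands in for identity allocation.
func allocate(byLabels map[string]uint32, key string, hint uint32, next *uint32) uint32 {
	if nid, ok := byLabels[key]; ok {
		return nid
	}
	nid := hint
	if nid == 0 {
		*next++
		nid = *next
	}
	byLabels[key] = nid
	return nid
}

func main() {
	next := uint32(16777300)

	// Old flow: restore back-fills only the CIDR labels, then the label set
	// changes when kube-apiserver shows up, producing a second numeric ID.
	old := map[string]uint32{}
	first := allocate(old, "cidr;world", 16777216, &next)
	second := allocate(old, "cidr;world;kube-apiserver", 0, &next)
	fmt.Println(first, second) // 16777216 16777301: the identity flapped

	// New flow: nothing is allocated until the full label set is known; the
	// reserved ID is passed as the hint, so the prefix keeps its old identity.
	fresh := map[string]uint32{}
	only := allocate(fresh, "cidr;world;kube-apiserver", 16777216, &next)
	fmt.Println(only) // 16777216: same ID, same labels as before the restart
}
```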
This PR has one significant immediate benefit: the reserved:kube-apiserver label no longer flaps on restart 🎉.

The previous flow:

1. On restart, restored prefixes are inserted into the ipcache with only their CIDR labels (cidr:xxxx, ... reserved:world). At this point, a numeric ID that previously had the kube-apiserver label has now lost it.
2. Once the Kubernetes watchers synchronize, the reserved:kube-apiserver label is added back to the IP.
3. This triggers ipcache.UpsertLabels(), which causes the kube-apiserver cidr to allocate a new numeric ID. This is because the set of labels allocated for the cidr didn't include the kube-apiserver label.

The new flow:

1. Restored prefixes only reserve their previous numeric identities; nothing is inserted into the ipcache directly.
2. Once the watchers are synchronized, label injection calls ipcache.UpsertLabels(), which will allocate an identity with the same set of labels as before, and use the previous numeric ID for that prefix.

As a bonus, we can remove the special hack for re-creating node-cidr labels. That was added in resolveIdentity(), but it really didn't belong there.

If we decide this is the right way forward, outstanding items are: