policy: Keep NameManager locked during SelectorCache operations #10501
Conversation
test-me-please
Who owns the fqdnSelector here? Is it guaranteed to be retired if len(f.users) goes to zero? Or can another goroutine find it and add a user while the lock is no longer held and the selectorManager is calling the callback?
Also, I think it would be helpful to future readers if the expectations around the cb were more clearly defined in comments somewhere here. It appears to be a Finally-type function, similar in some ways to some of the regeneration code, where if this CB is non-nil then the callers must call it when they're finished executing their own logic.
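That "Finally"-type contract could be sketched as below. The type and method names here are illustrative stand-ins, not Cilium's actual API: the point is only that a non-nil returned callback must be invoked by the caller after it finishes its own logic, typically once its locks are released.

```go
package main

import "fmt"

// fqdnSelector is a hypothetical stand-in for the selector under discussion.
type fqdnSelector struct {
	users map[string]struct{}
}

// removeUser deletes a user and, when the user count reaches zero, returns a
// "finally"-type callback. Contract: if the returned callback is non-nil, the
// caller MUST invoke it after completing its own logic (e.g. after releasing
// its locks), so the retirement notification runs outside the critical section.
func (f *fqdnSelector) removeUser(name string) (finally func()) {
	delete(f.users, name)
	if len(f.users) == 0 {
		return func() {
			fmt.Println("selector retired: notify NameManager")
		}
	}
	return nil
}
```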
@jrajahalme Did you have specific rationale around
@joestringer Changed the label(s), marking this for backport to 1.6 and 1.7 as well.
if user count gets to zero, the
Right, must be called if non-
Force-pushed from 9af49ca to 563231e
test-me-please
Force-pushed from 563231e to 73d10ba
test-me-please
Force-pushed from 73d10ba to a6c7362
test-me-please
Just a small nit, nothing blocking from me.
Force-pushed from ad77a6d to 8d67100
@raybejjani Changed interface function names to have a "Locked" suffix and discovered one unlocked call site while doing that, thanks!
test-me-please
One specific concern below around refcounting and CIDR identities, did you check on this before?
Force-pushed from 8d67100 to 5bf90eb
delete(d.selectors, selector)
}

func (d *DummyIdentityNotifier) InjectIdentitiesForSelector(fqdnSel api.FQDNSelector, ids []identity.NumericIdentity) {
exported method DummyIdentityNotifier.InjectIdentitiesForSelector should have comment or be unexported
selectors map[api.FQDNSelector][]identity.NumericIdentity
}

func NewDummyIdentityNotifier() *DummyIdentityNotifier {
exported function NewDummyIdentityNotifier should have comment or be unexported
"github.com/cilium/cilium/pkg/policy/api"
)

type DummyIdentityNotifier struct {
exported type DummyIdentityNotifier should have comment or be unexported
test-me-please
test-gke K8sFQDN.*
LGTM
test-gke K8sFQDN.*
I guess the GKE target isn't working as we'd hope, so for janitors / maintainers we can ignore that failure: "Build building or pushing Cilium images failed"
UpdateGenerateDNS updated the cache and generated selector updates as two
separate lock-holding sections. When updates happened concurrently, via
DNS lookups, DNS GC, or endpoint restores, the snapshot of data that
eventually updated the selectors could be applied out of order. This
created an A-B-A race between selectors that needed an update and those
that needed to be cleared.
These races have been observed on agent start with pending pods, but may
also manifest between multiple pods making DNS updates, or between the
DNS garbage collector controller and any pod.
The code now holds the NameManager lock for the entirety of
UpdateGenerateDNS and ForceRegenerateDNS. This forces the update to
complete in its entirety before allowing another to modify the FQDN
cache or registered selectors.
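A minimal sketch of that single-critical-section shape, with illustrative types rather than Cilium's real ones: one mutex acquisition covers both the FQDN cache write and the selector-update generation, so no concurrent caller can interleave between the two steps.

```go
package main

import "sync"

// nameManager is a hypothetical stand-in for NameManager. Its mutex is held
// across BOTH the cache update and the selector-update generation.
type nameManager struct {
	mu        sync.Mutex
	cache     map[string][]string // FQDN -> IPs
	selectors map[string][]string // selector -> snapshot of IPs it last saw
}

// updateGenerateDNS models UpdateGenerateDNS: previously the cache write and
// the selector update were two separate lock-holding sections; holding the
// lock across both closes the A-B-A window.
func (n *nameManager) updateGenerateDNS(name string, ips []string) {
	n.mu.Lock()
	defer n.mu.Unlock() // held until selector updates are generated

	n.cache[name] = ips // step 1: update the FQDN cache
	for sel := range n.selectors {
		// step 2: generate selector updates from the same snapshot,
		// inside the same critical section.
		n.selectors[sel] = ips
	}
}
```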
The locking order between the NameManager lock and the SelectorCache
lock has been accidental so far. When removing a selector user from
the cache, the SelectorCache lock is taken, and if an FQDN selector
has no more users, the NameManager is notified, which internally takes
the NameManager lock. Reverse this by explicitly locking the
NameManager for the duration of selector cache operations.
From now on, if the two locks are to be held at the same time, the
NameManager mutex MUST be taken first, the SelectorCache mutex second,
as the name manager is a higher level construct than the selector
cache.
Exposing 'Lock()' and 'Unlock()' operations in an interface is not
ideal, but it helps make the locking order explicit, so it also
serves as documentation of the required locking order.
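The shape being described could look roughly like the following. These are hypothetical types loosely modeled on the change, not Cilium's exact signatures: the notifier interface exposes Lock/Unlock, "Locked"-suffixed methods document that the caller already holds that lock, and the cache always takes the notifier's lock before its own.

```go
package main

import "sync"

// identityNotifier exposes its lock so the locking order can be enforced and
// documented: NameManager mutex first, SelectorCache mutex second.
type identityNotifier interface {
	Lock()
	Unlock()
	// RegisterForIdentityUpdatesLocked requires the notifier's lock to be held.
	RegisterForIdentityUpdatesLocked(selector string)
}

type nameManager struct {
	mu         sync.Mutex
	registered map[string]bool
}

func (n *nameManager) Lock()   { n.mu.Lock() }
func (n *nameManager) Unlock() { n.mu.Unlock() }

func (n *nameManager) RegisterForIdentityUpdatesLocked(sel string) {
	n.registered[sel] = true // caller holds n.mu
}

type selectorCache struct {
	mu       sync.Mutex
	notifier identityNotifier
}

// addFQDNSelector follows the required order: NameManager lock first,
// SelectorCache lock second. Reversing the order in any code path risks
// deadlock against a path that already holds the NameManager lock.
func (sc *selectorCache) addFQDNSelector(sel string) {
	sc.notifier.Lock()
	defer sc.notifier.Unlock()
	sc.mu.Lock()
	defer sc.mu.Unlock()
	sc.notifier.RegisterForIdentityUpdatesLocked(sel)
}
```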
NameManager now returns IDs even when the selector has already been
registered. This should help avoid a state where FQDNs are not passed
to the selector cache in the case of a race condition.
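As a sketch of that idempotent-registration behavior, under illustrative names rather than Cilium's exact signatures: a second, racing registration of the same selector still receives the currently known identities instead of an empty result.

```go
package main

// nameManager is a hypothetical stand-in; registration returns the known
// identities even when the selector was already registered, so a racing
// second caller can still seed the selector cache.
type nameManager struct {
	selectors map[string][]int // selector -> numeric identities
}

func (n *nameManager) registerForIdentityUpdatesLocked(sel string) []int {
	if ids, ok := n.selectors[sel]; ok {
		// Already registered: still hand back the known IDs.
		return ids
	}
	n.selectors[sel] = []int{}
	return nil
}
```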