policy: Keep NameManager locked during SelectorCache operations #10501

jrajahalme · 2020-03-06T19:41:04Z

UpdateGenerateDNS updated the cache and generated selector updates as two
separate lock-holding sections. When updates happened concurrently, via
DNS lookups, DNS GC, or endpoint restores, the snapshot of data that
eventually updated the selectors could be out-of-order. This created an
A-B-A race for selectors that needed an update and those that needed to
be cleared.
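
To make the out-of-order snapshot concrete, here is a minimal, deterministic sketch of the interleaving described above. The types `nameCache` and `selectorState` and all function names are hypothetical stand-ins, not the actual Cilium code; the two updaters are sequenced by hand rather than with goroutines so the stale result is reproducible.

```go
package main

import (
	"fmt"
	"sync"
)

// nameCache is a hypothetical stand-in for the FQDN cache (name -> IPs).
type nameCache struct {
	mu  sync.Mutex
	ips map[string][]string
}

// selectorState is a hypothetical stand-in for what the selectors last saw.
type selectorState struct {
	mu  sync.Mutex
	ips map[string][]string
}

// updateCache is the first lock-holding section: it stores the new IPs and
// returns a snapshot taken while the lock is held.
func (c *nameCache) updateCache(name string, ips []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ips[name] = ips
	return append([]string(nil), ips...)
}

// applySnapshot is the second, separate lock-holding section: it pushes a
// previously taken snapshot to the selectors.
func (s *selectorState) applySnapshot(name string, snap []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ips[name] = snap
}

// staleInterleaving replays one problematic ordering of two updaters A and B.
func staleInterleaving() (cached, selected []string) {
	c := &nameCache{ips: map[string][]string{}}
	s := &selectorState{ips: map[string][]string{}}

	snapA := c.updateCache("example.com", []string{"10.0.0.1"}) // A, section 1
	snapB := c.updateCache("example.com", []string{"10.0.0.2"}) // B, section 1
	s.applySnapshot("example.com", snapB)                       // B, section 2
	s.applySnapshot("example.com", snapA)                       // A, section 2 (stale)

	return c.ips["example.com"], s.ips["example.com"]
}

func main() {
	cached, selected := staleInterleaving()
	fmt.Println("cache:", cached, "selectors:", selected)
}
```

The cache ends up with B's data while the selectors end up with A's older snapshot, which is the kind of divergence this PR is meant to rule out.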

These races have been observed on agent starts with pending pods but may
manifest between multiple pods making DNS updates or the DNS garbage
collector controller and any pod.

The code now holds the NameManager lock for the entirety of
UpdateGenerateDNS and ForceRegenerateDNS. This forces the update to
complete in its entirety before allowing another to modify the FQDN
cache or registered selectors.
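
A minimal sketch of the idea, assuming a simplified `manager` type that is not the real NameManager: the cache write and the selector update happen inside a single critical section, so no other updater can slip in between them.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical, simplified stand-in for the NameManager.
type manager struct {
	sync.Mutex
	cache     map[string]string // name -> latest IP
	selectors map[string]string // name -> IP last pushed to the selectors
}

// update writes the cache and refreshes the selectors under one lock, so the
// two maps cannot be observed or modified in between.
func (m *manager) update(name, ip string) {
	m.Lock()
	defer m.Unlock()
	m.cache[name] = ip
	m.selectors[name] = m.cache[name]
}

// runConcurrentUpdates starts n concurrent updaters and returns the final
// cache and selector values for the same name.
func runConcurrentUpdates(n int) (cache, selector string) {
	m := &manager{cache: map[string]string{}, selectors: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.update("example.com", fmt.Sprintf("10.0.0.%d", i))
		}(i)
	}
	wg.Wait()

	m.Lock()
	defer m.Unlock()
	return m.cache["example.com"], m.selectors["example.com"]
}

func main() {
	c, s := runConcurrentUpdates(50)
	fmt.Println("cache and selectors agree:", c == s)
}
```

Which updater wins is still nondeterministic, but the cache and the selectors always agree on the winner.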

The locking order between the NameManager lock and the SelectorCache
lock has been accidental so far. When removing a selector user from
the cache, the SelectorCache lock is taken, and if an FQDN selector
has no more users, the NameManager is notified of this, which
internally takes the NameManager lock. Reverse this by explicitly
locking NameManager for the duration of selector cache operations.

From now on, if the two locks are to be held at the same time, the
NameManager mutex MUST be taken first, the SelectorCache mutex second,
as the name manager is a higher level construct than the selector
cache.
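
The ordering rule can be sketched as follows. The types `nameManager` and `selectorCache` here are reduced, hypothetical stand-ins with invented fields; the only point is that every path that needs both mutexes takes them in the same order, which is what rules out a lock-order deadlock.

```go
package main

import (
	"fmt"
	"sync"
)

// selectorCache is a hypothetical, reduced stand-in for the SelectorCache.
type selectorCache struct {
	mutex sync.Mutex
	users map[string]int // selector -> number of users
}

// nameManager is a hypothetical, reduced stand-in for the NameManager.
type nameManager struct {
	mutex sync.Mutex
	names map[string]bool // selectors the manager tracks
	cache *selectorCache
}

func newNameManager() *nameManager {
	return &nameManager{
		names: map[string]bool{},
		cache: &selectorCache{users: map[string]int{}},
	}
}

// addUser takes the name manager mutex first, the selector cache mutex second.
func (n *nameManager) addUser(sel string) {
	n.mutex.Lock()
	defer n.mutex.Unlock()
	n.cache.mutex.Lock()
	defer n.cache.mutex.Unlock()
	n.names[sel] = true
	n.cache.users[sel]++
}

// removeUser uses the same order; the last user also drops the tracked name.
func (n *nameManager) removeUser(sel string) {
	n.mutex.Lock()
	defer n.mutex.Unlock()
	n.cache.mutex.Lock()
	defer n.cache.mutex.Unlock()
	if n.cache.users[sel] == 0 {
		return
	}
	n.cache.users[sel]--
	if n.cache.users[sel] == 0 {
		delete(n.cache.users, sel)
		delete(n.names, sel)
	}
}

// tracked reports whether sel is still tracked and how many users it has.
func (n *nameManager) tracked(sel string) (bool, int) {
	n.mutex.Lock()
	defer n.mutex.Unlock()
	n.cache.mutex.Lock()
	defer n.cache.mutex.Unlock()
	return n.names[sel], n.cache.users[sel]
}

// churn adds and removes a user from many goroutines at once.
func churn(n *nameManager, sel string, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			n.addUser(sel)
			n.removeUser(sel)
		}()
	}
	wg.Wait()
}

func main() {
	n := newNameManager()
	churn(n, "*.example.com", 100)
	ok, users := n.tracked("*.example.com")
	fmt.Println("tracked:", ok, "users:", users)
}
```

If any path took the selector cache mutex first and then called back into the name manager, two goroutines could each hold one mutex while waiting for the other.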

Exposing 'Lock()' and 'Unlock()' operations in an interface is not
ideal, but it helps make the locking order explicit, so it also
serves as documentation on the required locking order.

NameManager now returns IDs even if the selector has already been
registered. This should help not get into a state where FQDNs are not
passed to the selector cache in case of a race condition.
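
As an illustration of that behavior, here is a small sketch with a hypothetical `registry` type; the field names and integer IDs are invented for the example. Registration is idempotent and the currently known IDs are returned on every call, not only on the first one.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the selector bookkeeping in the
// name manager.
type registry struct {
	mu         sync.Mutex
	registered map[string]bool
	ids        map[string][]int // selector -> currently known identities
}

// register records the selector if needed and always returns a copy of the
// currently known IDs, whether or not the selector was already registered.
func (r *registry) register(sel string) []int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.registered[sel] {
		r.registered[sel] = true
	}
	return append([]int(nil), r.ids[sel]...)
}

func newRegistry() *registry {
	return &registry{
		registered: map[string]bool{},
		ids:        map[string][]int{"*.example.com": {1001, 1002}},
	}
}

func main() {
	r := newRegistry()
	first := r.register("*.example.com")
	second := r.register("*.example.com") // already registered
	fmt.Println(first, second)
}
```

A caller that loses a race and registers second still receives the same IDs as the first caller.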

jrajahalme · 2020-03-06T19:41:13Z

test-me-please

joestringer

Who owns the fqdnSelector here? Is it guaranteed to be retired if the len(f.users) goes to zero? Or can another goroutine find it and add a user while the lock is no longer held and the selectorManager is calling the callback?

Also, I think it would be helpful to future readers if the expectations around the cb were more clearly defined in comments somewhere here. It appears to be a Finally type function, similar in some ways to some of the regeneration code, where if this CB is non-null then the callers must call it when they're finished executing their own logic.

coveralls · 2020-03-06T20:05:22Z

Coverage increased (+0.002%) to 45.64% when pulling 5bf90eb on pr/jrajahalme/policy-selector-cache-locking-order into 8ea7239 on master.

pkg/policy/selectorcache.go

joestringer · 2020-03-06T20:12:49Z

@jrajahalme Did you have specific rationale around release-note/minor here? This doesn't seem like it's user-facing so would be more suited to release-note/misc.

jrajahalme · 2020-03-06T20:39:45Z

@joestringer Changed the label(s), marking this for backport to 1.6 and 1.7 as well.

jrajahalme · 2020-03-06T20:57:46Z

@joestringer

> Who owns the fqdnSelector here? Is it guaranteed to be retired if the len(f.users) goes to zero? Or can another goroutine find it and add a user while the lock is no longer held and the selectorManager is calling the callback?

If the user count drops to zero, the fqdnSelector is removed from the selector cache while holding the lock, as before. Only the callback to the name manager is postponed, so the removed selector is no longer reachable from the cache at that point. That is, there is no behavioral change as far as selectors in the cache are concerned.

> Also, I think it would be helpful to future readers if the expectations around the cb were more clearly defined in comments somewhere here. It appears to be a Finally type function, similar in some ways to some of the regeneration code, where if this CB is non-null then the callers must call it when they're finished executing their own logic.

Right, must be called if non-nil. Will add comments :-)
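
A minimal sketch of that contract, using a hypothetical `cache` type and invented method names rather than the code in this PR: the locked helper removes the selector while the lock is held and hands back a callback, which the caller must invoke after unlocking whenever it is non-nil.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for the selector cache.
type cache struct {
	mu        sync.Mutex
	users     map[string]int
	onRemoved func(sel string) // notification hook, e.g. towards a name manager
}

func newCache(onRemoved func(string)) *cache {
	return &cache{users: map[string]int{}, onRemoved: onRemoved}
}

// AddUser registers one more user for sel.
func (c *cache) AddUser(sel string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.users[sel]++
}

// Len returns the number of selectors in the cache.
func (c *cache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.users)
}

// removeUserLocked must be called with c.mu held. When the last user goes
// away the selector is deleted immediately and a callback is returned.
// The caller MUST call a non-nil callback after releasing c.mu.
func (c *cache) removeUserLocked(sel string) func() {
	if c.users[sel] == 0 {
		return nil
	}
	c.users[sel]--
	if c.users[sel] > 0 {
		return nil
	}
	delete(c.users, sel)
	return func() { c.onRemoved(sel) }
}

// RemoveUser shows the calling convention: unlock first, then notify.
func (c *cache) RemoveUser(sel string) {
	c.mu.Lock()
	cb := c.removeUserLocked(sel)
	c.mu.Unlock()
	if cb != nil {
		cb()
	}
}

func main() {
	var c *cache
	// The hook takes c.mu again via Len; this only works because the
	// callback runs after RemoveUser has released the lock.
	c = newCache(func(sel string) {
		fmt.Println("removed", sel, "remaining selectors:", c.Len())
	})
	c.AddUser("*.example.com")
	c.RemoveUser("*.example.com")
}
```

Because the selector is already gone from the map when the callback runs, no other goroutine can find it and add a user in the meantime.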

jrajahalme · 2020-03-06T22:07:04Z

test-me-please

pkg/ipcache/cidr.go

jrajahalme · 2020-03-08T23:21:06Z

test-me-please

jrajahalme · 2020-03-09T13:50:13Z

test-me-please

raybejjani

Just a small nit, nothing blocking from me.

pkg/fqdn/name_manager.go

jrajahalme · 2020-03-09T16:39:26Z

@raybejjani Changed interface function names to have "Locked" suffix and discovered one unlocked call site while doing that, thanks!
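
For readers unfamiliar with the convention, a small sketch follows. The interface and method names are illustrative only and are not claimed to match the ones in this PR: a "Locked" suffix signals that the caller must already hold the lock, which the interface exposes through `Lock()` and `Unlock()`.

```go
package main

import (
	"fmt"
	"sync"
)

// notifier is a hypothetical interface. Methods with the "Locked" suffix
// must only be called between Lock() and Unlock().
type notifier interface {
	Lock()
	Unlock()
	RegisterLocked(sel string) []int
	UnregisterLocked(sel string)
}

// dummyNotifier is a trivial implementation used for illustration.
type dummyNotifier struct {
	sync.Mutex
	ids        map[string][]int
	registered map[string]bool
}

func newDummyNotifier() *dummyNotifier {
	return &dummyNotifier{
		ids:        map[string][]int{"*.example.com": {42}},
		registered: map[string]bool{},
	}
}

// RegisterLocked assumes the caller holds the lock.
func (d *dummyNotifier) RegisterLocked(sel string) []int {
	d.registered[sel] = true
	return d.ids[sel]
}

// UnregisterLocked assumes the caller holds the lock.
func (d *dummyNotifier) UnregisterLocked(sel string) {
	delete(d.registered, sel)
}

// addSelector shows the call-site pattern: the suffix makes a call made
// without holding the lock easy to spot in review.
func addSelector(n notifier, sel string) []int {
	n.Lock()
	defer n.Unlock()
	return n.RegisterLocked(sel)
}

func main() {
	d := newDummyNotifier()
	fmt.Println(addSelector(d, "*.example.com"))
}
```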

jrajahalme · 2020-03-09T16:39:35Z

test-me-please

joestringer

One specific concern below around refcounting and CIDR identities, did you check on this before?

pkg/policy/selectorcache_test.go

pkg/fqdn/name_manager.go

houndci-bot · 2020-03-09T23:45:24Z

pkg/testutils/identitynotifier.go

+	delete(d.selectors, selector)
+}
+
+func (d *DummyIdentityNotifier) InjectIdentitiesForSelector(fqdnSel api.FQDNSelector, ids []identity.NumericIdentity) {


exported method DummyIdentityNotifier.InjectIdentitiesForSelector should have comment or be unexported

houndci-bot · 2020-03-09T23:45:24Z

pkg/testutils/identitynotifier.go

+	selectors map[api.FQDNSelector][]identity.NumericIdentity
+}
+
+func NewDummyIdentityNotifier() *DummyIdentityNotifier {


exported function NewDummyIdentityNotifier should have comment or be unexported

houndci-bot · 2020-03-09T23:45:25Z

pkg/testutils/identitynotifier.go

+	"github.com/cilium/cilium/pkg/policy/api"
+)
+
+type DummyIdentityNotifier struct {


exported type DummyIdentityNotifier should have comment or be unexported

jrajahalme · 2020-03-09T23:45:29Z

test-me-please

joestringer · 2020-03-09T23:47:37Z

test-gke K8sFQDN.*

joestringer

LGTM

ianvernon · 2020-03-10T02:50:19Z

test-gke K8sFQDN.*

joestringer · 2020-03-10T04:28:27Z

I guess the GKE target isn't working as we'd hope, so for janitors / maintainers we can ignore that failure:

"Build building or pushing Cilium images failed"
https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/131/

jrajahalme added the wip and release-note/minor (This PR changes functionality that users may find relevant to operating Cilium.) labels Mar 6, 2020

jrajahalme requested a review from a team March 6, 2020 19:41

joestringer reviewed Mar 6, 2020

raybejjani reviewed Mar 6, 2020

pkg/policy/selectorcache.go (outdated, resolved)

jrajahalme force-pushed the pr/jrajahalme/policy-selector-cache-locking-order branch from 9af49ca to 563231e Compare March 6, 2020 22:06

jrajahalme requested a review from a team March 6, 2020 22:06

jrajahalme force-pushed the pr/jrajahalme/policy-selector-cache-locking-order branch from 563231e to 73d10ba Compare March 8, 2020 23:19

jrajahalme requested a review from a team as a code owner March 8, 2020 23:19

houndci-bot reviewed Mar 8, 2020

pkg/ipcache/cidr.go (outdated, resolved)

houndci-bot reviewed Mar 8, 2020

pkg/ipcache/cidr.go (outdated, resolved)

jrajahalme force-pushed the pr/jrajahalme/policy-selector-cache-locking-order branch from 73d10ba to a6c7362 Compare March 9, 2020 13:48

jrajahalme changed the title ~~policy: Reverse locking order between selector cache and name manager~~ policy: Keep NameManager locked during SelectorCache operations Mar 9, 2020

raybejjani approved these changes Mar 9, 2020

pkg/fqdn/name_manager.go (outdated, resolved)

jrajahalme force-pushed the pr/jrajahalme/policy-selector-cache-locking-order branch from ad77a6d to 8d67100 Compare March 9, 2020 16:38

joestringer reviewed Mar 9, 2020

pkg/policy/selectorcache_test.go (outdated, resolved)

pkg/fqdn/name_manager.go (outdated, resolved)

joestringer mentioned this pull request Mar 9, 2020

fqdn: Correct race when the DNS proxy updates FQDN selectors #10489

Closed

joestringer added release-note/bug (This PR fixes an issue in a previous release of Cilium.) and removed release-note/misc (This PR makes changes that have no direct user impact.) labels Mar 9, 2020

jrajahalme force-pushed the pr/jrajahalme/policy-selector-cache-locking-order branch from 8d67100 to 5bf90eb Compare March 9, 2020 23:45

jrajahalme requested a review from a team as a code owner March 9, 2020 23:45

houndci-bot reviewed Mar 9, 2020

joestringer approved these changes Mar 9, 2020

ianvernon approved these changes Mar 10, 2020

jrajahalme mentioned this pull request Mar 10, 2020

v1.7 backports 2020-03-10 #10530

Merged

jrajahalme added backport-pending/1.7 and removed needs-backport/1.7 labels Mar 10, 2020

jrajahalme mentioned this pull request Mar 10, 2020

v1.6 backports 2020-03-10 #10532

Merged

jrajahalme added backport-pending/1.6 and removed needs-backport/1.6 labels Mar 10, 2020

joestringer merged commit 0a07c9b into master Mar 10, 2020

joestringer deleted the pr/jrajahalme/policy-selector-cache-locking-order branch March 10, 2020 15:32

joestringer added backport-done/1.7 and removed backport-pending/1.7 labels Mar 10, 2020

jrajahalme mentioned this pull request Jun 9, 2020

DNS egress defaults-closed, but other egress default-open #11767

Closed
