Skip to content

Conversation

chrisohaver
Copy link
Member

1. Why is this pull request needed and what does it do?

ToEndpoints() conversion was filtering out Endpoint deletion tombstones (where the object in the event is not an *api.Endpoint, but rather a cache.DeletedFinalStateUnknown). The tombstones are essentially indicators that the Endpoint was deleted, without a containing a full copy of the Endpoints object - just a key value. ToEndpoints() was returning a nil *object.Endpoints when these were encountered, which caused failures downstream when deleting from the index.

This fix is allows ToEndpoints() to pass the tombstones through, so they can be deleted properly from the clientState.

2. Which issues (if any) are related?

Fixes #3860

3. Which documentation changes (if any) need to be made?

none

4. Does this introduce a backward incompatible change or deprecation?

no

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@miekg
Copy link
Member

miekg commented May 13, 2020

thanks @chrisohaver . Was this knowable at the time when we did the refactoring to make the memory go down?
I think this warrants a copy of your commit message into the source code to warn future devs - possible pointing to code and or documentation this is a thing

@chrisohaver
Copy link
Member Author

Looks like this may have been an issue since CoreDNS 1.2.3, when API object trimming was introduced. I would not recommend reverting back to a version before 1.2.3 to avoid this bug.

This bug should be relatively uncommon to encounter, however some things that could increase occurrence:

  • unstable network resulting in extended periods of disconnect between CoreDNS and k8s API.
  • a short etcd compaction interval in the k8s api server

@chrisohaver
Copy link
Member Author

Was this knowable at the time when we did the refactoring to make the memory go down?

Theoretically? Probably.
Reasonably? Not sure. client-go is complex.

I think this warrants a copy of your commit message into the source code

Makes sense. I'll add it in ToEndpoints().

@@ -44,10 +44,10 @@ type EndpointPort struct {
func EndpointsKey(name, namespace string) string { return name + "." + namespace }

// ToEndpoints converts an api.Endpoints to a *Endpoints.
func ToEndpoints(obj interface{}) (*api.Endpoints, *Endpoints) {
func ToEndpoints(obj interface{}) (*api.Endpoints, interface{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of a weird change. I think it would be more readable if we changed the signature to func ToEndpoints(*api.Endpoints) (*Endpoints), do the casting in the caller, and just not call it if obj is not a *api.Endpoints.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming as I believe ToEndpoints is only called once anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - i was just making this change. :)

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@codecov-io
Copy link

codecov-io commented May 13, 2020

Codecov Report

Merging #3887 into master will decrease coverage by 0.03%.
The diff coverage is 31.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3887      +/-   ##
==========================================
- Coverage   56.78%   56.75%   -0.04%     
==========================================
  Files         222      222              
  Lines       11305    11316      +11     
==========================================
+ Hits         6420     6422       +2     
- Misses       4391     4398       +7     
- Partials      494      496       +2     
Impacted Files Coverage Δ
plugin/kubernetes/controller.go 46.64% <31.25%> (-0.40%) ⬇️
plugin/file/reload.go 69.44% <0.00%> (-5.56%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f534c3f...94cdf8a. Read the comment docs.

return nil, nil
}

func ToEndpoints(end *api.Endpoints) (*api.Endpoints, *Endpoints) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't get why we pass in end and then just return it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, why do we need to return it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, why do we need to return it?

You're right. I should do more refactoring here...

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

chrisohaver commented May 13, 2020

I've further jostled the code around to reduce verbosity and scope of changes.
... need to retest it to verify things still work. (checked)

@chrisohaver chrisohaver marked this pull request as draft May 13, 2020 22:01
@chrisohaver chrisohaver marked this pull request as ready for review May 13, 2020 22:53
@zouyee
Copy link
Member

zouyee commented May 14, 2020

@chrisohaver
Copy link
Member Author

Many controllers have examples of handling this scenario

Yes, I have essentially based this change on such examples.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver
Copy link
Member Author

The latest iteration moves the delta object inspection to inside the switch cases. I think it reads better. Note that now it sends the (possibly outdated) endpoints object from the tombstone to store.Delete() instead of passing in the tombstone itself (as is done in client-go's informer processor, from where this was originally copied), but this is ok , because Delete only uses immutable key fields.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
if err := clientState.Delete(tombstone); err != nil {
return err
}
apiEndpoints, ok = tombstone.Obj.(*api.Endpoints)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this, why do we need this part if we already deleted it with the tombstone?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It hasn't been deleted yet at this point. It gets deleted later with clientState.Delete(obj). If we don't call clientState.Delete, it will remain in the cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what does clientState.Delete(tombstone) do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arg! - I left that in by accident. Thanks for catching that. I'll remove it and run the test again.

@chrisohaver
Copy link
Member Author

Not sure why I have not seen this for Service objects in my tests - perhaps the API doesn't send tombstones for Service objects, just the endpoints.

I have not tested Pods, but I suspect we'll see this issue there too. But I want to get this PR squared away in a state that is acceptable before replicating it to Pods and Services.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@johnbelamaric
Copy link
Member

Ok, that looks better!

@johnbelamaric johnbelamaric merged commit bb7ee50 into coredns:master May 14, 2020
@ekeih
Copy link

ekeih commented May 15, 2020

Thanks for your work on that @chrisohaver! This fix will make live easier for us, we recently saw this issue a lot on clusters running in Azure VMs.
I don't know how the release cycle/management of coredns usually works. Is there any way to get a patch release in the near future? Or do you release in fixed intervals?

sgreene570 pushed a commit to sgreene570/coredns that referenced this pull request Aug 13, 2020
* check for nil

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* handle tombstone

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move casting to caller. add comments.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more sanding

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more scrubbing

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move object unwraping to switch cases

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* oops remove debug

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* remove cruft

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
sgreene570 pushed a commit to sgreene570/coredns that referenced this pull request Aug 13, 2020
* check for nil

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* handle tombstone

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move casting to caller. add comments.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more sanding

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more scrubbing

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move object unwraping to switch cases

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* oops remove debug

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* remove cruft

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
openshift-merge-robot added a commit to openshift/coredns that referenced this pull request Aug 17, 2020
Bug 1868315: Back port Handle endpoint tombstones (upstream coredns#3887)
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/coredns that referenced this pull request Aug 17, 2020
* check for nil

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* handle tombstone

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move casting to caller. add comments.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more sanding

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more scrubbing

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move object unwraping to switch cases

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* oops remove debug

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* remove cruft

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/coredns that referenced this pull request Aug 17, 2020
* check for nil

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* handle tombstone

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move casting to caller. add comments.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more sanding

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more scrubbing

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move object unwraping to switch cases

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* oops remove debug

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* remove cruft

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
openshift-merge-robot added a commit to openshift/coredns that referenced this pull request Aug 20, 2020
…-to-release-4.5

[release-4.5] Bug 1869309: Back port Handle endpoint tombstones (upstream coredns#3887)
openshift-merge-robot added a commit to openshift/coredns that referenced this pull request Sep 12, 2020
…-to-release-4.4

[release-4.4] Bug 1869310: Back port Handle endpoint tombstones (upstream coredns#3887)
sgreene570 pushed a commit to sgreene570/coredns that referenced this pull request Sep 21, 2020
* check for nil

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* handle tombstone

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move casting to caller. add comments.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more sanding

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more scrubbing

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move object unwraping to switch cases

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* oops remove debug

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* remove cruft

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
nyodas pushed a commit to DataDog/coredns that referenced this pull request Oct 26, 2020
* check for nil

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* handle tombstone

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move casting to caller. add comments.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more sanding

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* more scrubbing

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* move object unwraping to switch cases

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* oops remove debug

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

* remove cruft

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver chrisohaver deleted the endpoint-tombstone branch January 9, 2021 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CoreDNS pods panic with no probe failures
6 participants