Description
What happened:
In my k8s cluster, I am using the kubernetes plugin with pods set to verified. CoreDNS should resolve pod names like 1-2-3-4.ns.pod.cluster.local. only for pods that exist in the given namespace.
I found that from time to time, CoreDNS resolves nonexistent (deleted) pods. I am getting successful responses for IPs which do not belong to any pod at the time of the query.
What you expected to happen:
CoreDNS should respond with NXDOMAIN when querying for an IP that does not belong to any pod.
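For context, a lookup like the following is what I expect to return NXDOMAIN (a minimal Go sketch; the cluster DNS service IP 10.96.0.10 and the queried pod IP/namespace are placeholders for illustration):

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// 10.96.0.10 is a placeholder for the cluster DNS service IP
			return d.DialContext(ctx, network, "10.96.0.10:53")
		},
	}
	// With pods verified, this should fail with NXDOMAIN unless a pod
	// with IP 1.2.3.4 currently exists in namespace "ns".
	addrs, err := r.LookupHost(context.Background(), "1-2-3-4.ns.pod.cluster.local")
	fmt.Println(addrs, err)
}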
How to reproduce it (as minimally and precisely as possible):
Unfortunately, this is not easy to reproduce. I can only reproduce it in a production environment, and even there it does not happen every time.
Anything else we need to know?:
I spent some time debugging this and I think I found the cause. In informer.go, there is an error check:
obj, err := convert(d.Object.(meta.Object))
if err != nil {
	return err
}
which escapes the loop in case of an error. One of the errors returned by convert() is errPodTerminating, which is caused by pods having deletionTimestamp set. It is normal for a pod to have deletionTimestamp set in a cache.Updated delta before the cache.Deleted delta arrives.
The problem is that such a pod makes the code exit the loop early, so the following deltas are not processed. Usually there is only one delta and everything works, but from time to time the informer provides more deltas and we drop events.
In my case, a cache.Updated delta is followed by a cache.Deleted delta for the same pod. When this happens, CoreDNS never removes the pod from its internal state.
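To make the failure mode concrete, here is a self-contained toy reproduction of the control flow (not the real CoreDNS code; all types and names below are simplified stand-ins for the client-go delta machinery):

package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the informer's delta types.
type deltaType string

const (
	updated deltaType = "Updated"
	deleted deltaType = "Deleted"
)

type delta struct {
	Type        deltaType
	Pod         string
	Terminating bool // stands in for deletionTimestamp being set
}

var errPodTerminating = errors.New("pod terminating")

// convert mimics the real convert(): it rejects terminating pods.
func convert(d delta) (string, error) {
	if d.Terminating {
		return "", errPodTerminating
	}
	return d.Pod, nil
}

// processDeltas has the buggy shape: returning on error abandons the
// rest of the batch, so a later Deleted delta is silently dropped.
func processDeltas(deltas []delta, state map[string]bool) error {
	for _, d := range deltas {
		switch d.Type {
		case updated:
			obj, err := convert(d)
			if err != nil {
				return err // the Deleted delta below is never reached
			}
			state[obj] = true
		case deleted:
			delete(state, d.Pod)
		}
	}
	return nil
}

func main() {
	state := map[string]bool{"pod-a": true}
	// An Updated delta with deletionTimestamp set, followed by the
	// Deleted delta for the same pod, in one batch.
	batch := []delta{{updated, "pod-a", true}, {deleted, "pod-a", false}}
	_ = processDeltas(batch, state)
	fmt.Println(state) // prints map[pod-a:true]; the pod is never removed
}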
A simple patch resolves the issue and the server works as expected:
obj, err := convert(d.Object.(meta.Object))
if err != nil {
	if err == errPodTerminating {
		// skip the terminating pod, but keep processing the remaining deltas
		continue
	}
	return err
}
Note that a similar check is present in the cache.Deleted case below, so pods with deletionTimestamp set are removed correctly as long as they are not preceded by a delta that returns an error.
There are more ways of exiting the aforementioned loop with an error, although I have never encountered them. I am not sure we should ever return an error at all. From my observation, returning an error just drops deltas silently and nothing else happens: watches are not restarted and there are no error messages. Maybe in case of any error we should just continue and process all deltas?
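For illustration, that alternative could look something like this (a sketch only; the logging call is hypothetical and not part of the current code):

obj, err := convert(d.Object.(meta.Object))
if err != nil {
	// hypothetical: log instead of silently aborting the batch,
	// then keep processing the remaining deltas
	log.Warningf("dropping delta for %v: %v", d.Object, err)
	continue
}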
I can provide a PR with the above patch.
Environment:
- the version of CoreDNS: 1.11.4 (the same code present in master)
- Corefile: (relevant fragment)
cluster.local:53 {
	kubernetes {
		pods verified
	}
	errors
	loadbalance round_robin
}
- logs, if applicable: CoreDNS emits no logs relevant to this issue
- OS (e.g: cat /etc/os-release): docker image based on Alma Linux 9; the OS is not relevant