plugin/kubernetes: CoreDNS container cannot resolve newly created k8s services after running for 2 months #3587

**What happened**:
I have 5 CoreDNS pods that have been running for about 2 months. I found that 2 of them cannot look up services newly created in k8s.
```
~/allen# kubectl -n core-dns-wlcb-prod get svc,po -o wide
NAME           TYPE        CLUSTER-IP   EXTERNAL-IP     PORT(S)                   AGE       SELECTOR
svc/core-dns   ClusterIP   172.29.8.8   192.168.2.184   49153/TCP,53/UDP,53/TCP   88d       name=core-dns

NAME                                 READY     STATUS    RESTARTS   AGE       IP              NODE
po/core-dns-bb32dbc62e244ff-6zzb5    1/1       Running   1          67d       172.16.84.14    192.168.2.41
po/core-dns-bb32dbc62e244ff-89rdx    1/1       Running   1          67d       172.16.117.10   192.168.2.203
po/core-dns-bb32dbc62e244ff-pbfqm    1/1       Running   1          67d       172.16.84.15    192.168.2.41
po/core-dns-bb32dbc62e244ff-rfqbt    1/1       Running   1          67d       172.16.202.12   192.168.2.79
po/core-dns-bb32dbc62e244ff-vx5dx    1/1       Running   1          67d       172.16.83.20    192.168.2.184
```

The problem pod (172.16.83.20) returns NXDOMAIN:
```
root@dns-check-e5b984de89ac472-dd98j:/# dig @172.16.83.20 core.titan-dev-open.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.16.83.20 core.titan-dev-open.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 23894
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;core.titan-dev-open.svc.cluster.local. IN A

;; AUTHORITY SECTION:
cluster.local.          5       IN      SOA     ns.dns.cluster.local. hostmaster.cluster.local. 1578707575 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 172.16.83.20#53(172.16.83.20)
;; WHEN: Sat Jan 11 15:38:37 CST 2020
;; MSG SIZE  rcvd: 159
```

A healthy pod (172.16.202.12) resolves the same name correctly:
```
root@dns-check-e5b984de89ac472-dd98j:/# dig @172.16.202.12 core.titan-dev-open.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.16.202.12 core.titan-dev-open.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55436
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;core.titan-dev-open.svc.cluster.local. IN A

;; ANSWER SECTION:
core.titan-dev-open.svc.cluster.local. 5 IN A   172.29.82.96

;; Query time: 1 msec
;; SERVER: 172.16.202.12#53(172.16.202.12)
;; WHEN: Sat Jan 11 15:38:57 CST 2020
;; MSG SIZE  rcvd: 119
```

The bad pod (172.16.83.20) does not fail for every query: it can still resolve kubernetes.default and other services that were created earlier. However, it cannot resolve newly created k8s services, so something seems to be wrong with the syncing mechanism.
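
To find out which pods have stopped syncing, the two manual `dig` checks above can be repeated against every CoreDNS pod IP. The following is a minimal sketch in Go using only the standard library; it is not part of CoreDNS. The names `queryServer` and `outliers` are hypothetical, and it assumes that the answer given by the majority of pods is the correct one. A tie is broken arbitrarily by string order.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"sort"
	"strings"
	"time"
)

// queryServer resolves name against one specific DNS server (host:port) and
// returns a comparable string: sorted addresses, "not found", or "error: ...".
func queryServer(ctx context.Context, server, name string) string {
	r := &net.Resolver{
		PreferGo: true, // use Go's resolver so the custom Dial below is honored
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			// Ignore the address from resolv.conf and dial the given server.
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
	addrs, err := r.LookupHost(ctx, name)
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return "not found" // covers NXDOMAIN as seen on the bad pod
	}
	if err != nil {
		return "error: " + err.Error()
	}
	sort.Strings(addrs)
	return strings.Join(addrs, ",")
}

// outliers returns the servers whose answer differs from the most common one.
func outliers(answers map[string]string) []string {
	counts := map[string]int{}
	for _, a := range answers {
		counts[a]++
	}
	best, bestN := "", 0
	for a, n := range counts {
		// Ties are broken by string order only to keep the result deterministic.
		if n > bestN || (n == bestN && a < best) {
			best, bestN = a, n
		}
	}
	var out []string
	for server, a := range answers {
		if a != best {
			out = append(out, server)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: dnscheck <fqdn> <server-ip> [<server-ip>...]")
		os.Exit(2)
	}
	name := os.Args[1]
	if !strings.HasSuffix(name, ".") {
		name += "." // fully qualify so resolv.conf search domains are not applied
	}
	answers := map[string]string{}
	for _, ip := range os.Args[2:] {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		answers[ip] = queryServer(ctx, net.JoinHostPort(ip, "53"), name)
		cancel()
		fmt.Printf("%-16s %s\n", ip, answers[ip])
	}
	if bad := outliers(answers); len(bad) > 0 {
		fmt.Println("servers disagreeing with the majority:", strings.Join(bad, ", "))
		os.Exit(1)
	}
}
```

Saved under a hypothetical name such as `dnscheck.go`, it could be run from a pod inside the cluster as `go run dnscheck.go core.titan-dev-open.svc.cluster.local 172.16.84.14 172.16.117.10 172.16.84.15 172.16.202.12 172.16.83.20`. A non-zero exit status means at least one pod returned a different answer from the rest.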

I get plenty of domain resolution logs, but no helpful logs related to k8s.


**What you expected to happen**:
1. CoreDNS works normally and syncs all changes from k8s.
2. I would also like to know how to get CoreDNS logs about k8s resource syncing.

**How to reproduce it (as minimally and precisely as possible)**:
I think restarting the pod would resolve the problem, but I have not yet found steps to reproduce it.

**Anything else we need to know?**:

**Environment**:

- the version of CoreDNS: 1.6.4
- Corefile:
```
.:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          endpoint http://xxxx.k8s.com:8080
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 10.1.1.1 {
          max_fails 2
          expire 10s
          policy sequential
          health_check 1s          
        }
        log {
          class denial
          class error
        }
        cache 30
        loop
        reload
        loadbalance
}
```
- logs, if applicable: 
- OS (e.g: `cat /etc/os-release`): Ubuntu 16.04
- Others:

