plugin/kubernetes: coredns container cannot resolve newly created k8s services after running 2 months #3587

@njuicsgz

What happened:
I have 5 CoreDNS pods that have been running for about 2 months. I found that 2 of them cannot resolve services newly created in k8s.

~/allen# kubectl -n core-dns-wlcb-prod get svc,po -o wide
NAME           TYPE        CLUSTER-IP   EXTERNAL-IP     PORT(S)                   AGE       SELECTOR
svc/core-dns   ClusterIP   172.29.8.8   192.168.2.184   49153/TCP,53/UDP,53/TCP   88d       name=core-dns

NAME                                 READY     STATUS    RESTARTS   AGE       IP              NODE
po/core-dns-bb32dbc62e244ff-6zzb5    1/1       Running   1          67d       172.16.84.14    192.168.2.41
po/core-dns-bb32dbc62e244ff-89rdx    1/1       Running   1          67d       172.16.117.10   192.168.2.203
po/core-dns-bb32dbc62e244ff-pbfqm    1/1       Running   1          67d       172.16.84.15    192.168.2.41
po/core-dns-bb32dbc62e244ff-rfqbt    1/1       Running   1          67d       172.16.202.12   192.168.2.79
po/core-dns-bb32dbc62e244ff-vx5dx    1/1       Running   1          67d       172.16.83.20    192.168.2.184

The problem pod (172.16.83.20):

root@dns-check-e5b984de89ac472-dd98j:/# dig @172.16.83.20 core.titan-dev-open.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.16.83.20 core.titan-dev-open.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 23894
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;core.titan-dev-open.svc.cluster.local. IN A

;; AUTHORITY SECTION:
cluster.local.          5       IN      SOA     ns.dns.cluster.local. hostmaster.cluster.local. 1578707575 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 172.16.83.20#53(172.16.83.20)
;; WHEN: Sat Jan 11 15:38:37 CST 2020
;; MSG SIZE  rcvd: 159

A healthy pod (172.16.202.12) works well:

root@dns-check-e5b984de89ac472-dd98j:/# dig @172.16.202.12 core.titan-dev-open.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.16.202.12 core.titan-dev-open.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55436
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;core.titan-dev-open.svc.cluster.local. IN A

;; ANSWER SECTION:
core.titan-dev-open.svc.cluster.local. 5 IN A   172.29.82.96

;; Query time: 1 msec
;; SERVER: 172.16.202.12#53(172.16.202.12)
;; WHEN: Sat Jan 11 15:38:57 CST 2020
;; MSG SIZE  rcvd: 119

The bad pod (172.16.83.20) is not always bad: it can resolve kubernetes.default and other services created earlier. However, it cannot resolve newly created k8s services. So it seems something is wrong with the syncing mechanism.
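
The per-pod comparison above could be scripted roughly as follows. This is only a sketch under assumptions: the pod IPs are copied from the `kubectl get po -o wide` output, the helper names (`parse_status`, `query_pod`, `disagreeing_pods`) are made up for illustration, and treating the most common status as the "correct" one is a heuristic, not something CoreDNS provides.

```python
import re
import subprocess
from collections import Counter

# Pod IPs copied from the `kubectl get po -o wide` output in this issue.
POD_IPS = [
    "172.16.84.14",
    "172.16.117.10",
    "172.16.84.15",
    "172.16.202.12",
    "172.16.83.20",
]

# Matches the "status: NXDOMAIN" field in dig's ";; ->>HEADER<<-" line.
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def parse_status(dig_output):
    """Return the response status (e.g. NOERROR, NXDOMAIN) from dig output, or None."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def query_pod(pod_ip, name):
    """Run dig against a single CoreDNS pod and return the response status."""
    result = subprocess.run(
        ["dig", "@" + pod_ip, name, "+time=2", "+tries=1"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_status(result.stdout)


def disagreeing_pods(statuses):
    """Given {pod_ip: status}, return the pods whose status differs from the most common one.

    Assumption: the majority of pods are healthy, so the most common status is the right one.
    """
    if not statuses:
        return []
    majority = Counter(statuses.values()).most_common(1)[0][0]
    return sorted(ip for ip, status in statuses.items() if status != majority)


if __name__ == "__main__":
    name = "core.titan-dev-open.svc.cluster.local"
    statuses = {ip: query_pod(ip, name) for ip in POD_IPS}
    for ip, status in statuses.items():
        print(ip, status)
    print("pods that disagree with the majority:", disagreeing_pods(statuses))
```

It has to run from a pod that can reach the CoreDNS pod IPs directly and has `dig` installed, like the dns-check pod used above.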

I see many DNS query logs, but no useful logs related to k8s.

What you expected to happen:

    1. CoreDNS works normally and syncs all changes from k8s.
    2. I also want to know how I can get CoreDNS logs about k8s resource syncing.

How to reproduce it (as minimally and precisely as possible):
I think restarting the pod would resolve the problem, but I have not yet found steps to reproduce it.

Anything else we need to know?:

Environment:

  • the version of CoreDNS: 1.6.4
  • Corefile:
.:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          endpoint http://xxxx.k8s.com:8080
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 10.1.1.1 {
          max_fails 2
          expire 10s
          policy sequential
          health_check 1s
        }
        log {
          class denial
          class error
        }
        cache 30
        loop
        reload
        loadbalance
}

  • logs, if applicable:
  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • Others: