### Bug Description
Hello.
I have three primary clusters installed following this guide: https://istio.io/latest/docs/setup/install/multicluster/multi-primary/
Everything works as expected, except for one service (CouchDB): whenever it restarts, other workloads can't connect to it because of a TLS error.
Let's start from the beginning. The problem occurs in all clusters, but I'll demonstrate it in the first one.
The Service:
```shell
➜ ~ k get svc -n kazoo-db --context first | grep "storage-db-svc"
storage-db-svc   ClusterIP   10.100.26.163   <none>   5984/TCP,5986/TCP   3d16h
```
Endpoints as seen from another workload:
```shell
➜ ~ istioctl pc endpoint --context first crossbar-844cd68fd9-g8q9c | grep -Ei "5984.*storage-db-svc"
10.1.9.200:5984       HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
10.2.1.109:5984       HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
192.168.58.110:5984   HEALTHY   FAILED   outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
```
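To spot unhealthy entries quickly across a larger endpoint table, the same output can be filtered on the status column. A small sketch, using the rows above as inline sample data (in practice one would pipe the live `istioctl pc endpoint` output in instead):

```shell
# Print the address of every endpoint whose status column reads FAILED.
# Sample rows pasted inline; pipe the real command output in instead, e.g.:
#   istioctl pc endpoint --context first crossbar-844cd68fd9-g8q9c | awk '$3 == "FAILED" {print $1}'
awk '$3 == "FAILED" {print $1}' <<'EOF'
10.1.9.200:5984 HEALTHY OK outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
10.2.1.109:5984 HEALTHY OK outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
192.168.58.110:5984 HEALTHY FAILED outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
EOF
```

This prints only `192.168.58.110:5984`, the endpoint at issue here.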
The error in the Envoy log of the crossbar pod:
```
[2021-12-10T12:36:02.063Z] "GET / HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure,TLS_error:_268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED} - "TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED" 0 195 45 - "-" "hackney/1.6.4" "9b3d6495-fa43-4054-8d0c-c6f51308b288" "storage-db-svc.kazoo-db:5984" "192.168.58.110:5984" outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local - 10.100.26.163:5984 192.168.46.122:43165 - default
```
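As an aside, OpenSSL usually documents its reason codes in hex, while the log prints the decimal form; the conversion is trivial if you want to look the code up:

```shell
# 268435581 is the decimal form of the OpenSSL error code in the log above;
# the log itself names it: SSL routines / CERTIFICATE_VERIFY_FAILED.
printf '0x%X\n' 268435581
```

This prints `0x1000007D`.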
So it looks like a bad certificate, right? But why? Other workloads communicate across clusters just fine. Moreover, 192.168.58.110:5984 is in the same cluster as crossbar-844cd68fd9-g8q9c, so this traffic is not even crossing clusters.
And I have the following DR:
│ spec:
│ host: storage-db-svc.kazoo-db.svc.cluster.local
│ trafficPolicy:
│ loadBalancer:
│ localityLbSetting:
│ enabled: true
│ failoverPriority:
│ - topology.istio.io/network
│ - topology.kubernetes.io/region
│ - topology.kubernetes.io/zone
│ - topology.istio.io/subzone
│ outlierDetection:
│ baseEjectionTime: 30s
│ consecutive5xxErrors: 1
│ interval: 15s
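With `consecutive5xxErrors: 1`, a single 5xx is enough to eject an endpoint, so it can be worth checking whether outlier detection is currently ejecting the failing endpoint. Envoy exposes this under `cluster.<name>.outlier_detection.*` stats on the sidecar admin endpoint (`localhost:15000/stats` inside the istio-proxy container). A sketch of the grep, over sample stat lines (the values below are made up for illustration):

```shell
# Look for active outlier-detection ejections on the storage-db-svc cluster.
# In practice, pull live stats from the sidecar admin port instead of this
# inline sample, e.g.:
#   kubectl exec <pod> -c istio-proxy -- curl -s localhost:15000/stats | grep ...
grep 'outlier_detection.ejections_active' <<'EOF'
cluster.outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local.outlier_detection.ejections_active: 0
cluster.outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local.outlier_detection.ejections_enforced_total: 3
EOF
```

A nonzero `ejections_active` would mean the endpoint is ejected by outlier detection rather than failing the TLS handshake per request.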
So to fix it I have the following options:
- Remove any other service from the first cluster, and storage-db-svc works perfectly from that point on
- Disable mTLS for storage-db-svc
- Delete any pod in any cluster in the mesh, and storage-db-svc works
- Add any other service to the mesh, and storage-db-svc works
- and so on... any change triggers storage-db-svc back into working properly
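All of these workarounds seem to share one effect: they cause istiod to push updated config to the proxies. A less destructive trigger (an assumption on my part, not a verified fix) would be creating and deleting a throwaway Service; a sketch:

```shell
# Hypothetical minimal push trigger: a throwaway Service that can be applied
# and then deleted without touching real workloads. The apply/delete lines
# are commented out since they need cluster access.
cat > /tmp/push-trigger-svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: push-trigger
  namespace: default
spec:
  ports:
  - name: http
    port: 80
EOF
# kubectl apply -f /tmp/push-trigger-svc.yaml --context first
# kubectl delete -f /tmp/push-trigger-svc.yaml --context first
grep -c 'push-trigger' /tmp/push-trigger-svc.yaml
```

Of course, this only papers over the problem; the question below still stands.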
Let's remove another, unrelated service from the first cluster:
```shell
➜ ~ istioctl pc endpoint --context first crossbar-844cd68fd9-g8q9c | grep -Ei "5984.*storage-db-svc"
10.1.9.200:5984       HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
10.2.1.109:5984       HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
192.168.58.110:5984   HEALTHY   FAILED   outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local

➜ ~ k delete svc -n smb-cluster1 --context first web-mgmt
service "web-mgmt" deleted

➜ ~ istioctl pc endpoint --context first crossbar-844cd68fd9-g8q9c | grep -Ei "5984.*storage-db-svc"
10.1.9.200:5984       HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
10.2.1.109:5984       HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
192.168.58.110:5984   HEALTHY   OK       outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local
```
So, as you can see, removing a service completely unrelated to any of those pods/endpoints fixed it. The crossbar pod now sends requests to the closest storage endpoint, as expected:
```
[2021-12-10T12:47:57.646Z] "GET /accounts/_design/accounts/_view/listing_by_descendants?endkey=%5b%227f466bd0c4bd7dffb5914222e0cd0987%22%2c%7b%7d%5d&limit=51&startkey=%5b%227f466bd0c4bd7dffb5914222e0cd0987%22%2c%22%22%5d HTTP/1.1" 200 - via_upstream - "-" 0 302 141 140 "-" "hackney/1.6.4" "d9c1f39f-2c33-4780-8934-a30a287d2a09" "storage-db-svc.kazoo-db:5984" "192.168.58.110:5984" outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local 192.168.46.122:48840 10.100.26.163:5984 192.168.46.122:55931 - default
[2021-12-10T12:47:58.938Z] "GET / HTTP/1.1" 200 - via_upstream - "-" 0 182 18 18 "-" "hackney/1.6.4" "f7dbd329-7ccd-4aea-a069-6a407eb1d4b0" "storage-db-svc.kazoo-db:5984" "192.168.58.110:5984" outbound|5984||storage-db-svc.kazoo-db.svc.cluster.local 192.168.46.122:48910 10.100.26.163:5984 192.168.46.122:34079 - default
```
Could you please explain why it behaves this way? What have I missed?
### Version
```shell
➜ ~ istioctl version
client version: 1.12.0
control plane version: 1.12.0
data plane version: 1.12.0 (15 proxies)

➜ ~ k version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:08:39Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1

➜ ~ h version
version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.17.3"}
```
### Additional Information
_No response_