### Bug Description
We came across an unexpected bug last week in our production cluster. We are still trying to replicate it and figure out the root cause.
I would be happy to list the relevant configuration as directed, but first let me describe the series of events that happened.
Our istiod setup is installed in the `istio-system` namespace in Kubernetes using `istioctl`.
We have an application-specific namespace, let us call it `demo`.
Our `ingress-gateway` is set up in the `demo` namespace, where we run a micro-service with `istio-proxy` injected.
The `ingress-gateway` was set with a CPU request of 10 and a limit of 12 on a 16-core cluster.
`istio-proxy` is set up using pod annotations, with a CPU request of 2 and a limit of 4 via the Istio annotations. The concurrency was set using `proxy.istio.io/config: "concurrency: 0"`.
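For reference, this is roughly the shape of the annotations on the workload pods; the pod name, labels and image below are placeholders, and the manifest is a sketch rather than a copy of our real configuration:

```yaml
# Illustrative sketch only -- not our actual manifest.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                                 # placeholder workload pod
  namespace: demo
  labels:
    app: demo-app
  annotations:
    proxy.istio.io/config: "concurrency: 0"      # 0 = one Envoy worker thread per core
    sidecar.istio.io/proxyCPU: "2"               # istio-proxy CPU request
    sidecar.istio.io/proxyCPULimit: "4"          # istio-proxy CPU limit
spec:
  containers:
    - name: app
      image: demo-app:latest                     # placeholder image
```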
Now we wanted to test whether increasing the concurrency of the Istio proxy would give us some advantage. To set this up, we took a dump of one of the running pods and brought up another pod (not using the deployment), changing its configuration to use `proxy.istio.io/config: "concurrency: 4"`.
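Roughly, the standalone pod was created along these lines (the names below are placeholders and the exact fields we cleaned up are from memory):

```console
# Dump one of the running pods to a file
$ kubectl get pod <source-pod> -n demo -o yaml > standalone-pod.yaml

# Edit the dump: give the pod a new name, drop ownerReferences / resourceVersion /
# uid / status, and set the annotation proxy.istio.io/config: "concurrency: 4"

# Create the standalone pod directly, outside the Deployment / ReplicaSet
$ kubectl apply -f standalone-pod.yaml
```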
This was supposed to be the control bucket of our experiment. But to our surprise, another `ingress-gateway` pod that came up because of auto-scaling after we brought up this pod also copied this configuration and came up with `concurrency: 4`, which badly affected our traffic and auto-scaling logic.
This is because when all the `ingress-gateway` pods restarted, they all came up with `concurrency: 4` on a 16-core machine. The HPA was set to target 80% CPU, but on a 16-core machine a concurrency of 4 means usage of only about 25% of the CPU, so the `ingress-gateway` pods stopped auto-scaling, impacting our production traffic.
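To make the auto-scaling math concrete, the gateway HPA is roughly of the following shape (names, replica counts and the `autoscaling/v2beta2` API version available on Kubernetes 1.21 are illustrative, not taken from our cluster). With `concurrency: 4` the proxy tops out at about 4 of the node's 16 cores, so measured CPU utilization stays far below the 80% target and no scale-up is ever triggered:

```yaml
# Illustrative sketch of the gateway HPA -- not the exact manifest from our cluster.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: istio-ingressgateway        # placeholder name
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  minReplicas: 2                    # illustrative
  maxReplicas: 10                   # illustrative
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80    # never reached once concurrency is capped at 4
```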
As part of our investigation, we also asked ourselves a few questions, which I am listing here along with the answers.
Q. Did we change the concurrency in some previous configuration which led to this?
A. No, we didn't change any configuration in the `ingress-gateway` setup.
As we can see from the above image, a pod came up as a result of auto-scaling at 2:35 AM (the start of the graph) and started using all 10 cores, as expected.
The change we made (bringing up the standalone pod) was done at around 2:51-2:55 AM; it is not shown in the graph, but we confirmed the timing from our logs.
Thereafter we can see that the pod that came up at around 3:30 AM was using only 4 cores.
Q. How can you say that you didn't change any configuration in the gateway setup during this time period?
A. Because the deployment id of the pod that came up at 3:30 AM is the same as that of the pods that were running before 3:30 AM. If we had made configuration changes before this time, the deployment id `6589f7f755` would have changed.
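This can be cross-checked from the `pod-template-hash` label that the Deployment's ReplicaSet stamps onto every pod it creates. A quick way to see it (the label selector is illustrative; our gateway pods may carry different labels):

```console
# Pods created before and after 3:30 AM show the same pod-template-hash (6589f7f755),
# i.e. the pod template of the Deployment was not changed.
$ kubectl get pods -n demo -l app=istio-ingressgateway -L pod-template-hash
```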
Q. How can you say this was not a manually spawned pod?
A. As the image below shows, the old pods with the same deployment id are using 10 cores, while the new pods that come up after 3:30 AM come up with concurrency 4.
We did actually make a change to our setup at around 15:50, when all the pods were restarted, but that change was only to remove the request limits from the ingress-gateway pods. This restarted the old pods, and the new pods that came up were using 4 cores. Since we had not deliberately made any concurrency change in the gateway, we were clueless about what had suddenly happened.
Q. How serious is this issue?
A. To me this seems like a critical issue, if it is actually a bug in the Istio setup. One of the purposes of Kubernetes / containers is that we get what we expect. If things had changed across deployment ids it would still have made some sense, but configuration changes under the same deployment id that nobody made deliberately leave the system fragile and unpredictable.
Another thing to note is that even if this is simply the default way Istio works (for example, by falling back to defaults), the behaviour of one staging / test pod can then affect production traffic at large scale, which is again a risky bet.
Q. How can you ensure that the configuration did actually change? Maybe the new pods are just using 4 cores because of how traffic is being redirected to them?
A. Before posting this, we did an analysis of our cloud provider logs ourselves. It may not be possible for me to include the full information here, since that would expose our organisation's full Kubernetes config, but I can post parts of it.
This is the configuration of the pod that came up at 3:30 AM:
This is a snippet of the configuration of the pod that came up at around 2:30 AM:
This is the configuration change made in the standalone pod at around 2:51 AM:
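Besides the cloud provider logs, the concurrency a gateway pod actually started with can be read back from the Envoy admin interface through `pilot-agent`, which is another way to confirm that the new pods really came up with concurrency 4 rather than just receiving less traffic (the pod name below is a placeholder):

```console
# /server_info reports command_line_options.concurrency, i.e. the number of
# worker threads Envoy was started with.
$ kubectl exec -n demo <gateway-pod> -c istio-proxy -- \
    pilot-agent request GET server_info | grep -i concurrency
```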
I hope all the above configuration, images and information will help the Istio team look into the issue. I would be happy to provide more information if possible.
I tried to replicate the issue locally with a test setup, but I couldn't. There are many environment differences: the app is different and the Istio version is different (since I am on an M1 Mac I am running a 1.16 alpha build locally, while production runs 1.14.3). I am still trying to reproduce it and will post an update when I have one.
### Version
$ kubectl version --short
Client Version: v1.21.1
Server Version: v1.21.12-gke.2200
Istio Version: 1.14.3
### Additional Information
_No response_
### Affected product area
- [ ] Docs
- [ ] Installation
- [X] Networking
- [X] Performance and Scalability
- [ ] Extensions and Telemetry
- [ ] Security
- [X] Test and Release
- [ ] User Experience
- [ ] Developer Infrastructure
- [ ] Upgrade
- [ ] Multi Cluster
- [ ] Virtual Machine
- [ ] Control Plane Revisions
### Is this the right place to submit this?
- [X] This is not a security vulnerability
- [X] This is not a question about how to use Istio