### Bug Description
We came across an unexpected bug last week in our production cluster. We are still trying to replicate it and figure out the root cause.
I would be happy to list the relevant configuration as directed, but first let me describe the series of events that happened.
Our istiod setup is installed in the `istio-system` namespace in Kubernetes using `istioctl`.
We have an application-specific namespace, let us call it `demo`.
Our `ingress-gateway` is set up in the `demo` namespace, where we run a micro-service with `istio-proxy` injected.
The `ingress-gateway` was set with a CPU request of 10 and a limit of 12 on a 16-core cluster.
`istio-proxy` is set up using pod annotations, with a CPU request of 2 and a limit of 4 via the Istio annotations. The concurrency was set using `proxy.istio.io/config: "concurrency: 0"`.
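For reference, this is roughly the shape of the annotations on the workload pods; the pod name, labels and image below are placeholders, and the manifest is a sketch rather than a copy of our real configuration:

```yaml
# Illustrative sketch only -- not our actual manifest.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                                 # placeholder workload pod
  namespace: demo
  labels:
    app: demo-app
  annotations:
    proxy.istio.io/config: "concurrency: 0"      # 0 = one Envoy worker thread per core
    sidecar.istio.io/proxyCPU: "2"               # istio-proxy CPU request
    sidecar.istio.io/proxyCPULimit: "4"          # istio-proxy CPU limit
spec:
  containers:
    - name: app
      image: demo-app:latest                     # placeholder image
```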
Now we wanted to test whether increasing the concurrency of the Istio proxy would give us some advantage. To set this up, we took a dump of one of the running pods and brought up another pod (not using the deployment), changing its configuration to use `proxy.istio.io/config: "concurrency: 4"`.
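Roughly, the standalone pod was created along these lines (the names below are placeholders and the exact fields we cleaned up are from memory):

```console
# Dump one of the running pods to a file
$ kubectl get pod <source-pod> -n demo -o yaml > standalone-pod.yaml

# Edit the dump: give the pod a new name, drop ownerReferences / resourceVersion /
# uid / status, and set the annotation proxy.istio.io/config: "concurrency: 4"

# Create the standalone pod directly, outside the Deployment / ReplicaSet
$ kubectl apply -f standalone-pod.yaml
```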
This was supposed to be the control bucket of our experiment. But to our surprise, another `ingress-gateway` pod that came up because of auto-scaling after we brought up this pod also copied this configuration and came up with `concurrency: 4`, which badly affected our traffic and auto-scaling logic.
This is because when all the `ingress-gateway` pods restarted, they all came up with `concurrency: 4` on a 16-core machine. The HPA was set to target 80% CPU, but on a 16-core machine a concurrency of 4 means usage of only about 25% of the CPU, so the `ingress-gateway` pods stopped auto-scaling, impacting our production traffic.
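To make the auto-scaling math concrete, the gateway HPA is roughly of the following shape (names, replica counts and the `autoscaling/v2beta2` API version available on Kubernetes 1.21 are illustrative, not taken from our cluster). With `concurrency: 4` the proxy tops out at about 4 of the node's 16 cores, so measured CPU utilization stays far below the 80% target and no scale-up is ever triggered:

```yaml
# Illustrative sketch of the gateway HPA -- not the exact manifest from our cluster.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: istio-ingressgateway        # placeholder name
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  minReplicas: 2                    # illustrative
  maxReplicas: 10                   # illustrative
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80    # never reached once concurrency is capped at 4
```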
As part of our investigation, we also asked ourselves a few questions, which I am listing here along with the answers.
Q. Did we change the concurrency in some previous configuration which led to this?
A. No, we didn't change any configuration in the `ingress-gateway` setup.
As we can see from the above image, a pod came up as a result of auto-scaling at 2:35 AM (the start of the graph) and started using all 10 cores, as expected.
The change we made (bringing up the standalone pod) was done at around 2:51-2:55 AM; it is not shown in the graph, but we confirmed the timing from our logs.
Thereafter we can see that the pod that came up at around 3:30 AM was using only 4 cores.
Q. How can you say that you didn't change any configuration in the gateway setup during this time period?
A. Because the deployment id of the pod that came up at 3:30 AM is the same as that of the pods that were running before 3:30 AM. If we had made configuration changes before this time, the deployment id `6589f7f755` would have changed.
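This can be cross-checked from the `pod-template-hash` label that the Deployment's ReplicaSet stamps onto every pod it creates. A quick way to see it (the label selector is illustrative; our gateway pods may carry different labels):

```console
# Pods created before and after 3:30 AM show the same pod-template-hash (6589f7f755),
# i.e. the pod template of the Deployment was not changed.
$ kubectl get pods -n demo -l app=istio-ingressgateway -L pod-template-hash
```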
Q. How can you say this was not a manually spawned pod?
A. As the image below shows, the old pods with the same deployment id are using 10 cores, while the new pods that come up after 3:30 AM come up with concurrency 4.
We did actually make a change to our setup at around 15:50, when all the pods were restarted, but that change was only to remove the request limits from the ingress-gateway pods. This restarted the old pods, and the new pods that came up were using 4 cores. Since we had not deliberately made any concurrency change in the gateway, we were clueless about what had suddenly happened.
Q. How serious is this issue?
A. To me this seems like a critical issue, if it is actually a bug in the Istio setup. One of the purposes of Kubernetes / containers is that we get what we expect. If things had changed across deployment ids it would still have made some sense, but configuration changes under the same deployment id that nobody made deliberately leave the system fragile and unpredictable.
Another thing to note is that even if this is simply the default way Istio works (for example, by falling back to defaults), the behaviour of one staging / test pod can then affect production traffic at large scale, which is again a risky bet.
Q. How can you ensure that the configuration did actually change? Maybe the new pods are just using 4 cores because of how traffic is being redirected to them?
A. Before posting this, we did an analysis of our cloud provider logs ourselves. It may not be possible for me to include the full information here, since that would expose our organisation's full Kubernetes config, but I can post parts of it.
This is the configuration of the pod that came up at 3:30 AM:
This is a snippet of the configuration of the pod that came up at around 2:30 AM:
This is the configuration change made in the standalone pod at around 2:51 AM:
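Besides the cloud provider logs, the concurrency a gateway pod actually started with can be read back from the Envoy admin interface through `pilot-agent`, which is another way to confirm that the new pods really came up with concurrency 4 rather than just receiving less traffic (the pod name below is a placeholder):

```console
# /server_info reports command_line_options.concurrency, i.e. the number of
# worker threads Envoy was started with.
$ kubectl exec -n demo <gateway-pod> -c istio-proxy -- \
    pilot-agent request GET server_info | grep -i concurrency
```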
I hope all the above configuration, images and information will help the Istio team look into the issue. I would be happy to provide more information if possible.
I tried to replicate the issue locally with a test setup, but I couldn't. There are many environment differences: the app is different and the Istio version is different (since I am on an M1 Mac I am running a 1.16 alpha build locally, while production runs 1.14.3). I am still trying to reproduce it and will post an update when I have one.
### Version
$ kubectl version --short
Client Version: v1.21.1
Server Version: v1.21.12-gke.2200
Istio Version: 1.14.3
### Additional Information
_No response_
### Affected product area
- [ ] Docs
- [ ] Installation
- [X] Networking
- [X] Performance and Scalability
- [ ] Extensions and Telemetry
- [ ] Security
- [X] Test and Release
- [ ] User Experience
- [ ] Developer Infrastructure
- [ ] Upgrade
- [ ] Multi Cluster
- [ ] Virtual Machine
- [ ] Control Plane Revisions
### Is this the right place to submit this?
- [X] This is not a security vulnerability
- [X] This is not a question about how to use Istio