
Colocating metrics provider along with the operator causes HPA delays if not configured properly #5624

@bharathguvvala

Report

At our organization (Flipkart), we manage large Kubernetes clusters with at least 300 ScaledObjects each. Recently, when we onboarded a batch of new workloads to be managed by ScaledObjects, we saw substantial degradation of HPA functionality: scaling events were no longer happening in line with changes to the corresponding compute metrics.

A bit of background: we recently upgraded Kubernetes from 1.22.x to 1.25.x and KEDA from 2.2.x to 2.10.x, both to migrate to HPA v2 and to pick up the improvements of a newer KEDA. After this upgrade everything was fine, with the ScaledObject count per cluster at around 200+. As mentioned above, when we onboarded a batch of new ScaledObjects (50+ in count), we started seeing substantial degradation of the autoscaling pipeline, which impacted all workloads since they were not scaled out or down in time. Based on our observations, scaling generally took more than 10 minutes.

While troubleshooting and profiling several components (controller manager, Kubernetes metrics server, KEDA metrics adapter, and operator), we saw that the external metrics served by the KEDA metrics adapter were latent, with p99 latencies around ~5s. After further profiling and a code walkthrough, we realized that the metrics scraping model changed in one of the recent KEDA releases: metrics are now fetched from the operator (via a gRPC client), whereas in the earlier design they were fetched by the metrics adapter itself. We saw that in this flow, calls are made to the Kubernetes API server to update a fallback health status on the ScaledObjects. These calls were being rate limited on the client side, causing higher read latencies on the external metrics API. We realized that the reconciliation controllers and the metrics scraper use the same Kubernetes client and therefore fall under the same default rate limits. Increasing these rate limits from the defaults (20 QPS and 30 burst) resolved the issue.
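
For context, the client-side throttling comes from the QPS/Burst settings on the shared rest.Config that controller-runtime builds for the operator. The sketch below is our own illustration of where those knobs live, not actual KEDA code; recent KEDA releases also appear to expose them as operator flags (--kube-api-qps / --kube-api-burst), which is how we raised them, but please check the docs for your version.

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// controller-runtime builds one rest.Config for the operator. The manager's
	// client built from it is shared (in KEDA) by the reconcilers and the gRPC
	// metrics service, so both flows draw from one client-side rate-limit budget.
	cfg := ctrl.GetConfigOrDie()

	// Defaults we observed were 20 QPS / 30 burst; raising them relieved the
	// throttling on the external-metrics read path. The values below are
	// illustrative, not a recommendation.
	cfg.QPS = 100
	cfg.Burst = 200

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = mgr // mgr.Start(...) omitted for brevity
}
```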

While we weren't aware of this design change, which effectively merged the scraping component into the operator, we were confused as to why Kubernetes API update calls are made on the metrics read path at all. We feel a couple of changes would help KEDA users deploy and support large-scale clusters:

  1. Update the health status only for ScaledObjects that have fallback behaviour defined. In our case no fallback was defined, yet the status update calls still happened, which led to this issue (see the sketch after this list).
  2. Since the Kubernetes client used by the reconciliation controllers and the scraper component is the same, the two flows are coupled without isolation, and one flow can swarm the client and get the other rate limited (client side). Either separate the flows onto different clients, or document and highlight this fact so that users are aware and can configure the rate limits in the KEDA operator accordingly.
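
To make point 1 concrete, here is a minimal sketch of the kind of guard we have in mind, assuming the KEDA API types from github.com/kedacore/keda/v2; the function name and wiring are hypothetical, not the actual KEDA code path:

```go
package fallback

import (
	"context"

	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateFallbackHealth is a hypothetical wrapper around the status write that
// happens on the metrics read path. When no fallback is configured, the health
// bookkeeping has no consumer, so the extra PATCH to the API server (and the
// rate-limiter token it consumes) can be skipped entirely.
func updateFallbackHealth(ctx context.Context, c client.Client, so *kedav1alpha1.ScaledObject) error {
	if so.Spec.Fallback == nil {
		return nil // nothing to record; avoid the API call
	}
	patch := client.MergeFrom(so.DeepCopy())
	// ...update so.Status.Health for the trigger that was just scraped...
	return c.Status().Patch(ctx, so, patch)
}
```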

Expected Behavior

  1. Perform the health status update calls only for ScaledObjects that have fallback behaviour defined, since that status information is only used by the fallback workflow. In our case no fallback was defined, yet the status update calls still happened, which led to this issue.
  2. Since the Kubernetes client used by the reconciliation controllers and the scraper component is the same, the two flows are coupled without isolation, and one flow can swarm the client and get the other rate limited (client side). Either separate the flows onto different clients (a sketch follows this list), or document and highlight this fact so that users are aware and can configure the rate limits in the KEDA operator accordingly.
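
To make point 2 concrete, here is a minimal sketch (again, not KEDA's actual wiring) of how the metrics-serving path could get its own client, and therefore its own rate-limit budget, by copying the rest.Config:

```go
package main

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	base := ctrl.GetConfigOrDie()

	// Reconciliation keeps the default budget.
	reconcileCfg := rest.CopyConfig(base)

	// The metrics-serving path gets an independent (and larger) budget, so a
	// burst of reconciles cannot starve external-metrics reads, and vice versa.
	metricsCfg := rest.CopyConfig(base)
	metricsCfg.QPS = 100 // illustrative values only
	metricsCfg.Burst = 200

	reconcileClient, err := client.New(reconcileCfg, client.Options{})
	if err != nil {
		panic(err)
	}
	metricsClient, err := client.New(metricsCfg, client.Options{})
	if err != nil {
		panic(err)
	}
	_, _ = reconcileClient, metricsClient
}
```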

Actual Behavior

Delays across the HPA scaling pipeline; workloads were not scaled out or down in time.

Steps to Reproduce the Problem

Deploy KEDA with the default configuration and create 300+ ScaledObjects (a reproduction sketch follows).
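
For anyone trying to reproduce this, a loop along the following lines is what we mean; the namespace, names, and trigger are illustrative (any scaler other than cpu/memory goes through the external-metrics path; cron just needs no backing infrastructure), and it assumes the target Deployments already exist:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	dyn, err := dynamic.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{Group: "keda.sh", Version: "v1alpha1", Resource: "scaledobjects"}

	for i := 0; i < 300; i++ {
		so := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "keda.sh/v1alpha1",
			"kind":       "ScaledObject",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("load-test-%d", i)},
			"spec": map[string]interface{}{
				// assumes a Deployment named app-<i> already exists in the namespace
				"scaleTargetRef": map[string]interface{}{"name": fmt.Sprintf("app-%d", i)},
				"triggers": []interface{}{
					map[string]interface{}{
						"type": "cron",
						"metadata": map[string]interface{}{
							"timezone":        "Etc/UTC",
							"start":           "0 6 * * *",
							"end":             "0 20 * * *",
							"desiredReplicas": "2",
						},
					},
				},
			},
		}}
		if _, err := dyn.Resource(gvr).Namespace("default").Create(context.TODO(), so, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```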

KEDA Version

2.10.1

Kubernetes Version

< 1.26

Platform

Other

Scaler Details

No response

Anything else?

No response
