Report
AWS MSK gets into high CPU usage and metadata retrieval does not work with the experimental Apache Kafka scaler
Expected Behavior
With everything correctly configured, KEDA should be able to retrieve the topic metadata, use it for scaling, and not affect MSK itself.
Actual Behavior
Metadata retrieval does not work and returns errors, causing high CPU usage on MSK and ultimately an MSK outage. The scaler is not working as expected.
Steps to Reproduce the Problem
1. Add AWS MSK IAM with roleArn-based authentication in the Apache Kafka scaler. The Kafka version on MSK is 3.5.1.
2. Set sasl to aws_msk_iam and tls to enable.
3. Use the following ScaledObject and TriggerAuthentication config (a sketch of the Secret they reference follows the manifests):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abcd-selector-scaler
  namespace: apps-abcd
spec:
  scaleTargetRef:
    name: apps-abcd-selector
  pollingInterval: 5   # Optional. Default: 30 seconds
  cooldownPeriod: 30   # Optional. Default: 300 seconds
  maxReplicaCount: 8   # Optional. Default: 100
  minReplicaCount: 2
  triggers:
    - type: apache-kafka
      metadata:
        bootstrapServers: abcd-3-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198,abcd-1-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198,abcd-1-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198
        consumerGroup: abcd-selector
        topic: Abcd.Potential.V1
        awsRegion: us-west-2
        lagThreshold: '5'
      authenticationRef:
        name: abcd-selector-trigger
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: abcd-selector-trigger
  namespace: apps-abcd
spec:
  secretTargetRef:
    - parameter: sasl
      name: abcd-selector-secret
      key: sasl
    - parameter: awsRoleArn
      name: abcd-selector-secret
      key: awsRoleArn
    - parameter: tls
      name: abcd-selector-secret
      key: tls
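The Secret referenced by the TriggerAuthentication (abcd-selector-secret) is not shown in the report. A minimal sketch of what it would contain, assuming the values from step 2; the awsRoleArn below is a placeholder, not the real role:

apiVersion: v1
kind: Secret
metadata:
  name: abcd-selector-secret
  namespace: apps-abcd
type: Opaque
stringData:
  sasl: aws_msk_iam
  tls: enable
  awsRoleArn: arn:aws:iam::111122223333:role/abcd-selector-msk-role   # placeholder role ARN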
Logs from KEDA operator
error getting metadata: kafka.(*Client).Metadata: read tcp xxx.xxx.xxx.xxx:42116->xx.xxx.xxx.xxx:9198: i/o timeout
error getting metadata: kafka.(*Client).Metadata: context deadline exceeded
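The error strings above come from the segmentio/kafka-go client used by the experimental scaler. Below is a minimal standalone sketch of the same metadata request, which may help reproduce the timeout against MSK outside of KEDA; the broker address and topic are taken from the config above, while the aws_msk_iam_v2 mechanism and default credential loading are assumptions about the environment rather than the scaler's exact code:

package main

import (
    "context"
    "crypto/tls"
    "fmt"
    "log"
    "time"

    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/segmentio/kafka-go"
    "github.com/segmentio/kafka-go/sasl/aws_msk_iam_v2"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Resolve AWS credentials the same way a pod would (env vars, IRSA, shared config).
    awsCfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-west-2"))
    if err != nil {
        log.Fatalf("loading AWS config: %v", err)
    }

    client := &kafka.Client{
        Addr: kafka.TCP("abcd-3-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198"),
        Transport: &kafka.Transport{
            TLS:  &tls.Config{},                       // tls: enable
            SASL: aws_msk_iam_v2.NewMechanism(awsCfg), // sasl: aws_msk_iam (assumed mechanism)
        },
    }

    // Same call that fails with "kafka.(*Client).Metadata: i/o timeout" in the operator logs.
    meta, err := client.Metadata(ctx, &kafka.MetadataRequest{
        Topics: []string{"Abcd.Potential.V1"},
    })
    if err != nil {
        log.Fatalf("error getting metadata: %v", err)
    }
    for _, t := range meta.Topics {
        fmt.Printf("topic %s: %d partitions\n", t.Name, len(t.Partitions))
    }
}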
KEDA Version
2.13.0
Kubernetes Version
1.26
Platform
Amazon Web Services
Scaler Details
Apache Kafka scaler (experimental)
Anything else?
This caused a major outage for us since we use a shared MSK cluster, and it is a big problem for the other services that were affected by this scaler. Even after restarting the brokers the issue persisted, because Kafka keeps state about these connections and takes a long time to stabilize afterwards.