
Issue with AWS MSK IAM using Apache Kafka scaler #5531

@sameerjoshinice

Report

The experimental Apache Kafka scaler fails to retrieve metadata from AWS MSK and drives the MSK cluster into high CPU usage.

Expected Behavior

With everything correctly configured, KEDA should be able to retrieve the metadata for the topics, use it for scaling, and not affect MSK itself.

Actual Behavior

Metadata retrieval does not work and only produces errors, while the connection attempts cause high CPU usage on MSK and ultimately an MSK outage. The scaler is therefore not working as expected.

Steps to Reproduce the Problem

  1. Add AWS MSK IAM with roleArn-based authentication to the Apache Kafka scaler. The Kafka version on MSK is 3.5.1.
  2. Set sasl to aws_msk_iam and tls to enable.
  3. Apply the following ScaledObject and TriggerAuthentication configuration (a sketch of the referenced Secret follows the manifests):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abcd-selector-scaler
  namespace: apps-abcd
spec:
  scaleTargetRef:
    name: apps-abcd-selector
  pollingInterval: 5 # Optional. Default: 30 seconds
  cooldownPeriod: 30 # Optional. Default: 300 seconds
  maxReplicaCount: 8 # Optional. Default: 100
  minReplicaCount: 2
  triggers:
    - type: apache-kafka
      metadata:
        bootstrapServers: abcd-3-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198,abcd-1-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198,abcd-1-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198
        consumerGroup: abcd-selector
        topic: Abcd.Potential.V1
        awsRegion: us-west-2
        lagThreshold: '5'
      authenticationRef:
        name: abcd-selector-trigger

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: abcd-selector-trigger
  namespace: apps-abcd
spec:
  secretTargetRef:
    - parameter: sasl
      name: abcd-selector-secret
      key: sasl
    - parameter: awsRoleArn
      name: abcd-selector-secret
      key: awsRoleArn
    - parameter: tls
      name: abcd-selector-secret
      key: tls
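
The TriggerAuthentication above reads the sasl, tls, and awsRoleArn parameters from a Secret named abcd-selector-secret. The manifest below is a minimal sketch of what that Secret is assumed to look like, based on the values stated in the steps; it is not taken from the original report, and the role ARN is a placeholder.

apiVersion: v1
kind: Secret
metadata:
  name: abcd-selector-secret
  namespace: apps-abcd
type: Opaque
stringData:
  sasl: aws_msk_iam   # selects AWS MSK IAM authentication in the apache-kafka scaler
  tls: enable         # brokers are reached on the TLS listener (port 9198)
  awsRoleArn: arn:aws:iam::111111111111:role/abcd-selector-msk-role   # placeholder ARN, not from the report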

Logs from KEDA operator

error getting metadata: kafka.(*Client).Metadata: read tcp xxx.xxx.xxx.xxx:42116->xx.xxx.xxx.xxx:9198: i/o timeout
error getting metadata: kafka.(*Client).Metadata: context deadline exceeded

KEDA Version

2.13.0

Kubernetes Version

1.26

Platform

Amazon Web Services

Scaler Details

Apache Kafka scaler (experimental)

Anything else?

This caused a major outage for us because we use a shared MSK cluster, so other services that depend on it were also affected by this scaler. Even after restarting the brokers, the issue persists because Kafka retains state for these connections and takes a long time to stabilize afterwards.
