
Explosion of errors & latency after few hours of sustained production #3643

@arbfay

Description


Version & Environment

Redpanda version (rpk version): v21.11.3 (rev b3e78b1)

The producer is a Rust program that uses rust-rdkafka (a wrapper for librdkafka, at v1.8.2). It produces ~200 msg/s across 14 topics (1 partition each, no replication, non-compacted) with the following settings (see the configuration sketch after this list):

  • security.protocol: "sasl_plaintext"
  • sasl.mechanism: "SCRAM-SHA-256"
  • message.timeout.ms: "50"
  • queue.buffering.max.ms: "1"
  • enable.idempotence: "true"
  • message.send.max.retries: "10"
  • retry.backoff.ms: "1"
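
For reference, a minimal sketch of how these settings map onto a rust-rdkafka `ClientConfig`. The bootstrap address is taken from the producer log below; the SASL credentials are placeholders, not from the report:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;

fn build_producer() -> FutureProducer {
    ClientConfig::new()
        .set("bootstrap.servers", "10.2.0.24:9092") // from the producer log below
        .set("security.protocol", "sasl_plaintext")
        .set("sasl.mechanism", "SCRAM-SHA-256")
        .set("sasl.username", "user") // placeholder
        .set("sasl.password", "pass") // placeholder
        .set("message.timeout.ms", "50") // very tight delivery timeout
        .set("queue.buffering.max.ms", "1") // flush the local queue almost immediately
        .set("enable.idempotence", "true")
        .set("message.send.max.retries", "10")
        .set("retry.backoff.ms", "1")
        .create()
        .expect("producer creation failed")
}
```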

The Redpanda cluster is a single broker in production mode running on an Ubuntu-based r6gd.large (arm64) AWS instance, with idempotency enabled and SASL enforced. The producer runs on another, nearby instance in the same subnet (which, IMO, makes network issues an unlikely cause; see below).

What went wrong?

In an image:

[Screenshot, 2022-01-28 10:43: producer and broker error/latency metrics spiking after hours of sustained production]

After a few hours, the number of errors explodes at both the producer and the broker. Having reproduced this several times and tried different producer settings, I came to the conclusion that the problem lies with the broker, i.e. Redpanda.

At first I was using an older Redpanda version and compacted topics, but I have reproduced this with the latest Redpanda version and with 14 single-partition, non-compacted topics.

What should have happened instead?

The cluster should stay stable and keep working as usual.

How to reproduce the issue?

  1. Start a single-node cluster on an ARM64 machine with idempotency and SASL enabled
  2. Start a producer with the settings and behaviour described above (a minimal send-loop sketch follows this list)
  3. Let it run for several hours (sometimes the failure happens early, sometimes it takes 10+ hours)
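
A hypothetical send loop approximating the reported workload (~200 msg/s spread over 14 single-partition topics). Topic names, keys, and payloads are illustrative; it assumes the producer from the configuration sketch above plus a tokio runtime:

```rust
use std::time::Duration;
use rdkafka::producer::{FutureProducer, FutureRecord};

async fn run(producer: &FutureProducer) {
    // Illustrative topic names; the report only says "14 topics".
    let topics: Vec<String> = (1..=14).map(|i| format!("topic_{}", i)).collect();
    loop {
        for topic in &topics {
            let record = FutureRecord::to(topic).key("key").payload("payload");
            // With message.timeout.ms=50, delivery timeouts surface quickly here.
            if let Err((e, _msg)) = producer.send(record, Duration::from_millis(50)).await {
                eprintln!("delivery failed: {}", e);
            }
        }
        // 14 messages every ~70 ms ≈ 200 msg/s overall.
        tokio::time::sleep(Duration::from_millis(70)).await;
    }
}
```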

Additional information

Here is what the producer says (many, many times):

ERROR librdkafka > librdkafka: FAIL [thrd:sasl_plaintext://10.2.0.24:9092/bootstrap]: sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: OperationTimedOut (Local: Timed out): sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down

What the broker says:

DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_1/0}] time retention timestamp: {timestamp: 1642711774595}, first segment max timestamp: {timestamp: 1643252751288}
DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_5/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
INFO  2022-01-28 13:00:07,060 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:07,326 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:07,859 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:08,912 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
WARN  2022-01-28 13:01:13,845 [shard 0] rpc - error connecting to 0.0.0.0:33145 - std::__1::system_error (error system:111, Connection refused)

We also observed a sustained increase in memory usage (~1 GB every 5 hours) under the same workload.

JIRA Link: CORE-824
