Description
Version & Environment
Redpanda version (from `rpk version`): v21.11.3 (rev b3e78b1)
The producer is a Rust program that uses rust-rdkafka (a wrapper for librdkafka, at v1.8.2). It produces ~200 msg/s across 14 topics (1 partition, no replication, not compacted), with the following settings (a minimal configuration sketch in Rust follows the list):
- `security.protocol`: "sasl_plaintext"
- `sasl.mechanism`: "SCRAM-SHA-256"
- `message.timeout.ms`: "50"
- `queue.buffering.max.ms`: "1"
- `enable.idempotence`: "true"
- `message.send.max.retries`: "10"
- `retry.backoff.ms`: "1"
The Redpanda cluster is a single broker in production mode, running on an Ubuntu-based r6gd.large (arm64) AWS instance with idempotency enabled and SASL enforced. The producer runs on another nearby instance in the same subnet, which IMO makes network issues an unlikely cause (see below).
What went wrong?
[Attached graph: error counts at the producer and the broker over time.]
After a few hours, the number of errors explodes at both the producer and the broker. Having reproduced this several times and tried different producer settings, I came to the conclusion that the problem is the broker, i.e. Redpanda.
At first I was using an older Redpanda version and compacted topics, but I have reproduced this with the latest Redpanda version and with 14 single-partition, non-compacted topics.
What should have happened instead?
The cluster should stay stable and keep working as usual.
How to reproduce the issue?
- Start a single-node cluster on an ARM64 machine with idempotency and SASL enabled
- Start a producer with the settings and behaviour described above (a rough sketch of such a produce loop follows this list)
- Let it run for several hours (sometimes the issue appears early, sometimes it takes more than 10 hours)
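For completeness, here is a rough sketch of the kind of produce loop meant in the second step, reusing the producer configured above. Topic names, keys, and payloads are made up for illustration; the real program differs:

```rust
use rdkafka::producer::{BaseProducer, BaseRecord};
use std::thread::sleep;
use std::time::Duration;

// Rough sketch of the workload: ~200 msg/s spread over 14 single-partition
// topics. Topic names, keys, and payloads are made up for illustration.
fn run(producer: &BaseProducer) {
    let topics: Vec<String> = (1..=14).map(|i| format!("topic_{}", i)).collect();
    let mut n: u64 = 0;
    loop {
        for topic in &topics {
            let key = n.to_string();
            let payload = format!("msg-{}", n);
            if let Err((e, _)) = producer.send(
                BaseRecord::to(topic).key(&key).payload(&payload),
            ) {
                eprintln!("enqueue failed: {}", e);
            }
            n += 1;
        }
        // Serve delivery reports and keep librdkafka's internal queue moving.
        producer.poll(Duration::from_millis(0));
        // 14 messages per iteration, one iteration every ~70 ms -> ~200 msg/s.
        sleep(Duration::from_millis(70));
    }
}
```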
Additional information
Here is what the producer says (many, many times):
ERROR librdkafka > librdkafka: FAIL [thrd:sasl_plaintext://10.2.0.24:9092/bootstrap]: sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: OperationTimedOut (Local: Timed out): sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
What the broker says:
DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_1/0}] time retention timestamp: {timestamp: 1642711774595}, first segment max timestamp: {timestamp: 1643252751288}
DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_5/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
INFO 2022-01-28 13:00:07,060 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO 2022-01-28 13:00:07,326 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO 2022-01-28 13:00:07,859 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO 2022-01-28 13:00:08,912 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
WARN 2022-01-28 13:01:13,845 [shard 0] rpc - error connecting to 0.0.0.0:33145 - std::__1::system_error (error system:111, Connection refused)
We also observed a sustained increase in memory usage (~1 GB per 5 hours) under the same workload.
JIRA Link: CORE-824