
Explosion of errors & latency after few hours of sustained production #3643

@arbfay

Description


Version & Environment

Redpanda version (rpk version): v21.11.3 (rev b3e78b1)

The producer is a Rust program that uses rust-rdkafka (a wrapper for librdkafka, at v1.8.2). It produces ~200 msg/s across 14 topics (1 partition each, no replication, non-compacted) with the following settings (see the configuration sketch after this list):

  • security.protocol: "sasl_plaintext"
  • sasl.mechanism: "SCRAM-SHA-256"
  • message.timeout.ms: "50"
  • queue.buffering.max.ms: "1"
  • enable.idempotence: "true"
  • message.send.max.retries: "10"
  • retry.backoff.ms: "1"
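
For reference, a minimal sketch of how these settings map onto a rust-rdkafka `ClientConfig`. The bootstrap address is taken from the producer log below; the SASL credentials are placeholders, not from the report:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;

fn build_producer() -> FutureProducer {
    ClientConfig::new()
        .set("bootstrap.servers", "10.2.0.24:9092") // from the producer log below
        .set("security.protocol", "sasl_plaintext")
        .set("sasl.mechanism", "SCRAM-SHA-256")
        .set("sasl.username", "user") // placeholder
        .set("sasl.password", "pass") // placeholder
        .set("message.timeout.ms", "50") // very tight delivery timeout
        .set("queue.buffering.max.ms", "1") // flush the local queue almost immediately
        .set("enable.idempotence", "true")
        .set("message.send.max.retries", "10")
        .set("retry.backoff.ms", "1")
        .create()
        .expect("producer creation failed")
}
```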

The Redpanda cluster is a single broker in production mode running on an Ubuntu-based r6gd.large (arm64) AWS instance, with idempotency enabled and SASL enforced. The producer runs on another, nearby instance in the same subnet (which, IMO, makes network issues an unlikely cause; see below).

What went wrong?

In an image:

[Screenshot, 2022-01-28 10:43: producer and broker error/latency metrics spiking after hours of sustained production]

After a few hours, the number of errors explodes at both the producer and the broker. Having reproduced this several times and tried different producer settings, I came to the conclusion that the problem lies with the broker, i.e. Redpanda.

At first I was using an older Redpanda version and compacted topics, but I have reproduced this with the latest Redpanda version and with 14 single-partition, non-compacted topics.

What should have happened instead?

The cluster should stay stable and keep working as usual.

How to reproduce the issue?

  1. Start a single-node cluster on an ARM64 machine with idempotency and SASL enabled
  2. Start a producer with the settings and behaviour described above (a minimal send-loop sketch follows this list)
  3. Let it run for several hours (sometimes the failure happens early, sometimes it takes 10+ hours)
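
A hypothetical send loop approximating the reported workload (~200 msg/s spread over 14 single-partition topics). Topic names, keys, and payloads are illustrative; it assumes the producer from the configuration sketch above plus a tokio runtime:

```rust
use std::time::Duration;
use rdkafka::producer::{FutureProducer, FutureRecord};

async fn run(producer: &FutureProducer) {
    // Illustrative topic names; the report only says "14 topics".
    let topics: Vec<String> = (1..=14).map(|i| format!("topic_{}", i)).collect();
    loop {
        for topic in &topics {
            let record = FutureRecord::to(topic).key("key").payload("payload");
            // With message.timeout.ms=50, delivery timeouts surface quickly here.
            if let Err((e, _msg)) = producer.send(record, Duration::from_millis(50)).await {
                eprintln!("delivery failed: {}", e);
            }
        }
        // 14 messages every ~70 ms ≈ 200 msg/s overall.
        tokio::time::sleep(Duration::from_millis(70)).await;
    }
}
```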

Additional information

Here is what the producer says (many, many times):

ERROR librdkafka > librdkafka: FAIL [thrd:sasl_plaintext://10.2.0.24:9092/bootstrap]: sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: OperationTimedOut (Local: Timed out): sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down

What the broker says:

DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_1/0}] time retention timestamp: {timestamp: 1642711774595}, first segment max timestamp: {timestamp: 1643252751288}
DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_5/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
INFO  2022-01-28 13:00:07,060 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:07,326 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:07,859 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:08,912 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
WARN  2022-01-28 13:01:13,845 [shard 0] rpc - error connecting to 0.0.0.0:33145 - std::__1::system_error (error system:111, Connection refused)

We also observed a sustained increase in memory usage (~1 GB every 5 hours) under the same workload.

JIRA Link: CORE-824
