Improve slow follower reliability/performance

### What would you like to be added?

I was debugging a couple of test failures today in our CI environment, where we're running a three-node v3.5.3 cluster on a cloud with a V in it.
Looking at the metrics, I saw one follower being consistently very slow (high fsync latency). Theoretically this shouldn't impact any writes, as we still have the leader and another fast follower to persist the messages safely.

![image](https://user-images.githubusercontent.com/85170409/191543122-c2f2b007-7023-4f51-b668-8f71de1f4830.png)

Sorry, this should've been a sequence diagram, but that's all I got between meetings :) Effectively, the message should be sent to the slow etcd-2 as well, but etcd-3 would respond faster, so the entry can be committed and we can move on. etcd-2 can eventually catch up on the log.

What I do see in reality, however, is bad latency on the whole write operation, which suspiciously looks as if it waits for both followers (fast and slow) to apply the entry.

I found this code path, which explains why it was done this way and makes sense:
https://github.com/etcd-io/etcd/blob/main/server/etcdserver/raft.go#L287-L313

Did I get this right, or does it work as I described initially? If the former, how can we mitigate the slow-follower issue?

### Why is this needed?

A slow follower has a significant negative impact (e.g., slow writes and higher error rates) on an otherwise healthy quorum.

Improve slow follower reliability/performance #14501
