Improve slow follower reliability/performance #14501

@tjungblu

What would you like to be added?

I was debugging a couple of test failures today in our CI environment, where we run a three-node v3.5.3 cluster on a cloud with a V in it.
Looking at the metrics, I saw one follower being consistently very slow (high fsync latency). In theory this shouldn't impact writes, since the leader and the other, fast follower are still enough to persist the messages safely.

image

Sorry, this should've been a sequence diagram, but that's all I got between meetings :) Effectively, the message is also sent to the slow etcd-2, but etcd-3 responds faster. Once etcd-3 has acknowledged the entry, a majority has it, so it can be committed and we can move on. etcd-2 can eventually catch up on the log.
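To make that expectation concrete, here is a minimal sketch of the majority rule I have in mind. This is not etcd's or the raft library's actual code: `quorumCommitIndex` and the member-name map are hypothetical names used only for illustration, and the sketch assumes each member reports the last log index it has durably written.

```go
package main

import "sort"

// quorumCommitIndex is a hypothetical helper, not part of etcd or its raft
// library. matchIndex maps a member name to the last log index that member
// has acknowledged as durably written. The result is the highest index that
// a majority of members have reached.
func quorumCommitIndex(matchIndex map[string]uint64) uint64 {
	if len(matchIndex) == 0 {
		return 0
	}
	indexes := make([]uint64, 0, len(matchIndex))
	for _, idx := range matchIndex {
		indexes = append(indexes, idx)
	}
	// Sort descending so that position quorum-1 holds the largest index
	// that at least `quorum` members have reached.
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })
	quorum := len(indexes)/2 + 1
	return indexes[quorum-1]
}
```

With three members where the leader and etcd-3 are at index 10 and etcd-2 is still at index 4, this returns 10, so under this rule alone the slow follower would not hold back the commit.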

What I do see in reality, however, is bad latency on the whole write operation, which suspiciously looks like it waits for both followers (fast and slow) to apply the entry.

I found this code path explaining why this was done that way, which makes sense:
https://github.com/etcd-io/etcd/blob/main/server/etcdserver/raft.go#L287-L313

Did I get this right or does it work as described initially? In the former case, how can we mitigate the slow-follower issue?

Why is this needed?

A slow follower has a significant negative impact (e.g. slow writes and higher error rates) on an otherwise healthy quorum.
