Skip to content

The watch leader loop should exit ASAP when etcd leader changes #8823

@rleungx

Description

@rleungx

Bug Report

pd/pkg/syncer/client.go

Lines 149 to 169 in 40fe83e

if err != nil {
if ev, ok := status.FromError(err); ok {
if ev.Code() == codes.Canceled {
return
}
}
log.Error("server failed to establish sync stream with leader", zap.String("server", s.server.Name()), zap.String("leader", s.server.GetLeader().GetName()), errs.ZapError(err))
time.Sleep(time.Second)
continue
}
log.Info("server starts to synchronize with leader", zap.String("server", s.server.Name()), zap.String("leader", s.server.GetLeader().GetName()), zap.Uint64("request-index", s.history.getNextIndex()))
for {
resp, err := stream.Recv()
if err != nil {
s.streamingRunning.Store(false)
log.Error("region sync with leader meet error", errs.ZapError(errs.ErrGRPCRecv, err))
if err = stream.CloseSend(); err != nil {
log.Error("failed to terminate client stream", errs.ZapError(errs.ErrGRPCCloseSend, err))
}
time.Sleep(time.Second)
break

Right now, we use sleep to prevent retrying to sync with the leader too frequently. However, it will indeed affect the PD election time if the etcd leader changes. The follower must wait for 1s to exit the loop before starting the election. See both line 156 and line 168. We can use the context to make it exit ASAP.

Here is the case that slows down the leader election:

  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:1348] [\"ec5a786a755b23bd [term 74] received MsgTimeoutNow from 81a4031a31407385 and starts an election to get leadership.\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:719] [\"ec5a786a755b23bd became candidate at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:830] [\"ec5a786a755b23bd received MsgVoteResp from ec5a786a755b23bd at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to 6b3fd9f5a8c74ff3 at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to 81a4031a31407385 at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to ad4c4e35a9bd4181 at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to f587050a2df8fa3f at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [node.go:333] [\"raft.node: ec5a786a755b23bd lost leader 81a4031a31407385 at term 75\"]",
  "[2024/11/14 18:56:49.249 +00:00] [INFO] [raft.go:830] [\"ec5a786a755b23bd received MsgVoteResp from 81a4031a31407385 at term 75\"]",
  "[2024/11/14 18:56:49.249 +00:00] [INFO] [raft.go:1295] [\"ec5a786a755b23bd has received 2 MsgVoteResp votes and 0 vote rejections\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [raft.go:830] [\"ec5a786a755b23bd received MsgVoteResp from 6b3fd9f5a8c74ff3 at term 75\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [raft.go:1295] [\"ec5a786a755b23bd has received 3 MsgVoteResp votes and 0 vote rejections\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [raft.go:771] [\"ec5a786a755b23bd became leader at term 75\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [node.go:327] [\"raft.node: ec5a786a755b23bd elected leader ec5a786a755b23bd at term 75\"]",
  "[2024/11/14 18:56:49.284 +00:00] [INFO] [leadership.go:375] [\"current leadership is deleted\"] [revision=647767897] [leader-key=/pd/7366397889337882103/leader] [purpose=\"leader election\"]",
  "[2024/11/14 18:56:49.284 +00:00] [ERROR] [client.go:161] [\"region sync with leader meet error\"] [error=\"[PD:grpc:ErrGRPCRecv]receive response error: rpc error: code = Canceled desc = context canceled\"]",
  "[2024/11/14 18:56:50.284 +00:00] [INFO] [server.go:1669] [\"pd leader has changed, try to re-campaign a pd leader\"]",
  "[2024/11/14 18:56:50.284 +00:00] [INFO] [server.go:1706] [\"start to campaign PD leader\"] [campaign-leader-name=xxx.pd-1]",
  "[2024/11/14 18:56:50.286 +00:00] [INFO] [lease.go:66] [\"lease granted\"] [lease-id=4720215488460578587] [lease-timeout=3] [purpose=\"leader election\"]",
  "[2024/11/14 18:56:50.288 +00:00] [INFO] [leadership.go:182] [\"check campaign resp\"] [resp=\"{\\\"header\\\":{\\\"cluster_id\\\":8535908639037064845,\\\"member_id\\\":17031057339485725629,\\\"revision\\\":64776
7925,\\\"raft_term\\\":75},\\\"succeeded\\\":true,\\\"responses\\\":[{\\\"Response\\\":{\\\"ResponsePut\\\":{\\\"header\\\":{\\\"revision\\\":647767925}}}}]}\"]",
  "[2024/11/14 18:56:50.288 +00:00] [INFO] [leadership.go:191] [\"write leaderData to leaderPath ok\"] [leader-key=/pd/7366397889337882103/leader] [purpose=\"leader election\"]",
  "[2024/11/14 18:56:50.288 +00:00] [INFO] [server.go:1732] [\"campaign PD leader ok\"] [campaign-leader-name=xxx.pd-1]",

Metadata

Metadata

Assignees

No one assigned

    Labels

    affects-5.4This bug affects the 5.4.x(LTS) versions.affects-6.1This bug affects the 6.1.x(LTS) versions.affects-6.5This bug affects the 6.5.x(LTS) versions.affects-7.1This bug affects the 7.1.x(LTS) versions.affects-7.5This bug affects the 7.5.x(LTS) versions.affects-8.1This bug affects the 8.1.x(LTS) versions.affects-8.5This bug affects the 8.5.x(LTS) versions.report/customerCustomers have encountered this bug.severity/majortype/bugThe issue is confirmed as a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions