Skip to content

PatrolKeyspaceAssignment got stuck in the middle of process for hours #6628

@binshi-bing

Description

@binshi-bing

Enhancement Task

[2023/06/16 20:44:48.545 +00:00] [INFO] [tso_keyspace_group.go:172] ["enter into SplitKeyspaceGroupByID !!!!!!!!!!!!!!!!"]
[2023/06/16 20:44:48.545 +00:00] [INFO] [tso_keyspace_group.go:192] ["SplitKeyspaceGroupByID splitParams.Keyspaces !!!!!!!!!!!!!!!!"] [splitParams.Keyspaces="[4000]"]
[2023/06/16 20:44:48.545 +00:00] [INFO] [keyspace.go:673] ["SplitKeyspaceGroupByID after PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"]
[2023/06/16 20:44:48.545 +00:00] [INFO] [keyspace.go:676] ["SplitKeyspaceGroupByID in PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"] [next-start-id=0]
[2023/06/16 20:44:48.561 +00:00] [INFO] [keyspace.go:676] ["SplitKeyspaceGroupByID in PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"] [next-start-id=120]
[2023/06/16 20:44:48.572 +00:00] [INFO] [keyspace.go:676] ["SplitKeyspaceGroupByID in PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"] [next-start-id=240]
[2023/06/16 20:44:48.599 +00:00] [INFO] [keyspace.go:676] ["SplitKeyspaceGroupByID in PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"] [next-start-id=360]
...
[2023/06/16 20:44:48.843 +00:00] [INFO] [keyspace.go:676] ["SplitKeyspaceGroupByID in PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"] [next-start-id=3803]
[2023/06/16 20:44:48.853 +00:00] [INFO] [keyspace.go:676] ["SplitKeyspaceGroupByID in PatrolKeyspaceAssignment !!!!!!!!!!!!!!!!"] [next-start-id=4008]

There are more keyspaces to go, but it neither proceed nor exit from the assignment loop.

This might be related -- the keyspaces aren't consequtives.

Environment

dev env (dev-us-east-1-f01), namespace(tidb-serverless)

Root Cause Analysis

It's observed that it's blocked on manager.metaLock.Lock(keyspace.Id).

load 128 keyspaces (meta) from etcd (/pd/{cluster id}/keyspaces/meta/)
for _, keyspace := range keyspaces {
    ...
    manager.metaLock.Lock(keyspace.Id)
    ...

metaLock is created with syncutil.NewLockGroup(syncutil.WithHash(MaskKeyspaceID)),

// MaskKeyspaceID is used to hash the spaceID inside the lockGroup.
// A simple mask is applied to spaceID to use its last byte as map key,
// limiting the maximum map length to 256.
// Since keyspaceID is sequentially allocated, this can also reduce the chance
// of collision when comparing with random hashes.
func MaskKeyspaceID(id uint32) uint32 {
return id & 0xFF
}

Which means that if two keyspace ids in the loaded batch have the same last byte, it encounters deadlock. In our case, the keyspace ids loaded from keyspace meta storage isn't consecutive -- though the keyspace id is created sequentially, the stored keyspace meta might be expired or deleted as time goes.

There are holes in the keyspace range in keyspace meta space:

...
/pd/7187976276065784319/keyspaces/meta/00000862
/pd/7187976276065784319/keyspaces/meta/00000863
/pd/7187976276065784319/keyspaces/meta/00000864
/pd/7187976276065784319/keyspaces/meta/00000880
/pd/7187976276065784319/keyspaces/meta/00000899
/pd/7187976276065784319/keyspaces/meta/00000920
/pd/7187976276065784319/keyspaces/meta/00000921
...

In our case, the process stuck at MetaLock.Lock(ks-id=4520) which has the same last byte 0xA8 as the first keyspace id 4008 in this batch, because MetaLock.Lock(ks-id=4008) already acquired.

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/enhancementThe issue or PR belongs to an enhancement.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions