Description
What happened?
A put request with a lease takes apply time that grows linearly with the number of keys already attached to the lease. With a growing number of keys and constant QPS, the apply time increases to the point where etcd spends almost 100% of its time applying put requests, causing lease renewals to fail.
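For illustration only (this is not etcd's code), here is a hypothetical toy sketch of the cost model behind this pattern: if each put does work proportional to the number of keys already attached to the lease, the i-th put costs O(i), so per-put latency grows linearly and total apply time grows quadratically.

package main

import (
	"fmt"
	"strconv"
	"time"
)

// toyLease is a hypothetical stand-in for a lease that tracks attached keys.
// It only illustrates the cost model; it is NOT etcd's implementation.
type toyLease struct {
	keys map[string]struct{}
}

// attach simulates per-put work that is linear in the number of keys the
// lease already holds (e.g. rebuilding a full snapshot of the key set).
func (l *toyLease) attach(key string) {
	snapshot := make([]string, 0, len(l.keys)+1)
	for k := range l.keys {
		snapshot = append(snapshot, k)
	}
	_ = snapshot
	l.keys[key] = struct{}{}
}

func main() {
	for _, n := range []int{10000, 20000, 40000} {
		l := &toyLease{keys: map[string]struct{}{}}
		start := time.Now()
		for i := 0; i < n; i++ {
			l.attach("key-" + strconv.Itoa(i))
		}
		// Doubling the number of attached keys roughly quadruples total time:
		// linear per-put cost, quadratic in aggregate.
		fmt.Printf("attached %d keys in %v\n", n, time.Since(start))
	}
}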
What did you expect to happen?
I would expect a put with a lease not to depend linearly on the number of keys already attached to the lease.
How can we reproduce it (as minimally and precisely as possible)?
Run Etcd with:
docker run -p 2379:2379 -p 2380:2380 --mount type=bind,source=/tmp/etcd-data.tmp,destination=/etcd-data --name etcd-gcr-v3.5.9 gcr.io/etcd-development/etcd:v3.5.9 /usr/local/bin/etcd --name s1 --data-dir /etcd-data --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://0.0.0.0:2380 --initial-cluster s1=http://0.0.0.0:2380 --initial-cluster-token tkn --initial-cluster-state new --auto-compaction-retention=1 --log-level info --logger zap --log-outputs stderr
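Before starting the test client it can help to confirm the endpoint is actually serving; a minimal readiness check using the same clientv3 package (assuming the default localhost:2379 endpoint) might look like:

package main

import (
	"context"
	"fmt"
	"time"

	client "go.etcd.io/etcd/client/v3"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status reports the member's version and leader; an error here means
	// the endpoint is not ready for the test client yet.
	resp, err := c.Status(ctx, "localhost:2379")
	if err != nil {
		panic(err)
	}
	fmt.Printf("endpoint healthy, etcd version %s\n", resp.Version)
}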
Test client:
package main

import (
	"context"
	"fmt"
	"strconv"
	"sync"
	"time"

	client "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
	"golang.org/x/time/rate"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Println("failed to create client:", err)
		return
	}
	defer c.Close()

	fmt.Println("Opening session")
	// The session keeps a single lease alive; every key below is attached to it.
	session, err := concurrency.NewSession(c, concurrency.WithTTL(30))
	if err != nil {
		fmt.Println("failed to open session:", err)
		return
	}

	// Constant 1000 puts/s with small bursts, so QPS stays flat while the
	// number of keys attached to the lease keeps growing.
	limiter := rate.NewLimiter(1000, 20)
	maxDuration := time.Duration(0)
	mu := sync.Mutex{}

	fmt.Println("Starting writes")
	for i := 0; i <= 10000000; i++ {
		i := i
		limiter.Wait(context.Background())
		key := "asdasdasdasdasd" + strconv.Itoa(i)
		go func() {
			leaseID := session.Lease()
			start := time.Now()
			// Each put attaches one more key to the shared lease.
			c.Put(context.Background(), key, key, client.WithLease(leaseID))
			duration := time.Since(start)

			mu.Lock()
			defer mu.Unlock()
			if duration > maxDuration {
				fmt.Printf("MAX latency: %v, entries: %d\n", duration, i)
				maxDuration = duration
			}
			if i%1000 == 0 {
				fmt.Printf("latency: %v, entries: %d\n", duration, i)
			}
		}()
	}
}
Apply duration increases linearly:
Anything else we need to know?
The above repro uses a fairly high QPS of 1k, but I've observed a similar issue with ~50k entries attached to the lease and only 20 QPS of put requests:
As you can see above, as sum(rate(etcd_server_apply_duration_seconds_sum[10s]))
approaches 1.0, leases start failing to renew (the red line is lease renewal returning an unavailable error).
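If Prometheus is not set up, the same metric can be watched directly by scraping the /metrics endpoint that etcd serves on the client URL; a minimal sketch (assuming the default localhost:2379 address):

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Fetches the Prometheus text exposition from etcd and prints the
// etcd_server_apply_duration_seconds series graphed above.
func main() {
	resp, err := http.Get("http://localhost:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "etcd_server_apply_duration_seconds") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}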
Etcd version (please run commands below)
I've tested both 3.5.9 (repro) and 3.5.4 (production setup with 20 QPS)
Etcd configuration (command line flags or environment variables)
Provided in reproduction
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
No response
Relevant log output
No response