
Put operation with lease takes linear apply time depending on number of keys already attached to lease  #15993

@marseel

Description

What happened?

A put request with a lease takes apply time linear in the number of keys already attached to the lease. With a growing number of keys and constant QPS, apply time increases to the point where etcd spends almost 100% of its time applying put requests, causing lease renewals to fail.

What did you expect to happen?

I would expect a put with a lease not to depend linearly on the number of keys already attached to the lease.

How can we reproduce it (as minimally and precisely as possible)?

Run etcd with (the bind-mount source directory must exist first: mkdir -p /tmp/etcd-data.tmp):

docker run \
  -p 2379:2379 \
  -p 2380:2380 \
  --mount type=bind,source=/tmp/etcd-data.tmp,destination=/etcd-data \
  --name etcd-gcr-v3.5.9 \
  gcr.io/etcd-development/etcd:v3.5.9 \
  /usr/local/bin/etcd \
  --name s1 \
  --data-dir /etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://0.0.0.0:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://0.0.0.0:2380 \
  --initial-cluster s1=http://0.0.0.0:2380 \
  --initial-cluster-token tkn \
  --initial-cluster-state new \
  --auto-compaction-retention=1 \
  --log-level info \
  --logger zap \
  --log-outputs stderr

Test client:

package main

import (
	"context"
	"fmt"
	"strconv"
	"sync"
	"time"

	client "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
	"golang.org/x/time/rate"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Println("failed to create client:", err)
		return
	}
	defer c.Close()

	fmt.Println("Opening session")
	// The session keeps a single lease alive; every key below is attached to it.
	session, err := concurrency.NewSession(c, concurrency.WithTTL(30))
	if err != nil {
		fmt.Println("failed to open session:", err)
		return
	}
	defer session.Close()

	// Allow 1000 puts/s with a small burst.
	limiter := rate.NewLimiter(1000, 20)
	maxDuration := time.Duration(0)
	mu := sync.Mutex{}

	fmt.Println("Starting writes")
	for i := 0; i <= 10000000; i++ {
		i := i
		if err := limiter.Wait(context.Background()); err != nil {
			return
		}
		key := "asdasdasdasdasd" + strconv.Itoa(i)
		go func() {
			leaseID := session.Lease()
			start := time.Now()
			// Each put attaches one more key to the shared lease.
			if _, err := c.Put(context.Background(), key, key, client.WithLease(leaseID)); err != nil {
				fmt.Println("put failed:", err)
				return
			}
			duration := time.Since(start)

			mu.Lock()
			defer mu.Unlock()
			if duration > maxDuration {
				fmt.Printf("MAX latency: %v, entries: %d\n", duration, i)
				maxDuration = duration
			}
			if i%1000 == 0 {
				fmt.Printf("latency: %v, entries: %d\n", duration, i)
			}
		}()
	}
}

Apply duration increases linearly:
(Screenshot from 2023-06-01 12-55-31)

Anything else we need to know?

The above repro uses a fairly high rate of 1k QPS, but I've observed a similar issue with ~50k entries attached to the lease and only 20 QPS of put requests:
(metrics screenshot)
As the graph shows, when sum(rate(etcd_server_apply_duration_seconds_sum[10s])) approaches 1.0, leases start failing to renew (the red line is lease renewal returning an Unavailable error).

Etcd version (please run commands below)

I've tested both 3.5.9 (repro) and 3.5.4 (production setup with 20 QPS)

Etcd configuration (command line flags or environment variables)

Provided in reproduction

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response
