
Service/backend ID pool is not scaling with bpf lb map size #16121

@Weil0ng

Description


Bug report

In large-scale use cases (>64k k8s endpoints), Cilium first logs the following error once the number of backends grows past 64k:

level=error msg="Error while inserting service in LB map" error="Unable to acquire backend ID for {\"10.74.184.36\" {\"TCP\" 'P'} '\\x00'}: no service ID available" k8sNamespace=test-zqygrq-14 k8sSvcName=small-service-9 subsys=k8s-watcher

Later, if the same backend is touched again, the agent crashes with the following stack trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x20892d2]
goroutine 143 [running]:
github.com/cilium/cilium/pkg/service.(*Service).updateBackendsCacheLocked(0xc0001e1980, 0xc0017947e0, 0xc02b778c80, 0x3, 0x4, 0xc02b778c80, 0x2, 0x4, 0x0, 0x5577bfe1634237c4, ...)
	/go/src/github.com/cilium/cilium/pkg/service/service.go:943 +0x912
github.com/cilium/cilium/pkg/service.(*Service).UpsertService(0xc0001e1980, 0xc025cca840, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/service/service.go:305 +0xb16
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addK8sSVCs(0xc0009ef040, 0xc02797b120, 0x10, 0xc02797b100, 0xe, 0x0, 0xc00aa4c3f0, 0xc01b9243f0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/watcher.go:769 +0x47b
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).k8sServiceHandler.func1(0x0, 0xc02797b120, 0x10, 0xc02797b100, 0xe, 0xc00aa4c3f0, 0x0, 0xc01b9243f0, 0xc000b39fb0)
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/watcher.go:473 +0xa76
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).k8sServiceHandler(0xc0009ef040)
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/watcher.go:516 +0x95
created by github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).RunK8sServiceHandler
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/watcher.go:521 +0x3f

Discussed offline with @aanm and @brb; the issue seems to be that the MaxID const defined in https://github.com/cilium/cilium/blob/ffe0d11b398e85d7860fefd71c2a38a252405059/pkg/service/const.go limits how many services/backends can be stored, regardless of bpf-lb-map-size.

The reason for this hard limit in the control plane is that the datapath uses u16 as the type for backend IDs. There is a proposal to relax this limit (#16110).
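To illustrate the failure mode, here is a minimal sketch (not Cilium's actual allocator code) of why an ID pool backed by a u16 datapath type caps out at 65535 entries no matter how large the BPF map is; the names `idPool` and `acquire` are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sketch: IDs above 65535 cannot be represented in a u16,
// so the allocator must refuse them regardless of the BPF map size.
const (
	firstID = 1
	maxID   = 65535 // limit imposed by the u16 backend ID in the datapath
)

type idPool struct {
	next uint32 // next candidate ID; tracked in a wider type than the datapath's u16
}

// acquire hands out the next ID, failing once the u16 space is exhausted,
// mirroring the "no service ID available" error in the log above.
func (p *idPool) acquire() (uint16, error) {
	if p.next > maxID {
		return 0, errors.New("no service ID available")
	}
	id := uint16(p.next)
	p.next++
	return id, nil
}

func main() {
	p := &idPool{next: firstID}
	n := 0
	for {
		if _, err := p.acquire(); err != nil {
			fmt.Printf("allocated %d IDs before failure: %v\n", n, err)
			return
		}
		n++
	}
}
```

Growing the map alone does not help: the allocator above fails on the 65536th backend, which matches the behavior reported when endpoint counts exceed 64k.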

General Information

  • Cilium version (run cilium version): 1.9.5
  • Orchestration system version in use (e.g. kubectl version, ...): 1.20.6

How to reproduce the issue

  1. Install Cilium with bpf-lb-map-size set to >64k.
  2. Create services whose total number of backends exceeds 64k.
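For step 1, one possible way to set the map size is via Helm; this assumes the chart exposes the option as `bpf.lbMapMax` (verify against your chart version):

```shell
# Assumed Helm value name (bpf.lbMapMax) mapping to bpf-lb-map-size;
# 131072 > 65536, i.e. larger than the u16 backend ID space.
helm install cilium cilium/cilium --namespace kube-system \
  --set bpf.lbMapMax=131072
```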

Metadata

Assignees

No one assigned

Labels

kind/bug: This is a bug in the Cilium logic.
pinned: These issues are not marked stale by our issue bot.
