
Conversation

@ryanwuer commented Apr 1, 2025

Trying to solve this TODO:

// TODO(brancz): use cache.Indexer to index endpointslices by
// LabelServiceName so this operation doesn't have to iterate over all
// endpoint objects.
for _, obj := range e.endpointSliceStore.List() {
	esa, err := e.getEndpointSliceAdaptor(obj)
	if err != nil {
		e.logger.Error("converting to EndpointSlice object failed", "err", err)
		continue
	}
	if lv, exists := esa.labels()[esa.labelServiceName()]; exists && lv == svc.Name {
		e.enqueue(esa.get())
	}
}
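
For reference, a hedged sketch (not the PR's actual diff) of how such an index could be registered with client-go. The serviceIndex name and addServiceIndex helper are made up for illustration, and only the discovery/v1 EndpointSlice type is handled here:

package kubernetes

import (
	"fmt"

	v1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/tools/cache"
)

// serviceIndex is a hypothetical index name; the merged change may use a different one.
const serviceIndex = "service"

// addServiceIndex registers an index that keys each EndpointSlice by its
// kubernetes.io/service-name label, so the slices belonging to a Service can be
// looked up directly instead of iterating over the whole store.
func addServiceIndex(inf cache.SharedIndexInformer) error {
	return inf.AddIndexers(cache.Indexers{
		serviceIndex: func(obj interface{}) ([]string, error) {
			es, ok := obj.(*v1.EndpointSlice)
			if !ok {
				return nil, fmt.Errorf("object is not an EndpointSlice: %T", obj)
			}
			svcName, exists := es.Labels[v1.LabelServiceName]
			if !exists {
				// Slices without the label are simply left out of the index.
				return nil, nil
			}
			return []string{svcName}, nil
		},
	})
}

With the index in place, the slices for a given Service can be fetched via GetIndexer().ByIndex(serviceIndex, svc.Name) rather than listing every object.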

@ryanwuer requested a review from brancz as a code owner April 1, 2025 08:03
@ryanwuer changed the title from "refactor(endpointslice): use cache.Indexer to achieve better iteration" to "refactor(endpointslice): use service cache.Indexer to achieve better iteration performance" Apr 1, 2025
@ryanwuer commented Apr 1, 2025

@brancz Hi, could you please take a look at this PR? Does it meet expectations?

@ryanwuer commented Apr 1, 2025

The "CI / More Go tests" job failed because of the error below, which looks like it has nothing to do with my PR?

--- FAIL: TestQueryLog (0.00s)
    --- FAIL: TestQueryLog/api_queries,_127.0.0.1:36843,_enabled_at_start (0.64s)
        query_log_test.go:322: time=2025-04-01T08:07:57.583Z level=INFO source=main.go:660 msg="No time or size retention was set so using the default time retention" duration=15d
            time=2025-04-01T08:07:57.583Z level=INFO source=main.go:711 msg="Starting Prometheus Server" mode=server version="(version=, branch=, revision=unknown)"
            time=2025-04-01T08:07:57.583Z level=WARN source=main.go:713 msg="This Prometheus binary has not been compiled for a 64-bit architecture. Due to virtual memory constraints of 32-bit systems, it is highly recommended to switch to a 64-bit binary of Prometheus." GOARCH=386
            time=2025-04-01T08:07:57.583Z level=INFO source=main.go:716 msg="operational information" build_context="(go=go1.23.7, platform=linux/386, user=, date=, tags=unknown)" host_details="(Linux 6.8.0-1021-azure #25-Ubuntu SMP Wed Jan 15 20:45:09 UTC 2025 x86_64 fd3d6ffe3ac4 (none))" fd_limits="(soft=1048576, hard=1048576)" vm_limits="(soft=unlimited, hard=unlimited)"
            time=2025-04-01T08:07:57.584Z level=INFO source=main.go:792 msg="Leaving GOMAXPROCS=4: CPU quota undefined" component=automaxprocs
            time=2025-04-01T08:07:57.589Z level=INFO source=web.go:654 msg="Start listening for connections" component=web address=127.0.0.1:36843
            time=2025-04-01T08:07:57.589Z level=ERROR source=main.go:1018 msg="Unable to start web listener" err="listen tcp 127.0.0.1:36843: bind: address already in use"
            
        query_log_test.go:116: 
            	Error Trace:	/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:116
            	            				/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:341
            	            				/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:479
            	Error:      	Received unexpected error:
            	            	Get "http://127.0.0.1:36843/api/v1/query_range?step=5&start=0&end=3600&query=query_with_api": dial tcp 127.0.0.1:36843: connect: connection refused
            	Test:       	TestQueryLog/api_queries,_127.0.0.1:36843,_enabled_at_start
FAIL

@ryanwuer commented Apr 3, 2025

@machine424 Please help confirm whether this PR makes sense when you get some free time.

@machine424 left a comment

thanks for this.
Some comments.
We'll need to give this a try on a real kube cluster; let's discuss that once we're happy with the code.

@ryanwuer commented Apr 9, 2025

@machine424 After TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate runs, no service label is patched into the endpointslice's labels, which means an update to a service in another namespace has no effect on endpointslice discovery.

Labels: model.LabelSet{
	"__meta_kubernetes_endpointslice_address_type":                            "IPv4",
	"__meta_kubernetes_endpointslice_name":                                    "testendpoints",
	"__meta_kubernetes_endpointslice_label_kubernetes_io_service_name":        "testendpoints",
	"__meta_kubernetes_endpointslice_labelpresent_kubernetes_io_service_name": "true",
	"__meta_kubernetes_endpointslice_annotation_test_annotation":              "test",
	"__meta_kubernetes_endpointslice_annotationpresent_test_annotation":       "true",
	"__meta_kubernetes_namespace":                                             "default",
}

I debugged locally and found that len(endpointSlices) is 0 when the service in the default2 namespace is updated. So namespacedName() is not necessary. Please help confirm the logic.

endpointSlices, err := e.endpointSliceInf.GetIndexer().ByIndex(serviceIndex, svc.Name)
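
For context, a rough sketch of how that lookup could replace the full-store iteration from the TODO above (reusing the adaptor and enqueue helpers shown there; details may differ from the actual change):

// Fetch only the EndpointSlices indexed under this Service's name,
// rather than walking every object in the store.
endpointSlices, err := e.endpointSliceInf.GetIndexer().ByIndex(serviceIndex, svc.Name)
if err != nil {
	e.logger.Error("retrieving endpointslices by service name failed", "err", err)
	return
}
for _, obj := range endpointSlices {
	esa, err := e.getEndpointSliceAdaptor(obj)
	if err != nil {
		e.logger.Error("converting to EndpointSlice object failed", "err", err)
		continue
	}
	// The index already matches on the service-name label, so no per-object label check is needed.
	e.enqueue(esa.get())
}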

@ryanwuer

For best-practice considerations, I'm still willing to use namespacedName().

@machine424

I should have been more explicit.

It's true that the service informer is namespaced (s := d.client.CoreV1().Services(namespace)), but only when the config itself is namespaced; Prometheus could be given cluster-wide permissions and we'd end up with cluster-wide informers (return []string{apiv1.NamespaceAll}).

So, for some setups, a change to a service with the same name but from another namespace could still enqueue the endpointslice and cause unnecessary work. During processing, the labels of the appropriate service in the same namespace will end up being used, but we'll still be doing unnecessary work.

Having services with the same name across different namespaces may not be that common, but it's always good to avoid that unnecessary work.

The test TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate would need to make sure no enqueue/no-op processing happens when an unrelated service is updated AND the SD is configured to watch resources cluster-wide (via n, c := makeDiscovery(RoleEndpointSlice, NamespaceDiscovery{}, makeEndpointSliceV1())). The problem is that I don't think we currently have test tools to make sure of that.

So let's do this in two steps:

  • Make the check "if lv, exists := es.Labels[v1.LabelServiceName]; exists && lv == svc.Name {" namespaced in one PR, and maybe add a test if you can think of a check (no need to spend time on this if not)
  • Introduce the indexer in another PR.
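
For illustration only, a rough sketch of what a namespace-aware index key could look like once the indexer is introduced; the names here are hypothetical, not necessarily what the follow-up PRs use, and the imports match the earlier sketch:

// namespacedName builds the usual "namespace/name" key, so that only
// EndpointSlices owned by a Service in the same namespace are matched.
func namespacedName(namespace, name string) string {
	return namespace + "/" + name
}

// Index side: key each EndpointSlice by the namespaced name of its Service.
func endpointSliceServiceKey(obj interface{}) ([]string, error) {
	es, ok := obj.(*v1.EndpointSlice)
	if !ok {
		return nil, fmt.Errorf("object is not an EndpointSlice: %T", obj)
	}
	svcName, exists := es.Labels[v1.LabelServiceName]
	if !exists {
		return nil, nil
	}
	return []string{namespacedName(es.Namespace, svcName)}, nil
}

The lookup side would then query ByIndex with namespacedName(svc.Namespace, svc.Name) instead of the bare service name, so a same-named Service in another namespace no longer matches.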

@ryanwuer

@machine424 Really appreciate your explanation and suggestions. I've made another PR to make "if lv, exists := es.Labels[v1.LabelServiceName]; exists && lv == svc.Name {" namespaced. Please take a look.

#16433

@machine424

could you rebase?

ryanwuer added 2 commits May 9, 2025 17:20
… LabelServiceName so not have to iterate over all endpoint objects.

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
…hUnrelatedServiceUpdate' unit test to give a regression test

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@ryanwuer force-pushed the endpointslice-refactor branch from 35cad48 to 6a07992 May 9, 2025 09:21
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@ryanwuer commented May 9, 2025

@machine424 Rebase is done. I just made another commit to make the service indexer namespaced. Please take a look and see if it makes sense.

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@machine424 left a comment

lgtm, thanks again!
just some nits.

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@ryanwuer

Suggestions applied. Thanks.

@machine424 left a comment

lgtm, thanks again!

@machine424 commented May 20, 2025

Well, I even approved it twice; there couldn't be anything wrong with it :)

@machine424 merged commit 091e662 into prometheus:main May 20, 2025
27 checks passed