
Conversation

@ryanwuer commented Apr 1, 2025

Trying to solve this TODO:

// TODO(brancz): use cache.Indexer to index endpointslices by
// LabelServiceName so this operation doesn't have to iterate over all
// endpoint objects.
for _, obj := range e.endpointSliceStore.List() {
	esa, err := e.getEndpointSliceAdaptor(obj)
	if err != nil {
		e.logger.Error("converting to EndpointSlice object failed", "err", err)
		continue
	}
	if lv, exists := esa.labels()[esa.labelServiceName()]; exists && lv == svc.Name {
		e.enqueue(esa.get())
	}
}
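
For reference, a hedged sketch (not the PR's actual diff) of how such an index could be registered with client-go. The serviceIndex name and addServiceIndex helper are made up for illustration, and only the discovery/v1 EndpointSlice type is handled here:

package kubernetes

import (
	"fmt"

	v1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/tools/cache"
)

// serviceIndex is a hypothetical index name; the merged change may use a different one.
const serviceIndex = "service"

// addServiceIndex registers an index that keys each EndpointSlice by its
// kubernetes.io/service-name label, so the slices belonging to a Service can be
// looked up directly instead of iterating over the whole store.
func addServiceIndex(inf cache.SharedIndexInformer) error {
	return inf.AddIndexers(cache.Indexers{
		serviceIndex: func(obj interface{}) ([]string, error) {
			es, ok := obj.(*v1.EndpointSlice)
			if !ok {
				return nil, fmt.Errorf("object is not an EndpointSlice: %T", obj)
			}
			svcName, exists := es.Labels[v1.LabelServiceName]
			if !exists {
				// Slices without the label are simply left out of the index.
				return nil, nil
			}
			return []string{svcName}, nil
		},
	})
}

With the index in place, the slices for a given Service can be fetched via GetIndexer().ByIndex(serviceIndex, svc.Name) rather than listing every object.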

@ryanwuer requested a review from brancz as a code owner April 1, 2025 08:03
@ryanwuer changed the title from "refactor(endpointslice): use cache.Indexer to achieve better iteration" to "refactor(endpointslice): use service cache.Indexer to achieve better iteration performance" Apr 1, 2025
@ryanwuer commented Apr 1, 2025

@brancz Hi, could you please take a look at this PR? Does it meet expectations?

@ryanwuer commented Apr 1, 2025

The "CI / More Go tests" job failed because of the error below, which looks like it has nothing to do with my PR?

--- FAIL: TestQueryLog (0.00s)
    --- FAIL: TestQueryLog/api_queries,_127.0.0.1:36843,_enabled_at_start (0.64s)
        query_log_test.go:322: time=2025-04-01T08:07:57.583Z level=INFO source=main.go:660 msg="No time or size retention was set so using the default time retention" duration=15d
            time=2025-04-01T08:07:57.583Z level=INFO source=main.go:711 msg="Starting Prometheus Server" mode=server version="(version=, branch=, revision=unknown)"
            time=2025-04-01T08:07:57.583Z level=WARN source=main.go:713 msg="This Prometheus binary has not been compiled for a 64-bit architecture. Due to virtual memory constraints of 32-bit systems, it is highly recommended to switch to a 64-bit binary of Prometheus." GOARCH=386
            time=2025-04-01T08:07:57.583Z level=INFO source=main.go:716 msg="operational information" build_context="(go=go1.23.7, platform=linux/386, user=, date=, tags=unknown)" host_details="(Linux 6.8.0-1021-azure #25-Ubuntu SMP Wed Jan 15 20:45:09 UTC 2025 x86_64 fd3d6ffe3ac4 (none))" fd_limits="(soft=1048576, hard=1048576)" vm_limits="(soft=unlimited, hard=unlimited)"
            time=2025-04-01T08:07:57.584Z level=INFO source=main.go:792 msg="Leaving GOMAXPROCS=4: CPU quota undefined" component=automaxprocs
            time=2025-04-01T08:07:57.589Z level=INFO source=web.go:654 msg="Start listening for connections" component=web address=127.0.0.1:36843
            time=2025-04-01T08:07:57.589Z level=ERROR source=main.go:1018 msg="Unable to start web listener" err="listen tcp 127.0.0.1:36843: bind: address already in use"
            
        query_log_test.go:116: 
            	Error Trace:	/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:116
            	            				/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:341
            	            				/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:479
            	Error:      	Received unexpected error:
            	            	Get "http://127.0.0.1:36843/api/v1/query_range?step=5&start=0&end=3600&query=query_with_api": dial tcp 127.0.0.1:36843: connect: connection refused
            	Test:       	TestQueryLog/api_queries,_127.0.0.1:36843,_enabled_at_start
FAIL

@ryanwuer commented Apr 3, 2025

@machine424 Please help confirm whether this PR makes sense when you get some free time.

@machine424 left a comment

thanks for this.
Some comments.
We'll need to give this a try on a real kube cluster; let's discuss that once we're happy with the code.

@ryanwuer commented Apr 9, 2025

@machine424 After TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate runs, no service label is patched into the endpointslice's labels, which means an update to a service in another namespace has no effect on endpointslice discovery.

Labels: model.LabelSet{
	"__meta_kubernetes_endpointslice_address_type":                            "IPv4",
	"__meta_kubernetes_endpointslice_name":                                    "testendpoints",
	"__meta_kubernetes_endpointslice_label_kubernetes_io_service_name":        "testendpoints",
	"__meta_kubernetes_endpointslice_labelpresent_kubernetes_io_service_name": "true",
	"__meta_kubernetes_endpointslice_annotation_test_annotation":              "test",
	"__meta_kubernetes_endpointslice_annotationpresent_test_annotation":       "true",
	"__meta_kubernetes_namespace":                                             "default",
}

I debugged locally and found that len(endpointSlices) is 0 when the service in the default2 namespace is updated. So namespacedName() is not necessary. Please help confirm the logic.

endpointSlices, err := e.endpointSliceInf.GetIndexer().ByIndex(serviceIndex, svc.Name)
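
For context, a rough sketch of how that lookup could replace the full-store iteration from the TODO above (reusing the adaptor and enqueue helpers shown there; details may differ from the actual change):

// Fetch only the EndpointSlices indexed under this Service's name,
// rather than walking every object in the store.
endpointSlices, err := e.endpointSliceInf.GetIndexer().ByIndex(serviceIndex, svc.Name)
if err != nil {
	e.logger.Error("retrieving endpointslices by service name failed", "err", err)
	return
}
for _, obj := range endpointSlices {
	esa, err := e.getEndpointSliceAdaptor(obj)
	if err != nil {
		e.logger.Error("converting to EndpointSlice object failed", "err", err)
		continue
	}
	// The index already matches on the service-name label, so no per-object label check is needed.
	e.enqueue(esa.get())
}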

@ryanwuer

For best-practice considerations, I'm still willing to use namespacedName().

@machine424

I should have been more explicit.

It's true that the service informer is namespaced (s := d.client.CoreV1().Services(namespace)), but only when the config itself is namespaced; Prometheus could be given cluster-wide permissions and we'd end up with cluster-wide informers (return []string{apiv1.NamespaceAll}).

So, for some setups, a change to a service with the same name but from another namespace could still enqueue the endpointslice and cause unnecessary work. During processing, the labels of the appropriate service in the same namespace will end up being used, but we'll still be doing unnecessary work.

Having services with the same name across different namespaces may not be that common, but it's always good to avoid that unnecessary work.

The test TestEndpointSliceDiscoveryWithUnrelatedServiceUpdate would need to make sure no enqueue/no-op processing happens when an unrelated service is updated AND the SD is configured to watch resources cluster-wide (via n, c := makeDiscovery(RoleEndpointSlice, NamespaceDiscovery{}, makeEndpointSliceV1())). The problem is that I don't think we currently have test tools to make sure of that.

So let's do this in two steps:

  • Make the check "if lv, exists := es.Labels[v1.LabelServiceName]; exists && lv == svc.Name {" namespaced in one PR, and maybe add a test if you can think of a check (no need to spend time on this if not)
  • Introduce the indexer in another PR.
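
For illustration only, a rough sketch of what a namespace-aware index key could look like once the indexer is introduced; the names here are hypothetical, not necessarily what the follow-up PRs use, and the imports match the earlier sketch:

// namespacedName builds the usual "namespace/name" key, so that only
// EndpointSlices owned by a Service in the same namespace are matched.
func namespacedName(namespace, name string) string {
	return namespace + "/" + name
}

// Index side: key each EndpointSlice by the namespaced name of its Service.
func endpointSliceServiceKey(obj interface{}) ([]string, error) {
	es, ok := obj.(*v1.EndpointSlice)
	if !ok {
		return nil, fmt.Errorf("object is not an EndpointSlice: %T", obj)
	}
	svcName, exists := es.Labels[v1.LabelServiceName]
	if !exists {
		return nil, nil
	}
	return []string{namespacedName(es.Namespace, svcName)}, nil
}

The lookup side would then query ByIndex with namespacedName(svc.Namespace, svc.Name) instead of the bare service name, so a same-named Service in another namespace no longer matches.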

@ryanwuer

@machine424 Really appreciate your explanation and suggestions. I've made another PR to make "if lv, exists := es.Labels[v1.LabelServiceName]; exists && lv == svc.Name {" namespaced. Please take a look.

#16433

@machine424

could you rebase?

ryanwuer added 2 commits May 9, 2025 17:20
… LabelServiceName so not have to iterate over all endpoint objects.

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
…hUnrelatedServiceUpdate' unit test to give a regression test

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@ryanwuer force-pushed the endpointslice-refactor branch from 35cad48 to 6a07992 May 9, 2025 09:21
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@ryanwuer commented May 9, 2025

@machine424 Rebase is done. I just made another commit to make the service indexer namespaced. Please take a look and see if it makes sense.

Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@machine424 left a comment

lgtm, thanks again!
just some nits.

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Ryan Wu <rongjun0821@gmail.com>
@ryanwuer

Suggestions applied. Thanks.

@machine424 left a comment

lgtm, thanks again!

@machine424 commented May 20, 2025

Well, I even approved it twice; there couldn't be anything wrong with it :)

@machine424 merged commit 091e662 into prometheus:main May 20, 2025
27 checks passed