Skip to content

Conversation

chancez
Copy link
Contributor

@chancez chancez commented Mar 13, 2023

@kaworu found a bug when testing another PR based on master when using httpV2 metrics:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23a666e]

goroutine 1271 [running]:
github.com/cilium/cilium/pkg/hubble/metrics/http.(*httpHandler).ListMetricVec(0xc00079c150)
	/go/src/github.com/cilium/cilium/pkg/hubble/metrics/http/handler.go:85 +0x4e
github.com/cilium/cilium/pkg/hubble/metrics/api.Handlers.ProcessPodDeletion({{0xc000748680, 0x6, 0x8}, {0xc000748700, 0x6, 0x8}}, 0xc001f56820)
	/go/src/github.com/cilium/cilium/pkg/hubble/metrics/api/api.go:134 +0xad
github.com/cilium/cilium/pkg/hubble/metrics.initPodDeletionHandler.func1()
	/go/src/github.com/cilium/cilium/pkg/hubble/metrics/metrics.go:90 +0x45
created by github.com/cilium/cilium/pkg/hubble/metrics.initPodDeletionHandler
	/go/src/github.com/cilium/cilium/pkg/hubble/metrics/metrics.go:84 +0x9d

This was introduced in #23385 which seems to have handled all the other metrics correctly, but http is a bit special and it didn't handle the v2 metrics option.

I wrote a new test (second commit) which reproduces the panic:

 (⎈|kind-kind:N/A) ~/g/s/g/c/cilium ❯❯❯ go test -v ./pkg/hubble/metrics/http                                                                                                                                                                                                             master ⬆ ✭
=== RUN   Test_httpHandler_Status
--- PASS: Test_httpHandler_Status (0.00s)
=== RUN   Test_httpHandler_ProcessFlow
--- PASS: Test_httpHandler_ProcessFlow (0.00s)
=== RUN   Test_httpHandlerV2_ProcessFlow
--- PASS: Test_httpHandlerV2_ProcessFlow (0.00s)
=== RUN   Test_httpHandler_ListMetricVec
--- PASS: Test_httpHandler_ListMetricVec (0.00s)
=== RUN   Test_httpV2Handler_ListMetricVec
--- FAIL: Test_httpV2Handler_ListMetricVec (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x100c6e650]

goroutine 65 [running]:
testing.tRunner.func1.2({0x100e115c0, 0x10136d580})
	/Users/chancezibolski/.asdf/installs/golang/1.19.5/go/src/testing/testing.go:1396 +0x1c8
testing.tRunner.func1()
	/Users/chancezibolski/.asdf/installs/golang/1.19.5/go/src/testing/testing.go:1399 +0x378
panic({0x100e115c0, 0x10136d580})
	/Users/chancezibolski/.asdf/installs/golang/1.19.5/go/src/runtime/panic.go:884 +0x204
github.com/cilium/cilium/pkg/hubble/metrics/http.(*httpHandler).ListMetricVec(0x14000078f30)
	/Users/chancezibolski/go/src/github.com/cilium/cilium/pkg/hubble/metrics/http/handler.go:85 +0x60
github.com/cilium/cilium/pkg/hubble/metrics/http.Test_httpV2Handler_ListMetricVec(0x0?)
	/Users/chancezibolski/go/src/github.com/cilium/cilium/pkg/hubble/metrics/http/handler_test.go:216 +0x10c
testing.tRunner(0x140002fe820, 0x100ed4100)
	/Users/chancezibolski/.asdf/installs/golang/1.19.5/go/src/testing/testing.go:1446 +0x10c
created by testing.(*T).Run
	/Users/chancezibolski/.asdf/installs/golang/1.19.5/go/src/testing/testing.go:1493 +0x300
FAIL	github.com/cilium/cilium/pkg/hubble/metrics/http	0.204s

And the first commit fixes this by properly handling the difference between http v2 metrics and v1.

Fix panic in hubble http v2 metrics

@chancez chancez requested review from a team as code owners March 13, 2023 17:01
@chancez chancez self-assigned this Mar 13, 2023
@chancez chancez requested a review from kaworu March 13, 2023 17:01
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 13, 2023
@chancez
Copy link
Contributor Author

chancez commented Mar 13, 2023

Since #23385 isn't in a release yet, what release note label is appropriate? misc?

@michi-covalent michi-covalent added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Mar 13, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 13, 2023
@michi-covalent michi-covalent added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. needs-backport/1.13 labels Mar 13, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Mar 13, 2023
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

chancez added 2 commits March 13, 2023 12:35
The new ListMetricVec didn't differentiate between v1 and v2 and
incorrectly used a nil metric: h.responses  which isn't registered in
http v2 metrics.

Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com>
http metrics has v1 and v2 plugins handled by the same httpHandler and
registers different metrics depending on v1 vs v2. This resulted in a
bug in ListMetricVec which didn't handle the v2 case.

Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com>
@chancez
Copy link
Contributor Author

chancez commented Mar 13, 2023

Just hit #23079 in integration-tests and #24355 in conformance ingress. Retrying.

@chancez
Copy link
Contributor Author

chancez commented Mar 13, 2023

This includes a new unit test, and the remaining tests won't test this functionality (they don't enable hubble metrics, nor do they enable http v2 metrics, otherwise it would have been caught) so I think we can skip the remaining required tests.

@gandro What do you think?

Also, we really should add some kind of integration/e2e tests for Hubble metrics.

Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chancez!

@@ -82,7 +86,7 @@ func (h *httpHandler) Context() *api.ContextOptions {
}

func (h *httpHandler) ListMetricVec() []*prometheus.MetricVec {
return []*prometheus.MetricVec{h.requests.MetricVec, h.responses.MetricVec, h.duration.MetricVec}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is ok since the only caller of this function doesn't mutate the returned value.

@gandro
Copy link
Member

gandro commented Mar 14, 2023

This includes a new unit test, and the remaining tests won't test this functionality (they don't enable hubble metrics, nor do they enable http v2 metrics, otherwise it would have been caught) so I think we can skip the remaining required tests.

@gandro What do you think?

Also, we really should add some kind of integration/e2e tests for Hubble metrics.

I personally am fine with merging this just based on unit tests. But we have been discussing being more strict about running CI even if we think it's not really needed. Better safe than sorry.

@kaworu
Copy link
Member

kaworu commented Mar 14, 2023

/test

@chancez
Copy link
Contributor Author

chancez commented Mar 14, 2023

k8s-1.26-kernel-net-next failed on the VM setup:

07:01:13  ==> k8s3-1.26: Waiting for machine to boot. This may take a few minutes...
07:01:13      k8s3-1.26: SSH address: 127.0.0.1:2201
07:01:13      k8s3-1.26: SSH username: vagrant
07:01:13      k8s3-1.26: SSH auth method: private key
07:11:16  Timed out while waiting for the machine to boot. This means that
07:11:16  Vagrant was unable to communicate with the guest machine within
07:11:16  the configured ("config.vm.boot_timeout" value) time period.

@chancez
Copy link
Contributor Author

chancez commented Mar 14, 2023

/test-1.26-net-next

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 14, 2023
@jrajahalme jrajahalme merged commit 9583749 into master Mar 15, 2023
@jrajahalme jrajahalme deleted the pr/chancez/hubble_http_v2_fix_panic branch March 15, 2023 08:43
@NikAleksandrov
Copy link

@michi-covalent @chancez this PR was marked for backporting to 1.13 but I don't see the relevant commits that add ListMetricVec() there and quickly looking over that code don't see anything that does it currently in 1.13.
Do you still think this should be backported to 1.13?

@NikAleksandrov NikAleksandrov mentioned this pull request Mar 23, 2023
29 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants