Skip to content

hubble-relay: refactor deprecated call to grpc.DialContext #36027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

devodev
Copy link
Contributor

@devodev devodev commented Nov 18, 2024

This PR replaces deprecated calls to grpc.DialContext with grpc.NewClient. It also removes the --dial-timeout flag from the CLI and deprecates the hubble.relay.dialTimeout in the Helm chart.

Rework of: #34261
Fixes: #33065
Fixes: #36070

Refactor deprecated call to grpc.DialContext in Hubble Relay

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 18, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Nov 18, 2024
@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch from 5c4ffba to 9237203 Compare November 19, 2024 00:24
@devodev devodev marked this pull request as ready for review November 19, 2024 00:28
@devodev devodev requested review from a team as code owners November 19, 2024 00:28
@devodev devodev requested review from kaworu, joamaki, squeed and a user November 19, 2024 00:28
@chancez
Copy link
Contributor

chancez commented Nov 19, 2024

I'm a bit unsure of the attempts to retain previous behavior regarding connecting to peers eagerly, though I'm still looking at the options we have available. I'm also a bit concerned about the fact that Connect() is experimental.

I'm also not sure if this is true:

the manager would report connected right away for all peers that could create a ClientConn before we could confirm if we can actually reach them.

Because at least in some areas it's reporting the underlying connection state, so it would report idle in that case, until an RPC is made.

Can you clarify what you mean? Perhaps you could show what hubble list nodes reports with and without the extra logic? Maybe also what the hubble list nodes reports before and after an hubble observe flows?

@devodev
Copy link
Contributor Author

devodev commented Nov 19, 2024

I'm a bit unsure of the attempts to retain previous behavior regarding connecting to peers eagerly, though I'm still looking at the options we have available. I'm also a bit concerned about the fact that Connect() is experimental.

I'm also not sure if this is true:

the manager would report connected right away for all peers that could create a ClientConn before we could confirm if we can actually reach them.

Because at least in some areas it's reporting the underlying connection state, so it would report idle in that case, until an RPC is made.

Can you clarify what you mean? Perhaps you could show what hubble list nodes reports with and without the extra logic? Maybe also what the hubble list nodes reports before and after an hubble observe flows?

My bad, I actually meant connected as in the conn being set on the peer struct from manager.connect() and the log statement "connected" emitted for the peer. I felt this could be semi-lying compared to the previous impl where we would fail right away.

But I just looked at the observer server, and it does filter on the state, and I guess the peer will correctly report a failed state on first rpc and be filtered out?

I think I'll move this back to draft and play a bit with this plus update/add tests to see how it behaves.

@devodev devodev marked this pull request as draft November 19, 2024 01:30
@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch 2 times, most recently from acaa1b7 to 39919d3 Compare November 19, 2024 15:57
@devodev
Copy link
Contributor Author

devodev commented Nov 19, 2024

Alright I must admit I made my life more difficult for no reason. I updated the PR and removed the WaitForReady logic which after testing does not seem necessary at all.

@devodev devodev marked this pull request as ready for review November 19, 2024 16:45
@devodev devodev requested a review from chancez November 19, 2024 17:00
@chancez
Copy link
Contributor

chancez commented Nov 19, 2024

/test

@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch from 39919d3 to 0628988 Compare November 19, 2024 18:01
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a little docs comment

@chancez
Copy link
Contributor

chancez commented Nov 19, 2024

I'm a bit unsure of the attempts to retain previous behavior regarding connecting to peers eagerly, though I'm still looking at the options we have available. I'm also a bit concerned about the fact that Connect() is experimental.
I'm also not sure if this is true:

the manager would report connected right away for all peers that could create a ClientConn before we could confirm if we can actually reach them.

Because at least in some areas it's reporting the underlying connection state, so it would report idle in that case, until an RPC is made.
Can you clarify what you mean? Perhaps you could show what hubble list nodes reports with and without the extra logic? Maybe also what the hubble list nodes reports before and after an hubble observe flows?

My bad, I actually meant connected as in the conn being set on the peer struct from manager.connect() and the log statement "connected" emitted for the peer. I felt this could be semi-lying compared to the previous impl where we would fail right away.

But I just looked at the observer server, and it does filter on the state, and I guess the peer will correctly report a failed state on first rpc and be filtered out?

I think I'll move this back to draft and play a bit with this plus update/add tests to see how it behaves.

Can you provide the output before/after first RPC?

@devodev
Copy link
Contributor Author

devodev commented Nov 19, 2024

Can you provide the output before/after first RPC?

It says connected for all peers regardless if an RPC has been made or not.

I also tested with hubble how it reacts when a hubble peer disappears:

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS      AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected   2m54s   1.24      248/4095 (  6.06%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Connected   7m54s   17.47     4095/4095 (100.00%)

# kill cilium node

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS      AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected   2m55s   1.29      260/4095 (  6.35%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Error       N/A     N/A       N/A

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS        AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected     2m56s   1.31      265/4095 (  6.47%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Unavailable   N/A     N/A       N/A

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS      AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected   3m11s   1.48      318/4095 (  7.77%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Connected   6s      25.67     205/4095 (  5.01%)

@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch from 0628988 to c148f6f Compare November 19, 2024 23:38
Copy link
Contributor

@joamaki joamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sig-k8s LGTM

@kaworu kaworu added release-note/misc This PR makes changes that have no direct user impact. sig/hubble labels Nov 20, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 20, 2024
Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @devodev 👍

Overall patch LGTM but I'll rather keep the CLI flag, see my comment on the serve.go file.

@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch from c148f6f to bcc0033 Compare November 20, 2024 19:30
@chancez
Copy link
Contributor

chancez commented Nov 20, 2024

Can you provide the output before/after first RPC?

It says connected for all peers regardless if an RPC has been made or not.

I also tested with hubble how it reacts when a hubble peer disappears:

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS      AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected   2m54s   1.24      248/4095 (  6.06%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Connected   7m54s   17.47     4095/4095 (100.00%)

# kill cilium node

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS      AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected   2m55s   1.29      260/4095 (  6.35%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Error       N/A     N/A       N/A

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS        AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected     2m56s   1.31      265/4095 (  6.47%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Unavailable   N/A     N/A       N/A

$ ./hubble list nodes -P --tls --tls-allow-insecure
NAME                                                            STATUS      AGE     FLOWS/S   CURRENT/MAX-FLOWS
hubble-grpc-dialcontext/hubble-grpc-dialcontext-control-plane   Connected   3m11s   1.48      318/4095 (  7.77%)
hubble-grpc-dialcontext/hubble-grpc-dialcontext-worker          Connected   6s      25.67     205/4095 (  5.01%)

Hmm, I was hoping it would report "idle" or "unknown" until we've actually established a connection, after which it would report Connected/Unavailable.

Upon investigation, it looks liket it sets the state to Connected based on the gRPC state not being Shutdown/TransientError. However, it then goes onto make a ServerStatus RPC, so I guess as long as that succeeds, it is in a connected state, so I think it's working as expected.

@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch from bcc0033 to eed7b1d Compare November 21, 2024 15:20
@chancez
Copy link
Contributor

chancez commented Nov 21, 2024

/test

@devodev devodev requested review from a team as code owners November 21, 2024 17:35
@devodev devodev requested review from ldelossa and kaworu November 21, 2024 17:35
Copy link
Contributor

@ldelossa ldelossa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chancez
Copy link
Contributor

chancez commented Nov 25, 2024

/test

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
@devodev devodev force-pushed the pr/devodev/hubble-grpc-dialcontext-deprecated branch from 28c2bbe to 10e28c9 Compare November 26, 2024 18:32
@chancez
Copy link
Contributor

chancez commented Nov 27, 2024

/test

Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @devodev, patch LGTM now. Please open an issue to address the upcoming removal of both the .Values.hubble.relay.dialTimeout Helm value along with the --dial-timeout flag of Hubble Relay so we don't forget about them.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 28, 2024
@devodev
Copy link
Contributor Author

devodev commented Nov 28, 2024

Thank you @devodev, patch LGTM now. Please open an issue to address the upcoming removal of both the .Values.hubble.relay.dialTimeout Helm value along with the --dial-timeout flag of Hubble Relay so we don't forget about them.

Here we go: #36240

@julianwiedmann julianwiedmann added this pull request to the merge queue Nov 28, 2024
Merged via the queue into cilium:main with commit b1822e1 Nov 28, 2024
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warnings about Hubble Relay connectivity issues Replace deprecated gRPC Dial options in Hubble sub-systems
7 participants