Skip to content

Conversation

chancez
Copy link
Contributor

@chancez chancez commented Oct 30, 2024

When connecting to the hubble peer manager ensure we return the underlying connection error so it's easier to diagnose connection related problems.

Currently the only error returned is the context timeout:

time="2024-09-18T21:13:13Z" level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local.:443"

This will ensure the underlying connection level error is returned when the context is cancelled.

In the future we should switch to not using grpc.WithBlock at all which avoids many of these problems, but that requires more testing. In the short-term, lets set these dial options to improve things now, in-case the switch to non-blocking dials is delayed.

hubble-relay: Return underlying connection errors when connecting to peer manager

I'm marking for backport to v1.15 and 1.16 because the change is small and will be a huge improvement in errors that users see returned in Hubble Relay logs when they have an issue with Hubble Relay connecting to cilium agent.

…Error for peer manager client

When connecting to the hubble peer manager ensure we return the
underlying connection error so it's easier to diagnose connection
related problems.

Currently the only error returned is the context timeout:

```
time="2024-09-18T21:13:13Z" level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local.:443"
```

This will ensure the underlying connection level error is returned when
the context is cancelled.

In the future we should switch to not using grpc.WithBlock at all which
avoids many of these problems, but that requires more testing. In the
short-term, lets set these dial options to improve things now, in-case
the switch to non-blocking dials is delayed.

Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com>
@chancez chancez added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. needs-backport/1.15 needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 30, 2024
@chancez chancez self-assigned this Oct 30, 2024
@chancez chancez requested a review from a team as a code owner October 30, 2024 00:01
@chancez chancez requested a review from rolinh October 30, 2024 00:01
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure users will greatly appreciate the change 🙂

@rolinh
Copy link
Member

rolinh commented Oct 30, 2024

/test

@rolinh rolinh enabled auto-merge October 30, 2024 08:52
@rolinh rolinh self-requested a review October 30, 2024 09:07
@rolinh rolinh disabled auto-merge October 30, 2024 09:07
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, got confused with the ClientConnBuilder for a moment (vs ClientBuilder). So in effect, we've been using these options to connect to Hubble peers for a long time but never did for the connection to the peer manager itself. All good, let's merge this change.

@rolinh rolinh enabled auto-merge October 30, 2024 09:12
@rolinh rolinh added this pull request to the merge queue Oct 30, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 30, 2024
Merged via the queue into main with commit 0b8beea Oct 30, 2024
277 of 278 checks passed
@rolinh rolinh deleted the pr/chancez/relay_peer_errors branch October 30, 2024 10:04
@joamaki joamaki mentioned this pull request Nov 5, 2024
4 tasks
@joamaki joamaki added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Nov 5, 2024
@joamaki joamaki mentioned this pull request Nov 5, 2024
23 tasks
@joamaki joamaki added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Nov 5, 2024
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants