Skip to content

Conversation

davchos
Copy link
Contributor

@davchos davchos commented Aug 9, 2024

Remove deprecated call to DialContext in Hubble server

Related:#32274, #33065

Signed-off-by: David Chosrova dchosrova@gmail.com

@davchos davchos requested a review from a team as a code owner August 9, 2024 07:36
@davchos davchos requested a review from tklauser August 9, 2024 07:36
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 9, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Aug 9, 2024
@davchos
Copy link
Contributor Author

davchos commented Aug 9, 2024

Hi @tklauser,

I have 2 questions, before pushing this PR, I have used make kind / make kind-image-fast / make kind-install-cilium-fast
to test if everything was OK. I don't understand why the import not used has not been catched.

I have sent a commit to remove the unused import, the check is still failing, so is it best now to close this PR and re submit a new one.

Thanks in advance for your help.

David

@kaworu kaworu requested review from glrf and removed request for tklauser August 9, 2024 12:45
@kaworu kaworu added release-note/misc This PR makes changes that have no direct user impact. sig/hubble labels Aug 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 9, 2024
Copy link
Contributor

@glrf glrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi David thanks for the contribution!

For you're two questions:

  1. The reason why make kind-image-fast didn't catch the issue is because it doesn't actually build the relay image. So it wasn't actually compiling your code changes. Sadly, the Hubble Relay development isn't quite a streamlined, it's not even enabled by default when you use make kind / make kind-image-fast / make kind-install-cilium-fast.

    To actually run your Relay changes you need to build the image yourself and enable relay by running the following after you created the kind cluster

DOCKER_IMAGE_TAG=local make docker-hubble-relay-image
kind load docker-image quay.io/cilium/hubble-relay:local
cat << EOF > contrib/testing/kind-custom.yaml
hubble:
  relay:
    enabled: true
    image:
      override: quay.io/cilium/hubble-relay:local
      pullPolicy: Never
EOF
make kind-install-cilium-fast
  1. The reason CI is still failing is that we actually require that every commit builds successfully, not just the last commit. You can just squash your 4 commits into a single commit and CI should be happy again.

On a first look your changes look good, but I think we can actually get rid of the DialTimeout and the related option and flag as it doesn't do anything anymore (and then even remove it from the helm chart)

I still want take a closer look how this will impact the peer manager though, just to be sure we don't break anything unexpectedly by removing this timeout. But I won't get to that today, so I'll give you another review on Monday.

@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch 2 times, most recently from fcca79b to 3b1f5ba Compare August 12, 2024 06:49
@glrf
Copy link
Contributor

glrf commented Aug 12, 2024

Alright, I took another look and I think removing this timeout is fine. It won't change anything for the connection to the peers and connection retries to the peer service will have a slightly different timing, because we don't have the same timeout when dialing on the first call, but that should be fine.

So let's just remove the unused DialTimeout fields and options and the corresponding flag and this should be good.

@davchos
Copy link
Contributor Author

davchos commented Aug 12, 2024

Alright, I took another look and I think removing this timeout is fine. It won't change anything for the connection to the peers and connection retries to the peer service will have a slightly different timing, because we don't have the same timeout when dialing on the first call, but that should be fine.

So let's just remove the unused DialTimeout fields and options and the corresponding flag and this should be good.

Hi Fabian,

I will remove the timeout, not available today I do it tomorrow. Have a good day.

David

@davchos davchos requested review from a team as code owners August 13, 2024 08:16
@davchos davchos requested a review from squeed August 13, 2024 08:16
@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch 2 times, most recently from a9119fa to c7d591b Compare August 13, 2024 08:28
@davchos
Copy link
Contributor Author

davchos commented Aug 13, 2024

No idea why this test is failing (Conformance K8s Upstream Network / Installation and Conformance Test (ipv4)), the cilium installation looks ok, Hubble relay not enabled.

Copy link
Contributor

@glrf glrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final nitpick, otherwise this LGTM

And the CI failure seems like an unrelated flake.

@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch 3 times, most recently from 358f67b to 9af47e9 Compare August 13, 2024 15:28
Copy link
Contributor

@glrf glrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@glrf
Copy link
Contributor

glrf commented Aug 14, 2024

/test

@davchos
Copy link
Contributor Author

davchos commented Aug 14, 2024

@glrf, I have checked the failed checks, don't "think" (hope) it is related to the changes I have done.

David

@glrf
Copy link
Contributor

glrf commented Aug 15, 2024

Sadly, that one does seem related: https://github.com/cilium/cilium/actions/runs/10382885529/job/28746842560

We introduce a small issue during startup. When the DNS server is not (yet) available, DialContext blocked until the connections was ready or the timeout was reached. Now we do the resolving during the Notify call, which will instantly fail, assuming we'll retry ourselves. Which we'll do, but we'll log a warning. This means on a clean Cilium installation Hubble Relay will always log a warning. Which isn't horrible, but also not great and what currently fails CI.

IMO for now it would be fine to just drop the severity of that log message from Warning to Info. We're actively retrying and it will eventually succeed and is more or less what happened before, we just didn't log anything for 30 seconds.

Longer term we might want to take a closer look at the Peer service connection and how we handle retries and make something smarter than just tearing down everything and retrying in 30 seconds. But I think that's out of scope of this PR and that's something I wanted to do for a while now anyways.

@davchos
Copy link
Contributor Author

davchos commented Aug 15, 2024

Sadly, that one does seem related: https://github.com/cilium/cilium/actions/runs/10382885529/job/28746842560

We introduce a small issue during startup. When the DNS server is not (yet) available, DialContext blocked until the connections was ready or the timeout was reached. Now we do the resolving during the Notify call, which will instantly fail, assuming we'll retry ourselves. Which we'll do, but we'll log a warning. This means on a clean Cilium installation Hubble Relay will always log a warning. Which isn't horrible, but also not great and what currently fails CI.

IMO for now it would be fine to just drop the severity of that log message from Warning to Info. We're actively retrying and it will eventually succeed and is more or less what happened before, we just didn't log anything for 30 seconds.

Longer term we might want to take a closer look at the Peer service connection and how we handle retries and make something smarter than just tearing down everything and retrying in 30 seconds. But I think that's out of scope of this PR and that's something I wanted to do for a while now anyways.

Thanks for the explanation, I'll change the log level.

@ldelossa ldelossa requested a review from learnitall August 21, 2024 14:12
@maintainer-s-little-helper
Copy link

Commit 84cf6c6 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 21, 2024
@davchos
Copy link
Contributor Author

davchos commented Aug 21, 2024

Hi,

After running locally the "make -C bpf checkpatch" I get errors, "Warning: WARNING:BAD_SIGN_OFF: Duplicate signature".

Maybe it is better to close this PR and re create a clean one ?

David

Copy link
Contributor

@learnitall learnitall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a non-blocking nit, but LGTM. Thank you!

@maintainer-s-little-helper
Copy link

Commit 84cf6c6 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@glrf
Copy link
Contributor

glrf commented Aug 22, 2024

Maybe it is better to close this PR and re create a clean one ?

That shouldn't be necessary. I think the issue is related to the merge commits from merging main into this branch. At least the last merge commit isn't signed off which is definitely an issue and I think our tooling in general assumes that we rebase instead of merge to update branches.

Could you please drop your two merge commits and then rebase on main? That should fix the issue.

@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch from 85818bb to b32b9c4 Compare August 22, 2024 09:23
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 22, 2024
@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch from aeb9897 to b23dfb7 Compare August 22, 2024 10:03
@glrf
Copy link
Contributor

glrf commented Aug 23, 2024

/test

@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch from deecdc6 to 8bdf10c Compare August 23, 2024 11:53
Related:cilium#32274

Co-authored-by: Ryan Drew <learnitall0@gmail.com>
Signed-off-by: davchos <dchosrova@gmail.com>
@davchos davchos force-pushed the pr/Remove-deprecated-call--grpc.DialContext branch from 8bdf10c to 683210d Compare August 27, 2024 12:15
@glrf
Copy link
Contributor

glrf commented Aug 27, 2024

/test

@davchos
Copy link
Contributor Author

davchos commented Aug 27, 2024

@glrf , @squeed, @learnitall @kaworu,
Thanks for walking me through this commit, I really appreciate

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 9, 2024
@kaworu kaworu added dont-merge/blocked Another PR must be merged before this one. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Oct 9, 2024
@kaworu
Copy link
Member

kaworu commented Oct 9, 2024

Patch LGTM, thanks @glrf for following up and reviewing. Added the dont-merge/blocked label as IIUC the author still wishes to investigate #34261 (comment).

@derailed
Copy link
Contributor

@davchos Could you address the conflicts when you get a chance? Thank you!

@davchos
Copy link
Contributor Author

davchos commented Nov 4, 2024

@davchos Could you address the conflicts when you get a chance? Thank you!

I try to find some time this week.

@kaworu kaworu added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Nov 4, 2024
Signed-off-by: David Chosrova <dchosrova@gmail.com>
@davchos
Copy link
Contributor Author

davchos commented Nov 7, 2024

Well I think it is best to close this PR and work on a new one. If I find time I will try to resubmit a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont-merge/blocked Another PR must be merged before this one. dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. kind/community-contribution This was a contribution made by a community member. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants