Skip to content

Conversation

darox
Copy link
Contributor

@darox darox commented Jul 23, 2024

This commit fixes serveral issues that were
reported in the status commands of cilium-dbg.

  • Fix the issue where not up to 10 unreachable nodes were shown in the status command.
  • Fix the issue where --all-nodes didn't return all nodes regardless of their status.
  • --all-health now returns the health of all modules regardless of their status. This can be paired with --all-nodes to get the health of all nodes.
  • --verbose now shows the latency to nodes and endpoints.

Fixes: #33697

@darox darox requested review from a team as code owners July 23, 2024 14:55
@darox darox requested review from ldelossa and derailed July 23, 2024 14:55
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 23, 2024
@darox darox marked this pull request as draft July 23, 2024 14:56
@darox darox force-pushed the fix-cilium-dgb-health branch from 3a6dcf9 to 1892fe1 Compare July 23, 2024 15:18
@darox darox marked this pull request as ready for review July 24, 2024 09:37
@darox darox force-pushed the fix-cilium-dgb-health branch from 1892fe1 to af3dd44 Compare July 24, 2024 09:39
@joestringer joestringer changed the title cilium-dbg: fix status commands cilium-dbg: fix status commands for cluster connectivity health Jul 24, 2024
@joestringer joestringer added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jul 24, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 24, 2024
@darox darox marked this pull request as draft July 25, 2024 10:30
@darox
Copy link
Contributor Author

darox commented Jul 25, 2024

Converted back to draft as I work on adding tests for the FormatHealthStatusResponse function.

@darox darox force-pushed the fix-cilium-dgb-health branch 3 times, most recently from 5163aa8 to c6fb670 Compare July 25, 2024 14:21
@darox darox marked this pull request as ready for review July 25, 2024 14:22
Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darox Nice work!

@darox darox requested a review from derailed July 26, 2024 09:27
@darox
Copy link
Contributor Author

darox commented Jul 26, 2024

@derailed I pushed the changes you've requested. I will rebase once you're OK with the changes.

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darox Thank you for the updates!

@marseel
Copy link
Contributor

marseel commented Aug 16, 2024

/test

@marseel
Copy link
Contributor

marseel commented Aug 22, 2024

@darox seems like some tests are failing, would you mind rebasing PR, hopefully it fixes it 😃 ?

@darox
Copy link
Contributor Author

darox commented Sep 6, 2024

@darox seems like some tests are failing, would you mind rebasing PR, hopefully it fixes it 😃 ?

I need a bit more time, because there's now a conflict:

+++ b/pkg/health/client/client.go
@@@ -325,29 -315,8 +325,34 @@@ func formatNodeStatus(w io.Writer, nod
        if localhost {
                localStr = " (localhost)"
        }
++<<<<<<< HEAD
 +      if succinct {
 +              if printAll || !nodeIsHealthy(node) {
 +                      ips := []string{getPrimaryAddressIP(node)}
 +                      for _, addr := range GetHostSecondaryAddresses(node) {
 +                              if addr == nil {
 +                                      continue
 +                              }
 +                              ips = append(ips, addr.IP)
 +                      }
 +                      hostStatuses := SummarizePathConnectivityStatusType(GetAllHostAddresses(node))
 +                      endpointStatuses := SummarizePathConnectivityStatusType(GetAllEndpointAddresses(node))
 +
 +                      fmt.Fprintf(w, "  %s%s\t%s\t%d/%d", node.Name, localStr, strings.Join(ips, ","), hostStatuses[ConnStatusReachable], len(GetAllHostAddresses(node)))
 +                      if hostStatuses[ConnStatusUnknown] > 0 {
 +                              fmt.Fprintf(w, " (%d unknown)", hostStatuses[ConnStatusUnknown])
 +                      }
 +                      fmt.Fprintf(w, "\t%d/%d", endpointStatuses[ConnStatusReachable], len(GetAllEndpointAddresses(node)))
 +                      if endpointStatuses[ConnStatusUnknown] > 0 {
 +                              fmt.Fprintf(w, " (%d unknown)", endpointStatuses[ConnStatusUnknown])
 +                      }
 +                      fmt.Fprintf(w, "\n")
 +              }
 +      } else {
++=======
+

@darox darox force-pushed the fix-cilium-dgb-health branch from 1f5eadc to d170d99 Compare September 11, 2024 12:01
@darox
Copy link
Contributor Author

darox commented Sep 11, 2024

@marseel I rebased and fixed the conflicts, let's see if it goes through.

@darox darox force-pushed the fix-cilium-dgb-health branch from d170d99 to 42379c1 Compare September 11, 2024 12:16
@darox darox force-pushed the fix-cilium-dgb-health branch from 6f10193 to a3369bd Compare October 3, 2024 11:28
@darox
Copy link
Contributor Author

darox commented Oct 3, 2024

I don't see how this PR would affect the failing CI task.

@marseel
Copy link
Contributor

marseel commented Oct 3, 2024

I've rerun test, let's see if it was just a flake (probably)

@darox darox marked this pull request as ready for review October 3, 2024 12:37
@darox
Copy link
Contributor Author

darox commented Oct 3, 2024

I've rerun test, let's see if it was just a flake (probably)

passed

@darox
Copy link
Contributor Author

darox commented Oct 4, 2024

@ldelossa can we get it merged?

@marseel
Copy link
Contributor

marseel commented Oct 4, 2024

/test

1 similar comment
@darox
Copy link
Contributor Author

darox commented Oct 7, 2024

/test

This commit fixes several issues that were
reported in the status commands of cilium-dbg.

- Fix the issue where not up to 10 unreachable
  nodes were shown in the status command.
- Fix the issue where `--all-nodes` didn't
  return all nodes regardless of their status.
- `--all-health` now returns the health of all
  modules regardless of their status. This can be
  paired with `--all-nodes` to get the health of
  all nodes.
- `--verbose` now shows the latency to nodes and
  endpoints.

Signed-off-by: darox <maderdario@gmail.com>
@darox darox force-pushed the fix-cilium-dgb-health branch from a3369bd to 4efb41f Compare October 7, 2024 08:14
@darox
Copy link
Contributor Author

darox commented Oct 7, 2024

/test

@marseel
Copy link
Contributor

marseel commented Oct 8, 2024

@ldelossa would you mind taking a look or should we reassign?

Copy link
Contributor

@ldelossa ldelossa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 20, 2024
@ldelossa ldelossa added this pull request to the merge queue Oct 20, 2024
Merged via the queue into cilium:main with commit bf7bc43 Oct 20, 2024
63 checks passed
@tklauser tklauser mentioned this pull request Oct 22, 2024
10 tasks
@tklauser tklauser added the backport/author The backport will be carried out by the author of the PR. label Oct 22, 2024
@tklauser
Copy link
Member

Backport to v1.16 in #35468 lead to non-trivial merge conflicts, this will need a manual backport. Marking as backport/author.

@darox
Copy link
Contributor Author

darox commented Oct 31, 2024

What are the steps I have to take, to get this backported to v1.16?

@giorio94
Copy link
Member

What are the steps I have to take, to get this backported to v1.16?

Hey @darox 👋 Are you still willing to take care of this author backport?

If so, the easiest approach is probably to:

  1. Check out the v1.16 branch
  2. Execute ./contrib/backporting/cherry-pick bf7bc43bd53ca77934513ffd887a59974deca24c
  3. Fix the conflicts
  4. Run git am --continue
  5. Amend the commit to add a description of the addressed conflicts (e.g., ce33255)
  6. Open a PR against the v1.16 branch.
  7. Once merged, change the label of this one to backport-done/v1.16

Feel free to reach out to me if you have questions.

@darox
Copy link
Contributor Author

darox commented Jan 27, 2025

@giorio94 If I will find time, I will take a look.

@darox
Copy link
Contributor Author

darox commented Feb 6, 2025

I had some time to check the code and I think we first need to apply fee6107 before I can apply my commit. My whole commit is based on the former commit. Now, the question is, if we can apply fee6107 because it deprecates some metrics. This commit was first applied on 1.17. I pushed it to #37463 which fixes it. But someone will have to decide, whether that's fine.

@giorio94
Copy link
Member

giorio94 commented Feb 6, 2025

Now, the question is, if we can apply fee6107 because it deprecates some metrics.

That commit introduces a new feature, which is typically not eligible for being backported, and IMO it would come with a significant risk for stable releases. If that commit is a strict requirement to backport this PR, my feeling is that it would be better skipping this backport, considering that it fixes a CLI command that most likely always suffered from this bug.

@darox
Copy link
Contributor Author

darox commented Feb 6, 2025

Now, the question is, if we can apply fee6107 because it deprecates some metrics.

That commit introduces a new feature, which is typically not eligible for being backported, and IMO it would come with a significant risk for stable releases. If that commit is a strict requirement to backport this PR, my feeling is that it would be better skipping this backport, considering that it fixes a CLI command that most likely always suffered from this bug.

Yes, I would agree with that.

@giorio94 giorio94 removed backport/author The backport will be carried out by the author of the PR. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Feb 6, 2025
@giorio94
Copy link
Member

giorio94 commented Feb 6, 2025

Ack, labels dropped.

@darox darox mentioned this pull request Feb 7, 2025
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cilium-dbg status: Cluster health reporting unintuitive/incorrect results
7 participants