Skip to content

Conversation

ungureanuvladvictor
Copy link
Member

The code is logging when this is happening but it would be useful to have them exposed as metrics to put them on a dashboard and to correlate with other events in the system.

Add counter to track all datapath timeouts due to FQDN IP updates

Tagging @christarazi for SA.

@ungureanuvladvictor ungureanuvladvictor requested review from a team as code owners May 13, 2022 02:58
@ungureanuvladvictor ungureanuvladvictor requested review from a team, sayboras and qmonnet May 13, 2022 02:58
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 13, 2022
@ungureanuvladvictor ungureanuvladvictor requested a review from jibi May 13, 2022 02:58
@ungureanuvladvictor ungureanuvladvictor added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. labels May 13, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 13, 2022
@ungureanuvladvictor
Copy link
Member Author

/test

@christarazi christarazi added the area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. label May 13, 2022
@ungureanuvladvictor
Copy link
Member Author

/test-gke

@ungureanuvladvictor
Copy link
Member Author

/ci-nat46x64

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment on metric metrics to conform with promlint, the rest looks good to me.

GHA conformance test with metric validation might not catch such issue, as unlikely we have all possible metrics, the lack of unit test on this will be tracked in another issue.

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels May 15, 2022
@sayboras sayboras removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 15, 2022
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked as request to change so that maintainer bot will not add ready to merge till small change in metric name is done.

Thanks.

@ungureanuvladvictor
Copy link
Member Author

ungureanuvladvictor commented May 15, 2022

/test

@sayboras -- should be ready for another pass.

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:

Click to show.

Test Name

K8sServicesTest Checks device reconfiguration Detects newly added device and reloads datapath

Failure Output

FAIL: Timed out after 60.001s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create one.

@ungureanuvladvictor
Copy link
Member Author

ungureanuvladvictor commented May 15, 2022

/test-1.23-net-next

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:

Click to show.

Test Name

K8sServicesTest Checks device reconfiguration Detects newly added device and reloads datapath

Failure Output

FAIL: Timed out after 60.000s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create one.

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and LGTM 💯

@sayboras
Copy link
Member

Marking this ready to merge, as the tests were successfully ran before minor change with metric name.

@sayboras sayboras added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 16, 2022
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you squash the commits? No need for them to be separate. Thanks!

@christarazi christarazi removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 16, 2022
Signed-off-by: Vlad Ungureanu <ungureanuvladvictor@gmil.com>
@ungureanuvladvictor ungureanuvladvictor force-pushed the vladu/metric-l7-datapath-timeout branch from 30df9a6 to d771086 Compare May 24, 2022 05:28
@ungureanuvladvictor
Copy link
Member Author

ungureanuvladvictor commented May 24, 2022

/test

@christarazi -- squashed + triggered the tests.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 24, 2022
@jibi jibi merged commit 2926892 into cilium:master May 25, 2022
@ungureanuvladvictor ungureanuvladvictor deleted the vladu/metric-l7-datapath-timeout branch May 25, 2022 15:21
@rahulkjoshi
Copy link
Contributor

Can we consider this PR for backporting? Would be very nice to have, and doesn't seem to depend on anything new.

@christarazi
Copy link
Member

I think it's a good idea for debugging / observability, so why not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants