Skip to content

Conversation

liyihuang
Copy link
Contributor

Lowering the log severity from warn to info for cases where stale data is received from AWS. As discussed in #36428 and #41278, stale metadata can cause temporary IP calculation issues, but the operator is expected to eventually reconcile correctly. These warnings have been observed to cause CI failures, despite being ok in Datadog's environment

Fixes: #36428

lower log severity for stale metadata to avoid CI issue

@liyihuang liyihuang added the backport/1.18 This PR represents a backport for Cilium 1.18.x of a PR that was merged to main. label Aug 26, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 26, 2025
@liyihuang liyihuang changed the title pkg/aws: Lower log level from warn to info for AWS data fetches Lower log level for Cloud/AWS data fetches Aug 26, 2025
@liyihuang liyihuang added release-note/misc This PR makes changes that have no direct user impact. release-note/ci This PR makes changes to the CI. labels Aug 26, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 26, 2025
@liyihuang
Copy link
Contributor Author

/test

@liyihuang liyihuang marked this pull request as ready for review August 27, 2025 02:01
@liyihuang liyihuang requested a review from a team as a code owner August 27, 2025 02:01
@liyihuang liyihuang requested review from pippolo84 and gandro and removed request for gandro August 27, 2025 02:01
@liyihuang
Copy link
Contributor Author

@gandro just mention you here so you are aware of this PR

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! One bit of feedback

Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

LGTM minus Sebastian's suggestion, waiting for the comments to be improved for the final ack. ✅

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 27, 2025
@pippolo84 pippolo84 removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 27, 2025
@liyihuang liyihuang force-pushed the pr/liyi/lower_eni_logs branch from aa2a0b0 to 5505316 Compare August 27, 2025 15:25
@liyihuang liyihuang requested a review from pippolo84 August 27, 2025 15:25
@liyihuang
Copy link
Contributor Author

/test

Lowering the log severity from warn to info for cases where stale data is
received from AWS. As discussed in cilium#36428
and cilium#41278, stale metadata can cause
temporary IP calculation issues, but the operator is expected to eventually
reconcile correctly. These warnings have been observed to cause CI failures,
despite being ok in Datadog's environment

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
@liyihuang liyihuang force-pushed the pr/liyi/lower_eni_logs branch from 5505316 to b8ff65d Compare August 28, 2025 02:03
@gandro gandro added needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch and removed backport/1.18 This PR represents a backport for Cilium 1.18.x of a PR that was merged to main. labels Aug 28, 2025
@gandro
Copy link
Member

gandro commented Aug 28, 2025

/test

@liyihuang liyihuang added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 28, 2025
@joestringer
Copy link
Member

joestringer commented Aug 28, 2025

I'll merge this as it looks like it will help with CI stability, but I do wonder whether we should iterate on the messages being provided to the user here. These will be "info" level logs but if I look at a message like Unable to create interface on instance, I don't think it expresses to the user very clearly what happens next. It sounds like an error but there's no indication of whether Cilium will take action to retry/mitigate or whether Cilium expects the user to take action for instance. This is minor feedback but could help to assure users who encounter this scenario, and reduce potential future bug reports due to users being confused what it means and what they should do about it.

@joestringer joestringer added this pull request to the merge queue Aug 28, 2025
Merged via the queue into cilium:main with commit 347c8fc Aug 28, 2025
68 checks passed
@liyihuang
Copy link
Contributor Author

I'll merge this as it looks like it will help with CI stability, but I do wonder whether we should iterate on the messages being provided to the user here. These will be "info" level logs but if I look at a message like Unable to create interface on instance, I don't think it expresses to the user very clearly what happens next. It sounds like an error but there's no indication of whether Cilium will take action to retry/mitigate or whether Cilium expects the user to take action for instance. This is minor feedback but could help to assure users who encounter this scenario, and reduce potential future bug reports due to users being confused what it means and what they should do about it.

I totally understand where you concern comes from.

I'm not sure if you got any time to take a look at the discussion in #41278.

Here is the summary.

we think this is an error casued by AWS providing eventually consistent so we incorrectly create the new ENI or assign IP address. in datadog environment, they mentioned the following.

I did manage to find such logs, but they are extremely rare events (I found 13 PrivateIpAddressLimitExceeded errors out of 492,374 AssignPrivateIpAddresses events) and have no significative impact as the operator does eventually reconcile the state as described in the issue.

From user perspective, I think it is an error but it will get to healthy status within 1 min(IPAM has periodically sync with AWS). Hopefully users will not expereince the issue to look into this message.

I could look into it in the future if time allows and the fix doesn't complicate the code too much.

@viktor-kurchenko viktor-kurchenko added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Sep 2, 2025
@viktor-kurchenko viktor-kurchenko mentioned this pull request Sep 2, 2025
4 tasks
@viktor-kurchenko viktor-kurchenko added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Sep 2, 2025
@viktor-kurchenko viktor-kurchenko added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Sep 2, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Sep 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Conformance EKS - no-errors-in-logs - Unable to assign additional IPs to interface, will create new interface
5 participants