pkg/aws: check ENI ips with AWS #41278
Conversation
Force-pushed from 09f5885 to 7b84c44
/test
Added a new function `waitForENIAddressState` to handle IP address state verification on ENI using an exponential backoff mechanism. This improves the reliability of IP address assignments and removals by ensuring that the operations are confirmed with AWS before proceeding. Updated the `AssignPrivateIpAddresses` and `UnassignPrivateIpAddresses` methods to utilize this new function for better error handling and state management. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
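A minimal sketch of what such a verification loop could look like, assuming the AWS SDK Go v2 EC2 client (imports: context, time, github.com/aws/aws-sdk-go-v2/service/ec2). The function name follows the commit message, but the backoff parameters and body are illustrative, not the PR's actual implementation:

// Illustrative sketch only (not the PR's actual code): poll
// DescribeNetworkInterfaces with exponential backoff until the given IPs
// are visible on (or gone from) the ENI, as described in the commit message.
func waitForENIAddressState(ctx context.Context, client *ec2.Client, eniID string, ips []string, expectAssigned bool) error {
    delay := 500 * time.Millisecond // assumed initial backoff
    const maxDelay = 8 * time.Second // assumed cap
    for {
        out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
            NetworkInterfaceIds: []string{eniID},
        })
        if err != nil {
            return err
        }
        // Collect the IPs AWS currently reports on the ENI.
        assigned := map[string]bool{}
        for _, eni := range out.NetworkInterfaces {
            for _, addr := range eni.PrivateIpAddresses {
                if addr.PrivateIpAddress != nil {
                    assigned[*addr.PrivateIpAddress] = true
                }
            }
        }
        done := true
        for _, ip := range ips {
            if assigned[ip] != expectAssigned {
                done = false
                break
            }
        }
        if done {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
        }
        if delay < maxDelay {
            delay *= 2 // exponential backoff
        }
    }
}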
Force-pushed from 7b84c44 to 7b7b3aa
/test
/ci-eks
7 similar comments: /ci-eks
I have run ci-eks more than 50 times and don't see the issue again. I have had a script keeping https://github.com/cilium/cilium/actions/runs/17101497696/job/48682108893 running for the last 3 days, and I only saw a few failures (one IPsec, another 2-3 related to quay.io). Hopefully this will fix the issue. If this gets merged to main, I will keep monitoring CI failures and then make other changes to resolve this type of issue for AWS and other clouds.
Can you put this new behavior behind a new operator flag so it can be selectively enabled or disabled? With the current implementation this will cause at least one new DescribeNetworkInterfaces call any time IPs are assigned or unassigned from an ENI, which is quite costly in terms of API rate limiting, when the operator already makes a fairly high volume of DescribeNetworkInterfaces calls.
From looking at CloudTrail logs, it looks like on average the volume of assign/unassign calls made by the operator represents around 40% of the volume of DescribeNetworkInterfaces calls. So this PR would lead to an increase of at least 40% in DescribeNetworkInterfaces calls, which is quite significant.
readInput := &ec2.DescribeNetworkInterfacesInput{
    NetworkInterfaceIds: []string{eniID},
}
readOutput, err := c.ec2Client.DescribeNetworkInterfaces(ctx, readInput)
Can you use the standard self-ratelimiting and observability wrapper on this API call?
c.limiter.Limit(ctx, DescribeNetworkInterfaces)
sinceStart := spanstat.Start()
readOutput, err := c.ec2Client.DescribeNetworkInterfaces(ctx, readInput)
c.metricsAPI.ObserveAPICall(DescribeNetworkInterfaces, deriveStatus(err), sinceStart.Seconds())
Sure. I have to admit this PR has some rough edges, and I hope to have a discussion to find the best approach to address the issue.
Thanks for the input. I don't have insight into the call volume. My worry about the flag is that this is a bug fix rather than a feature. If we feel there are too many calls, do you feel skipping the sync is an overall better approach than read-after-write? Do you observe #36428 in your environment?
@HadrienPatte it's not directly related to this issue but could be relevant to your environment for AWS calls: I can see our AWS call volume is pretty wild, since we pull all VPCs/subnets etc. by default. In theory, we could figure out the VPC that the operator is sitting in and filter all resources by the operator's VPC. Hopefully, that could reduce the number of calls. As I mentioned in the PR, this is just the first attempt to solve the CI issue. If it can help, I'm planning to go through all AWS calls and implement read-after-write.
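As a purely illustrative sketch of that VPC-filtering idea (not existing operator code; vpcID is a placeholder, types is github.com/aws/aws-sdk-go-v2/service/ec2/types and aws is github.com/aws/aws-sdk-go-v2/aws), the describe call could be narrowed with a server-side vpc-id filter:

// Hypothetical: limit DescribeNetworkInterfaces to the operator's own VPC
// instead of listing interfaces across all VPCs.
input := &ec2.DescribeNetworkInterfacesInput{
    Filters: []types.Filter{
        {
            Name:   aws.String("vpc-id"),
            Values: []string{vpcID}, // placeholder: the operator's VPC ID
        },
    },
}
out, err := c.ec2Client.DescribeNetworkInterfaces(ctx, input)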
Thank you for your continued work on this ❤️ HadrienPatte already did an excellent review, +1 to all the feedback they gave. I am however also very worried about the additional API overhead this introduces. I've left a more in-depth response in the issue (#36428 (comment)):
Of course, the "skipping the sync" solution would bring the highest benefit to the user, but if it is not worth the complexity, then I would consider lowering the severity of the log message.
I will look into skipping the sync to figure out the effort. The main reason I pushed the read-after-write commit is that it was the easiest for me to implement, and I also saw that Azure's API has the poller, which I guess does this type of reading (https://github.com/cilium/cilium/blob/main/pkg/azure/api/api.go#L633). My main concern was that a logic change in IPAM will affect all clouds, and it takes longer for me to understand the codebase.
Right. Note that this is a different kind of polling - this just waits for the API to complete the operation (because it seems Azure is async by default). It does not verify whether the operation is also visible in the state returned by API calls that describe the IPs assigned to an interface. Because the latter is a much more expensive operation, it likely has different load and performance characteristics than polling the operation state (but I have not confirmed whether this is true).
Maybe a potential solution would be to reduce the log severity from warning to info for now, giving you more time to dive into the rest of the codebase? Once we have a proper fix, we can make it a warning again.
I did manage to find such logs, but they are extremely rare events (I found 13).
We have many clusters, with varying levels of pod churn, some using prefix delegation and others not using it. Overall my personal take is that even though there is a non-nominal behavior here, the potential cost of fixing it, compared to the impact it has today, would not justify fixing it in this way. Coming back to the initial issue raised in #36428, which is about the CI flakiness induced by this behavior, I think an acceptable solution could be to lower the severity of these events (which are already only logged as warnings) or to update the CI to tolerate them so they don't make the CI fail anymore (which can be configured here, I believe).
Thanks for that production data; it's good that you see those events but overall things still work fine. My concern is that the stale data from AWS is just our assumption; it's hard to know the impact of the issue that we observed in CI. @gandro with that data from Datadog, do you think we should modify our code (I'm even thinking about reverting my previous PR #37650) or just lower the severity of the log? Based on my previous PR related to skipping the sync, here are the areas where changes would be required:
I'm happy to take this on; I just wanted to outline the additional code changes required so we can evaluate whether this approach is worthwhile overall. I also noticed another EKS CI issue, #37948, which seems to be caused by stale data as well, at least based on my initial review. Before working on this, I reviewed the AWS CNI code and spoke with a maintainer of Alibaba Cloud's CNI. Neither implementation includes this kind of sync logic. In fact, the AWS CNI queries the instance metadata server to verify IP addresses directly. I've tried to gather and lay out everything I've learned so far; hopefully you can help me assess what would be the best path forward. @HadrienPatte I would also appreciate it if you have some insights.
Thank you @HadrienPatte for sharing your insights, very useful! @liyihuang I agree, let's lower the severity for now (which is also a change we can easily backport). I would keep #37650 as I don't think it causes any overhead on the API and I do believe it does reduce the likelihood of us attempting to over-allocate. What would be the reason you want to revert it?
I believe this is a different class of error, since this happens on the agent (after the operator has told it that it can use a certain ENI). In that race, the ENI is attached on the API (and presumably also on the underlying instance) and visible in the sync, but the local Linux kernel has not yet discovered the interface. This, I think, is just a race that is probably fixed by having the agent wait for the Linux network interface to appear. I think this is also an easier fix to implement than the stale API data issue.
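A rough sketch of that agent-side wait (purely illustrative; the function and its parameters are invented here, not existing Cilium code): poll the kernel until an interface with the ENI's MAC address shows up before using it.

// Illustrative only: block until the kernel exposes a network interface
// with the given MAC address, or the context is cancelled.
// Imports: context, net, strings, time.
func waitForLinuxInterface(ctx context.Context, mac string, interval time.Duration) error {
    for {
        ifaces, err := net.Interfaces()
        if err == nil {
            for _, iface := range ifaces {
                if strings.EqualFold(iface.HardwareAddr.String(), mac) {
                    return nil // interface is visible to the kernel
                }
            }
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(interval):
        }
    }
}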
Thanks for outlining the needed work. I am a bit unsure how to tackle this as well. In particular I am unsure if we really want to completely get rid of the sync - or if we only want to restructure the code in a way that it can correctly deal with stale information (i.e. detect stale state if certain operations fail and then wait for the sync to happen before continuing). My main worry about not doing any syncs at all (if that is what you are suggesting) is that it can put the operator in a non-recoverable state where it has wrong information (e.g. because of an operator accounting bug, or because a user has modified an ENI manually) and keeps doing actions based on wrong information, without any chance of the information being corrected. |
My main reason is to keep the same logic across the other operations and keep the logic simple.
I also have to admit my direct update did resolve my previous PR issue (I don't understand why/how it resolved it, if the stale data will override the local state anyway).
I'm not sure if I fully understand the whole thing: we have the operator pull the AWS data every 1 minute for everything (cilium/pkg/ipam/node_manager.go, lines 224 to 257 at cb307a5).
To me, skipping this sync is more like delaying the sync to some point up to 1 minute later; it doesn't sound that scary. Do I misunderstand it? I can also look into the AWS API to see whether we can guess we are getting stale data, or whether we have to put more logic in to determine it (for example, if we know we just assigned the IP and our internal IP count is higher than what we got from AWS, then we can guess the AWS data is stale and not override our state with it).
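A tiny sketch of that heuristic (a hypothetical helper, not existing code): if the local view holds strictly more IPs than the AWS response right after an assign, and every IP AWS reports is already known locally, the response is likely stale.

// looksStale is a hypothetical check: right after we assigned an address,
// a response that is a strict subset of our local view is likely stale.
func looksStale(localIPs, awsIPs map[string]struct{}) bool {
    if len(awsIPs) >= len(localIPs) {
        return false
    }
    for ip := range awsIPs {
        if _, ok := localIPs[ip]; !ok {
            // AWS reports an IP we don't know about; this is not simple staleness.
            return false
        }
    }
    return true
}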
In some cases I think (unless I'm missing something) we'll do another operation using the local state before the next sync. If that happens, we (temporarily) avoid working with stale data. I really don't feel strongly about #37650 - if you prefer to revert it, please go ahead (that is, once we've lowered the severity).
This is fine. If that is the plan I have no concerns!
I will take another look at this; it's been a long time and I may have forgotten the details. I also don't have a strong desire to revert it. I just feel we should keep our write operations using the same logic and don't want my code to confuse others in the future. I think these are the steps that I will perform:
@gandro do you think this is the correct plan, or do we not have to do steps 2-4 once we lower the severity, since we will eventually reconcile to the right state?
Lowering the log severity from warn to info for cases where stale data is received from AWS. As discussed in cilium#36428 and cilium#41278, stale metadata can cause temporary IP calculation issues, but the operator is expected to eventually reconcile correctly. These warnings have been observed to cause CI failures, despite being ok in Datadog's environment. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
#41389 opened to lower the severity
By "direct update" you mean expanding the approach taken in #37650? I think we really should do 1 because it's easy and reduces CI issues. As for 2-4, I think they are nice to have. We should do them if we (you) have the cycles and it doesn't complicate the code too much.
yes.
I will take a look if time allows. I'm asking mainly to understand what the right balance is.
[ upstream commit 347c8fc ] Lowering the log severity from warn to info for cases where stale data is received from AWS. As discussed in #36428 and #41278, stale metadata can cause temporary IP calculation issues, but the operator is expected to eventually reconcile correctly. These warnings have been observed to cause CI failures, despite being ok in Datadog's environment. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com>
This is a fix for ENI mode to avoid syncing stale data from AWS. We want to evaluate whether this change can address the CI issue described in #36428, based on the discussion here.
If this proves effective, we plan to apply similar changes to other cloud providers (AWS/Alibaba/Azure) to address this class of issues.
We hope to test this fix for 2–4 weeks to determine whether the CI issue reoccurs. If the issue no longer appears, I will follow up with a PR to apply the same approach to other cloud APIs.
Please let me know if there's a better way I should approach this.
Approaches considered
1. Read-after-write
Since AWS APIs are eventually consistent, we could retry and fetch the updated data using DescribeNetworkInterfaces here after performing a write, to ensure we get the most recent state.
Concerns: it's unclear whether NewDescribeInstancesPaginator might still return stale data even if DescribeNetworkInterfaces returns the updated state.
2. Skip the sync
Since we already update our internal cache in PR #37650, we could skip syncing from AWS in this case.
I reviewed the AWS CNI code and found that it similarly updates its internal cache without re-syncing from AWS:
https://github.com/aws/amazon-vpc-cni-k8s/blob/707708191e2dcb7ebe1a642a119b4e01afa9f29f/pkg/ipamd/ipamd.go#L1092-L1149
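As a rough illustration of that pattern (the types and names below are invented for this sketch, not the operator's actual data structures; import: sync): after a successful assign call, append the new IPs to a local view of the ENI instead of waiting for the next AWS resync.

type eniState struct {
    Addresses []string
}

type localCache struct {
    mu   sync.Mutex
    enis map[string]*eniState
}

// recordAssigned appends freshly assigned IPs to the local view of an ENI,
// mirroring the "update the cache, skip the immediate resync" idea.
func (c *localCache) recordAssigned(eniID string, ips []string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.enis[eniID] == nil {
        c.enis[eniID] = &eniState{}
    }
    c.enis[eniID].Addresses = append(c.enis[eniID].Addresses, ips...)
}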
After experimenting with PR #41140, I found that the read-after-write approach is easier to implement. Our IPAM code has more complex sync timing logic, and skipping the sync would require broader changes.
Fixes: #36428