pkg/aws: check ENI ips with AWS #41278
Conversation
Force-pushed from 09f5885 to 7b84c44
/test
Added a new function `waitForENIAddressState` to handle IP address state verification on ENI using an exponential backoff mechanism. This improves the reliability of IP address assignments and removals by ensuring that the operations are confirmed with AWS before proceeding. Updated the `AssignPrivateIpAddresses` and `UnassignPrivateIpAddresses` methods to utilize this new function for better error handling and state management. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
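A minimal sketch of what such a verification loop could look like, assuming the AWS SDK Go v2 EC2 client (imports: context, time, github.com/aws/aws-sdk-go-v2/service/ec2). The function name follows the commit message, but the backoff parameters and body are illustrative, not the PR's actual implementation:

// Illustrative sketch only (not the PR's actual code): poll
// DescribeNetworkInterfaces with exponential backoff until the given IPs
// are visible on (or gone from) the ENI, as described in the commit message.
func waitForENIAddressState(ctx context.Context, client *ec2.Client, eniID string, ips []string, expectAssigned bool) error {
    delay := 500 * time.Millisecond // assumed initial backoff
    const maxDelay = 8 * time.Second // assumed cap
    for {
        out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
            NetworkInterfaceIds: []string{eniID},
        })
        if err != nil {
            return err
        }
        // Collect the IPs AWS currently reports on the ENI.
        assigned := map[string]bool{}
        for _, eni := range out.NetworkInterfaces {
            for _, addr := range eni.PrivateIpAddresses {
                if addr.PrivateIpAddress != nil {
                    assigned[*addr.PrivateIpAddress] = true
                }
            }
        }
        done := true
        for _, ip := range ips {
            if assigned[ip] != expectAssigned {
                done = false
                break
            }
        }
        if done {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
        }
        if delay < maxDelay {
            delay *= 2 // exponential backoff
        }
    }
}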
Force-pushed from 7b84c44 to 7b7b3aa
/test
/ci-eks
7 similar comments: /ci-eks
I have run ci-eks more than 50 times and don't see the issue again. I have had a script keeping https://github.com/cilium/cilium/actions/runs/17101497696/job/48682108893 running for the last 3 days, and I only saw a few failures (one IPsec, another 2-3 related to quay.io). Hopefully this will fix the issue. If this gets merged to main, I will keep monitoring CI failures and then make other changes to resolve this type of issue for AWS and other clouds.
Can you put this new behavior behind a new operator flag so it can be selectively enabled or disabled? With the current implementation this will cause at least one new DescribeNetworkInterfaces call any time IPs are assigned or unassigned from an ENI, which is quite costly in terms of API rate limiting, when the operator already makes a fairly high volume of DescribeNetworkInterfaces calls.
From looking at CloudTrail logs, it looks like on average the volume of assign/unassign calls made by the operator represents around 40% of the volume of DescribeNetworkInterfaces calls. So this PR would lead to an increase of at least 40% in DescribeNetworkInterfaces calls, which is quite significant.
readInput := &ec2.DescribeNetworkInterfacesInput{
    NetworkInterfaceIds: []string{eniID},
}
readOutput, err := c.ec2Client.DescribeNetworkInterfaces(ctx, readInput)
Can you use the standard self-ratelimiting and observability wrapper on this API call?
c.limiter.Limit(ctx, DescribeNetworkInterfaces)
sinceStart := spanstat.Start()
readOutput, err := c.ec2Client.DescribeNetworkInterfaces(ctx, readInput)
c.metricsAPI.ObserveAPICall(DescribeNetworkInterfaces, deriveStatus(err), sinceStart.Seconds())
Sure. I have to admit this PR has some rough edges, and I hope to have a discussion to find the best approach to address the issue.
Thanks for the input. I don't have insight into the call volume. My worry about the flag is that this is a bug fix rather than a feature. If we feel there are too many calls, do you feel skipping the sync is an overall better approach than read-after-write? Do you observe #36428 in your environment?
@HadrienPatte it's not directly related to this issue but could be relevant to your environment for AWS calls: I can see our AWS call volume is pretty wild, since we pull all VPCs/subnets etc. by default. In theory, we could figure out the VPC that the operator is sitting in and filter all resources by the operator's VPC. Hopefully, that could reduce the number of calls. As I mentioned in the PR, this is just the first attempt to solve the CI issue. If it can help, I'm planning to go through all AWS calls and implement read-after-write.
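As a purely illustrative sketch of that VPC-filtering idea (not existing operator code; vpcID is a placeholder, types is github.com/aws/aws-sdk-go-v2/service/ec2/types and aws is github.com/aws/aws-sdk-go-v2/aws), the describe call could be narrowed with a server-side vpc-id filter:

// Hypothetical: limit DescribeNetworkInterfaces to the operator's own VPC
// instead of listing interfaces across all VPCs.
input := &ec2.DescribeNetworkInterfacesInput{
    Filters: []types.Filter{
        {
            Name:   aws.String("vpc-id"),
            Values: []string{vpcID}, // placeholder: the operator's VPC ID
        },
    },
}
out, err := c.ec2Client.DescribeNetworkInterfaces(ctx, input)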
Thank you for your continued work on this ❤️ HadrienPatte already did an excellent review, +1 to all the feedback they gave. I am however also very worried about the additional API overhead this introduces. I've left a more in-depth response in the issue (#36428 (comment)):
Of course, the "skipping the sync" solution would bring the highest benefit to the user, but if it is not worth the complexity, then I would consider lowering the severity of the log message.
I will look into skipping the sync to figure out the effort. The main reason I pushed the read-after-write commit is that it was the easiest for me to implement, and I also saw that Azure's API has the poller, which I guess does this type of reading (https://github.com/cilium/cilium/blob/main/pkg/azure/api/api.go#L633). My main concern was that a logic change in IPAM will affect all clouds, and it takes longer for me to understand the codebase.
Right. Note that this is a different kind of polling - this just waits for the API to complete the operation (because it seems Azure is async by default). It does not verify whether the operation is also visible in the state returned by API calls that describe the IPs assigned to an interface. Because the latter is a much more expensive operation, it likely has different load and performance characteristics than polling the operation state (but I have not confirmed whether this is true).
Maybe a potential solution would be to reduce the log severity from warning to info for now, giving you more time to dive into the rest of the codebase? Once we have a proper fix, we can make it a warning again.
I did manage to find such logs, but they are extremely rare events (I found 13).
We have many clusters, with varying levels of pod churn, some using prefix delegation and others not using it. Overall my personal take is that even though there is a non-nominal behavior here, the potential cost of fixing it, compared to the impact it has today, would not justify fixing it in this way. Coming back to the initial issue raised in #36428, which is about the CI flakiness induced by this behavior, I think an acceptable solution could be to lower the severity of these events (which are already only logged as warnings) or to update the CI to tolerate them so they don't make the CI fail anymore (which can be configured here, I believe).
Thanks for that production data; it's good that you see those events but overall things still work fine. My concern is that the stale data from AWS is just our assumption; it's hard to know the impact of the issue that we observed in CI. @gandro with that data from Datadog, do you think we should modify our code (I'm even thinking about reverting my previous PR #37650) or just lower the severity of the log? Based on my previous PR related to skipping the sync, here are the areas where changes would be required:
I'm happy to take this on; I just wanted to outline the additional code changes required so we can evaluate whether this approach is worthwhile overall. I also noticed another EKS CI issue, #37948, which seems to be caused by stale data as well, at least based on my initial review. Before working on this, I reviewed the AWS CNI code and spoke with a maintainer of Alibaba Cloud's CNI. Neither implementation includes this kind of sync logic. In fact, the AWS CNI queries the instance metadata server to verify IP addresses directly. I've tried to gather and lay out everything I've learned so far; hopefully you can help me assess what would be the best path forward. @HadrienPatte I would also appreciate it if you have some insights.
Thank you @HadrienPatte for sharing your insights, very useful! @liyihuang I agree, let's lower the severity for now (which is also a change we can easily backport). I would keep #37650 as I don't think it causes any overhead on the API and I do believe it does reduce the likelihood of us attempting to over-allocate. What would be the reason you want to revert it?
I believe this is a different class of error, since this happens on the agent (after the operator has told it that it can use a certain ENI). In that race, the ENI is attached on the API (and presumably also on the underlying instance) and visible in the sync, but the local Linux kernel has not yet discovered the interface. This, I think, is just a race that is probably fixed by having the agent wait for the Linux network interface to appear. I think this is also an easier fix to implement than the stale API data issue.
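A rough sketch of that agent-side wait (purely illustrative; the function and its parameters are invented here, not existing Cilium code): poll the kernel until an interface with the ENI's MAC address shows up before using it.

// Illustrative only: block until the kernel exposes a network interface
// with the given MAC address, or the context is cancelled.
// Imports: context, net, strings, time.
func waitForLinuxInterface(ctx context.Context, mac string, interval time.Duration) error {
    for {
        ifaces, err := net.Interfaces()
        if err == nil {
            for _, iface := range ifaces {
                if strings.EqualFold(iface.HardwareAddr.String(), mac) {
                    return nil // interface is visible to the kernel
                }
            }
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(interval):
        }
    }
}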
Thanks for outlining the needed work. I am a bit unsure how to tackle this as well. In particular I am unsure if we really want to completely get rid of the sync - or if we only want to restructure the code in a way that it can correctly deal with stale information (i.e. detect stale state if certain operations fail and then wait for the sync to happen before continuing). My main worry about not doing any syncs at all (if that is what you are suggesting) is that it can put the operator in a non-recoverable state where it has wrong information (e.g. because of an operator accounting bug, or because a user has modified an ENI manually) and keeps doing actions based on wrong information, without any chance of the information being corrected. |
My main reason is to keep the same logic across the other operations and keep the logic simple.
I also have to admit my direct update did resolve my previous PR issue (I don't understand why/how it resolved it, if the stale data will override the local state anyway).
I'm not sure if I fully understand the whole thing: we have the operator pull the AWS data every 1 minute for everything (cilium/pkg/ipam/node_manager.go, lines 224 to 257 at cb307a5).
To me, skipping this sync is more like delaying the sync to some point up to 1 minute later; it doesn't sound that scary. Do I misunderstand it? I can also look into the AWS API to see whether we can guess we are getting stale data, or whether we have to put more logic in to determine it (for example, if we know we just assigned the IP and our internal IP count is higher than what we got from AWS, then we can guess the AWS data is stale and not override our state with it).
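A tiny sketch of that heuristic (a hypothetical helper, not existing code): if the local view holds strictly more IPs than the AWS response right after an assign, and every IP AWS reports is already known locally, the response is likely stale.

// looksStale is a hypothetical check: right after we assigned an address,
// a response that is a strict subset of our local view is likely stale.
func looksStale(localIPs, awsIPs map[string]struct{}) bool {
    if len(awsIPs) >= len(localIPs) {
        return false
    }
    for ip := range awsIPs {
        if _, ok := localIPs[ip]; !ok {
            // AWS reports an IP we don't know about; this is not simple staleness.
            return false
        }
    }
    return true
}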
In some cases I think (unless I'm missing something) we'll do another operation using the local state before the next sync. If that happens, we (temporarily) avoid working with stale data. I really don't feel strongly about #37650 - if you prefer to revert it, please go ahead (that is, once we've lowered the severity).
This is fine. If that is the plan I have no concerns!
I will take another look at this; it's been a long time and I may have forgotten the details. I also don't have a strong desire to revert it. I just feel we should keep our write operations using the same logic and don't want my code to confuse others in the future. I think these are the steps that I will perform:
@gandro do you think this is the correct plan, or do we not have to do steps 2-4 once we lower the severity, since we will eventually reconcile to the right state?
Lowering the log severity from warn to info for cases where stale data is received from AWS. As discussed in cilium#36428 and cilium#41278, stale metadata can cause temporary IP calculation issues, but the operator is expected to eventually reconcile correctly. These warnings have been observed to cause CI failures, despite being ok in Datadog's environment. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
#41389 opened to lower the severity
By "direct update" you mean expanding the approach taken in #37650? I think we really should do 1 because it's easy and reduces CI issues. As for 2-4, I think they are nice to have. We should do them if we (you) have the cycles and it doesn't complicate the code too much.
yes.
I will take a look if time allows. I'm asking mainly to understand what the right balance is.
[ upstream commit 347c8fc ] Lowering the log severity from warn to info for cases where stale data is received from AWS. As discussed in #36428 and #41278, stale metadata can cause temporary IP calculation issues, but the operator is expected to eventually reconcile correctly. These warnings have been observed to cause CI failures, despite being ok in Datadog's environment. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com>
This is a fix for ENI mode to avoid syncing stale data from AWS. We want to evaluate whether this change can address the CI issue described in #36428, based on the discussion here.
If this proves effective, we plan to apply similar changes to other cloud providers (AWS/Alibaba/Azure) to address this class of issues.
We hope to test this fix for 2–4 weeks to determine whether the CI issue reoccurs. If the issue no longer appears, I will follow up with a PR to apply the same approach to other cloud APIs.
Please let me know if there's a better way I should approach this.
Approaches considered
1. Read-after-write
Since AWS APIs are eventually consistent, we could retry and fetch the updated data using DescribeNetworkInterfaces here after performing a write, to ensure we get the most recent state.
Concerns: it's unclear whether NewDescribeInstancesPaginator might still return stale data even if DescribeNetworkInterfaces returns the updated state.
2. Skip the sync
Since we already update our internal cache in PR #37650, we could skip syncing from AWS in this case.
I reviewed the AWS CNI code and found that it similarly updates its internal cache without re-syncing from AWS:
https://github.com/aws/amazon-vpc-cni-k8s/blob/707708191e2dcb7ebe1a642a119b4e01afa9f29f/pkg/ipamd/ipamd.go#L1092-L1149
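As a rough illustration of that pattern (the types and names below are invented for this sketch, not the operator's actual data structures; import: sync): after a successful assign call, append the new IPs to a local view of the ENI instead of waiting for the next AWS resync.

type eniState struct {
    Addresses []string
}

type localCache struct {
    mu   sync.Mutex
    enis map[string]*eniState
}

// recordAssigned appends freshly assigned IPs to the local view of an ENI,
// mirroring the "update the cache, skip the immediate resync" idea.
func (c *localCache) recordAssigned(eniID string, ips []string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.enis[eniID] == nil {
        c.enis[eniID] = &eniState{}
    }
    c.enis[eniID].Addresses = append(c.enis[eniID].Addresses, ips...)
}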
After experimenting with PR #41140, I found that the read-after-write approach is easier to implement. Our IPAM code has more complex sync timing logic, and skipping the sync would require broader changes.
Fixes: #36428