Skip to content

Conversation

liyihuang
Copy link
Contributor

@liyihuang liyihuang commented Aug 13, 2025

This is the temporary fix for ENI mode to avoid the sync from AWS getting the stale data since we would like to see if this can fix the CI issue for #36428 based on the discussion here #36428 (comment).

If so, we will change how alibaba and azure behavior like PR #37650 and remove this instanceSync trigger sync from here completely.

I have tested skip this trigger sync with draft PR #40760 for a few times and it seems to be working.

We hope we can test this for 2-4 weeks so we can know if the CI issue got fixed. if we dont see the CI issue again, I will have the follow up PR to change Azure and Alibaba cloud and remove this trigger.

I know we just released 1.18.0 so I think we are safe and will not include this temporary change in any release.

Feel free to let me know I should do it in another way.

I have thought different ways to resolve this issue.

The first way is to read after write. Since AWS is eventually consistent, we could retry and read the data from AWS at here https://github.com/cilium/cilium/blob/main/pkg/aws/ec2/ec2.go#L806 through DescribeNetworkInterfaces to ensure that we will get the updated data from AWS. I have the following concerns for this way

  • we will have to do this for all writing APIs
  • we also read the ENI info from EC2 API like NewDescribeInstancesPaginator, I'm not sure if I will still get stale info through NewDescribeInstancesPaginator while DescribeNetworkInterfaces is returning the latest.

The second way is just to skip this sync since we already updated it in our cache through PR #37650.

I have read the AWS CNI code and it seems they also just add to their cache https://github.com/aws/amazon-vpc-cni-k8s/blob/707708191e2dcb7ebe1a642a119b4e01afa9f29f/pkg/ipamd/ipamd.go#L1092-L1149 and dont see they sync again from AWS like what we do here.

I have another PR on my local laptop where I'm changing Alibaba and Azure code. I also found Azure using poller, it looks like that's how they resolve the eventual consistency issue. However, I never worked with Azure before so I'm not sure if I understand it correctly.

I believe the latter way is better and align with AWS CNI implementation. that's why I open the PR this way.

Fixes: #36428

 

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 13, 2025
@liyihuang liyihuang force-pushed the pr/liyi/ci-eks-36428 branch from 3379d9f to 64b64de Compare August 14, 2025 13:36
@liyihuang
Copy link
Contributor Author

/test

this is the temporary fix for ENI mode to avoid the sync from AWS getting the stale data since
we would like to see if this can fix the CI issue for cilium#36428
based on the discussion here cilium#36428 (comment)
if so, we will change how alibaba and azure behavior like PR cilium#37650
and remove this instanceSync trigger sync from here completely.
I have tested skip this trigger sync with draft PR cilium#40760 for a few times
and it seems to be working. we hope we can test this for 2-4 weeks so we can know if the CI issue got fixed. if we dont see the CI issue again, we will have the follow up PR to change Azure and Alibaba cloud and remove this trigger.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
@liyihuang liyihuang force-pushed the pr/liyi/ci-eks-36428 branch from 64b64de to fcbc9b4 Compare August 15, 2025 19:45
@liyihuang
Copy link
Contributor Author

/test

1 similar comment
@liyihuang
Copy link
Contributor Author

/test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont-merge/needs-release-note-label The author needs to describe the release impact of these changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Conformance EKS - no-errors-in-logs - Unable to assign additional IPs to interface, will create new interface
1 participant