fix: temporary fix for intermittent CI issue #41140
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is the temporary fix for ENI mode to avoid the sync from AWS getting the stale data since we would like to see if this can fix the CI issue for #36428 based on the discussion here #36428 (comment).
If so, we will change how alibaba and azure behavior like PR #37650 and remove this instanceSync trigger sync from here completely.
I have tested skip this trigger sync with draft PR #40760 for a few times and it seems to be working.
We hope we can test this for 2-4 weeks so we can know if the CI issue got fixed. if we dont see the CI issue again, I will have the follow up PR to change Azure and Alibaba cloud and remove this trigger.
I know we just released 1.18.0 so I think we are safe and will not include this temporary change in any release.
Feel free to let me know I should do it in another way.
I have thought different ways to resolve this issue.
The first way is to read after write. Since AWS is eventually consistent, we could retry and read the data from AWS at here https://github.com/cilium/cilium/blob/main/pkg/aws/ec2/ec2.go#L806 through
DescribeNetworkInterfaces
to ensure that we will get the updated data from AWS. I have the following concerns for this wayNewDescribeInstancesPaginator
, I'm not sure if I will still get stale info throughNewDescribeInstancesPaginator
whileDescribeNetworkInterfaces
is returning the latest.The second way is just to skip this sync since we already updated it in our cache through PR #37650.
I have read the AWS CNI code and it seems they also just add to their cache https://github.com/aws/amazon-vpc-cni-k8s/blob/707708191e2dcb7ebe1a642a119b4e01afa9f29f/pkg/ipamd/ipamd.go#L1092-L1149 and dont see they sync again from AWS like what we do here.
I have another PR on my local laptop where I'm changing Alibaba and Azure code. I also found Azure using poller, it looks like that's how they resolve the eventual consistency issue. However, I never worked with Azure before so I'm not sure if I understand it correctly.
I believe the latter way is better and align with AWS CNI implementation. that's why I open the PR this way.
Fixes: #36428