Skip to content

ipam/aws: update the assigned IPs from aws to instance map #37650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

liyihuang
Copy link
Contributor

@liyihuang liyihuang commented Feb 15, 2025

  • Extracting and returning the assigned IP addresses from aws
  • Adding a new method to update ENI IP addresses in the instances manager

Fixes: #36428

Fix the possible race condition caused by async update from aws to instance map in issue #36428 

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 15, 2025
@liyihuang
Copy link
Contributor Author

/test

@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from bd54d82 to 054974f Compare February 15, 2025 05:52
@liyihuang
Copy link
Contributor Author

/test

@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from 054974f to ad5066d Compare February 15, 2025 05:57
@liyihuang
Copy link
Contributor Author

/test

@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from ad5066d to dabe5cd Compare February 15, 2025 13:10
@liyihuang
Copy link
Contributor Author

I had the commit to print out the result to validate the result(print out the IP address getting from AWS and print out the IP in the ENI from before and after using AddIPsToENI in instance map). I pushed it to the PR and check it in the CI. Then I can see the log showing that the caculation is correct(available=5) right after the IP address has been added to the ENI. this should solve the issue.

2025-02-15T06:18:16.838766821Z time="2025-02-15T06:18:16.834201931Z" level=debug msg="Recalculated needed addresses" available=0 capacity=15 instanceID=i-0e41d8fadd6be8813 name=ip-192-168-174-15.ca-central-1.compute.internal remainingInterfaces=3 resyncNeeded="0001-01-01 00:00:00 +0000 UTC" subsys=ipam toAlloc=8 toRelease=0 used=0 waitingForPoolMaintenance=true
2025-02-15T06:18:17.211128260Z time="2025-02-15T06:18:17.210130985Z" level=warning msg="get ip from aws" assignedIPs="[192.168.175.89 192.168.189.173 192.168.160.142 192.168.178.174 192.168.181.39]" interfaceID=eni-0ded708bc192d31c9 subsys=eni
2025-02-15T06:18:17.211150690Z time="2025-02-15T06:18:17.210178056Z" level=warning msg="before add ip to eni" assignedIPs="[]" interfaceID=eni-0ded708bc192d31c9 subsys=eni
2025-02-15T06:18:17.211155250Z time="2025-02-15T06:18:17.210199046Z" level=warning msg="after add ip to eni" assignedIPs="[192.168.175.89 192.168.189.173 192.168.160.142 192.168.178.174 192.168.181.39]" interfaceID=eni-0ded708bc192d31c9 subsys=eni
2025-02-15T06:18:17.211159980Z time="2025-02-15T06:18:17.210231866Z" level=debug msg="Setting resync needed" instanceID=i-0e41d8fadd6be8813 name=ip-192-168-174-15.ca-central-1.compute.internal subsys=ipam
2025-02-15T06:18:17.211164480Z time="2025-02-15T06:18:17.210257987Z" level=debug msg="Recalculated needed addresses" available=5 capacity=15 instanceID=i-0e41d8fadd6be8813 name=ip-192-168-174-15.ca-central-1.compute.internal remainingInterfaces=2 resyncNeeded="2025-02-15 06:18:17.210240626 +0000 UTC m=+5.905289604" subsys=ipam toAlloc=3 toRelease=0 used=0 waitingForPoolMaintenance=false

I guess we should do similiar thing for prefix delegation but I will raise another PR for it later once we agree this is the best way to resolve the issue

@liyihuang
Copy link
Contributor Author

/test

@liyihuang
Copy link
Contributor Author

/ci-ipsec-e2e

@liyihuang
Copy link
Contributor Author

/ci-ipsec-upgrade

Copy link
Contributor

@doniacld doniacld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything wrong in this PR but since @gandro was involved in the discussion, it would be nice to have his review.

@gandro
Copy link
Member

gandro commented Feb 18, 2025

Apologies for the delay in review. Since (at least to my knowledge) we still don't understand the root cause of the stale data in the instances manager, I'm a bit worried that this papers over a real issue that might affect other API calls (e.g. attaching a new ENI) as well.

Edit: Discussed offline, apparently attaching a new ENI already uses this approach:

cilium/pkg/aws/eni/node.go

Lines 535 to 536 in e21d1e4

// Add the information of the created ENI to the instances manager
n.manager.UpdateENI(n.node.InstanceID(), eni)

Having said that, if we do not find the real root cause, I'm happy to merge this as a workaround.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more thinking about this approach, and I now agree that it is overall the right thing to do.

There is one potential race that I think needs to be fixed, but otherwise I think this will improve the situation of us miscalculating the number of available addresses.

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Feb 25, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 25, 2025
@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from dabe5cd to 4bd1599 Compare February 26, 2025 14:21
@liyihuang
Copy link
Contributor Author

/test

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@gandro
Copy link
Member

gandro commented Feb 26, 2025

Note that this needs a rebase for CI to be able to pass https://cilium.slack.com/archives/C7PE7V806/p1740452895682009

@liyihuang
Copy link
Contributor Author

Note that this needs a rebase for CI to be able to pass https://cilium.slack.com/archives/C7PE7V806/p1740452895682009

Great. I'm waiting for this PR get merged so I can work on my orginal PR(#36922).

I will also have the follow up PR to remove the IP and dealing with PD. Shall I open other issues? or I should open the agianst #36428

@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from 4bd1599 to 3411e23 Compare February 26, 2025 16:19
@liyihuang
Copy link
Contributor Author

/test

@liyihuang
Copy link
Contributor Author

/ci-clustermesh

@liyihuang
Copy link
Contributor Author

/ci-integration

@liyihuang
Copy link
Contributor Author

/ci-awscni

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 26, 2025
@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from 3411e23 to e4b6f0b Compare February 27, 2025 03:28
@liyihuang liyihuang requested a review from gandro February 27, 2025 03:42
@liyihuang
Copy link
Contributor Author

/test

@liyihuang
Copy link
Contributor Author

I re-request a review since I have a new commit but I don't have the permission to remove the ready to merge label. Please don't merge until I get a new approval

@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 27, 2025
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Remove logic looks good to me! One subjective non-blocking nit

- Extracting and returning the assigned IP addresses
- Adding new methods to update ENI IP addresses in the instances manager
- Update the IP to ENI after getting the IP from aws

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
@liyihuang liyihuang force-pushed the pr/liyihuang/update_ip_to_eni_directly branch from e4b6f0b to 0801cff Compare February 27, 2025 13:03
@liyihuang
Copy link
Contributor Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 27, 2025
@joestringer joestringer added this pull request to the merge queue Feb 27, 2025
Merged via the queue into cilium:main with commit 45326ef Feb 27, 2025
63 checks passed
@liyihuang liyihuang deleted the pr/liyihuang/update_ip_to_eni_directly branch March 6, 2025 11:54
@rastislavs rastislavs mentioned this pull request Mar 10, 2025
14 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Mar 10, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Mar 11, 2025
liyihuang added a commit to liyihuang/cilium that referenced this pull request Aug 13, 2025
this is the temporary fix for ENI mode to avoid the sync from AWS getting the stale data since
we would like to see if this can fix the CI issue for cilium#36428
based on the discussion here cilium#36428 (comment)
if so, we will change how alibaba and azure behavior like PR cilium#37650
and remove this instanceSync trigger sync from here completely.
I have tested skip this trigger sync with draft PR cilium#40760 for a few times
and it seems to be working. we hope we can test this for 2-4 weeks so we can know if the CI issue got fixed. if we dont see the CI issue again, we will have the follow up PR to change Azure and Alibaba cloud and remove this trigger.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
liyihuang added a commit to liyihuang/cilium that referenced this pull request Aug 14, 2025
this is the temporary fix for ENI mode to avoid the sync from AWS getting the stale data since
we would like to see if this can fix the CI issue for cilium#36428
based on the discussion here cilium#36428 (comment)
if so, we will change how alibaba and azure behavior like PR cilium#37650
and remove this instanceSync trigger sync from here completely.
I have tested skip this trigger sync with draft PR cilium#40760 for a few times
and it seems to be working. we hope we can test this for 2-4 weeks so we can know if the CI issue got fixed. if we dont see the CI issue again, we will have the follow up PR to change Azure and Alibaba cloud and remove this trigger.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
liyihuang added a commit to liyihuang/cilium that referenced this pull request Aug 15, 2025
this is the temporary fix for ENI mode to avoid the sync from AWS getting the stale data since
we would like to see if this can fix the CI issue for cilium#36428
based on the discussion here cilium#36428 (comment)
if so, we will change how alibaba and azure behavior like PR cilium#37650
and remove this instanceSync trigger sync from here completely.
I have tested skip this trigger sync with draft PR cilium#40760 for a few times
and it seems to be working. we hope we can test this for 2-4 weeks so we can know if the CI issue got fixed. if we dont see the CI issue again, we will have the follow up PR to change Azure and Alibaba cloud and remove this trigger.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eni Impacts ENI based IPAM. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Conformance EKS - no-errors-in-logs - Unable to assign additional IPs to interface, will create new interface
5 participants