Skip to content

Conversation

antonipp
Copy link
Contributor

@antonipp antonipp commented Mar 4, 2025

Description

Follow-up to #14491

We noticed that DescribeNetworkInterfaces API calls are not properly paginated. The code uses ec2.NewDescribeNetworkInterfacesPaginator but:

  • it doesn't specify MaxResults in DescribeNetworkInterfacesInput,
  • nor does it specify any Limit in the Paginator options:
    paginator := ec2.NewDescribeNetworkInterfacesPaginator(c.ec2Client, input, func(o *ec2.DescribeNetworkInterfacesPaginatorOptions) {
      	  o.Limit = 1000
    })

This causes pages to have unlimited size because MaxResults ends up not being set: https://github.com/aws/aws-sdk-go-v2/blob/2d43b815f645eae163e7521c9789554dfe60e40c/service/ec2/api_op_DescribeNetworkInterfaces.go#L548-L552

This PR fixes the issue by adding MaxResults to the API calls in order to actually force pagination.

Testing

Before the PR

Test 1

I did some testing in an AWS account which has ~19000 ENIs. I deployed the Operator without --subnet-tags-filter and added logging to the describeNetworkInterfaces() function. The log showed that there was only 1 page with ~19000 results.

API call duration went through the roof (>10 seconds):
image
(comparison before vs after I removed --subnet-tags-filter)

The memory also more than doubled:
image

Test 2

We also tried deploying the Operator without --subnet-tags-filter in a larger account and got the following error from the AWS API:

{
  "error": "operation error EC2: DescribeNetworkInterfaces, https response error StatusCode: 400, RequestID: da1b1af3-4c8d-4f1d-86eb-e1acbe2c2925, api error OperationNotPermitted: This operation is not permitted.",
  "level": "warning",
  "msg": "Unable to synchronize EC2 interface list",
  "subsys": "eni",
  "time": "2025-02-19T17:48:07.983328963Z"
}

According to AWS docs:

If you have a large number of network interfaces, the operation fails unless you use pagination or one of the following filters: group-id, mac-address, private-dns-name, private-ip-address, subnet-id, or vpc-id.

Which is exactly what's happening here: the filters are not specified and pagination is disabled, so we get a 400 error from the API.

After the PR

I confirmed with logging, that the calls are now properly paginated:
image

Moreover, the duration of API calls dropped:
image

However, the memory usage stayed the same unfortunately:
image

This is due to the fact that all results are still all added to the same big slice:

result = append(result, output.NetworkInterfaces...)

However, the pagination is still beneficial, because we do not risk running into 400 errors from the API or be "susceptible to throttling and timeouts"

Notes

  • This issue also affects other API calls such as DescribeInstanceTypes, DescribeVpcs, DescribeSubnets, and DescribeSecurityGroups, which aren’t paginated properly either. However, the impact is most noticeable for ENIs, since they’re far more numerous and the fix will have the biggest effect there. We can address the other API calls in a separate PR.
  • I hard-coded the page size to 1000 elements. We can also make it configurable but I am not sure it's worth it?
  • This change is not very interesting for describeNetworkInterfacesByInstance() because the number of returned of results should be very small anyway. I added the change there for the sake of consistency.
  • I also noticed that the logic in GetDetachedNetworkInterfaces() is a bit strange: it actually uses pagination but the number of returned ENIs can exceed maxResults passed in the parameters. Mainly, this can happen here if the first page returns x results where x < maxResults and then the second page returns for example exactly maxResults, so the length of result will be maxResults + x > maxResults. The break should happen inside the loop on output.NetworkInterfaces. I think the fix for this problem is out of scope for this PR though.
ipam/aws: properly paginate Operator `DescribeNetworkInterfaces` AWS API calls in ENI IPAM mode in order to avoid throttling, timeouts and errors from the API

Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
@antonipp antonipp requested review from a team as code owners March 4, 2025 10:19
@antonipp antonipp requested review from thorn3r and doniacld March 4, 2025 10:19
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 4, 2025
@antonipp
Copy link
Contributor Author

antonipp commented Mar 4, 2025

/test

Copy link
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 thanks for the fix

@pchaigno pchaigno added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Mar 18, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Mar 18, 2025
@pchaigno pchaigno added area/operator Impacts the cilium-operator component area/eni Impacts ENI based IPAM. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Mar 18, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Mar 18, 2025
@pchaigno pchaigno added this pull request to the merge queue Mar 18, 2025
Merged via the queue into cilium:main with commit 7e21480 Mar 18, 2025
77 checks passed
@YutaroHayakawa YutaroHayakawa mentioned this pull request Mar 18, 2025
13 tasks
@YutaroHayakawa YutaroHayakawa added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Mar 18, 2025
@github-actions github-actions bot removed the backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. label Mar 21, 2025
@github-actions github-actions bot added the backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. label Mar 21, 2025
@HadrienPatte HadrienPatte deleted the ai/ec2-describenetworkinterfaces-pagination branch April 23, 2025 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eni Impacts ENI based IPAM. area/operator Impacts the cilium-operator component backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants