Skip to content

Conversation

alan-kut
Copy link
Contributor

@alan-kut alan-kut commented Nov 3, 2023

Rate limiting work queue has rate limiting only when AddRateLimited method is used. It doesn't accept minimum delay so instead delaying queue with separate rate limitter is used.

Fixes rate limiting for CES Controller

@alan-kut alan-kut requested a review from a team as a code owner November 3, 2023 08:37
@alan-kut alan-kut requested a review from pippolo84 November 3, 2023 08:37
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 3, 2023
@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 3, 2023

/test

Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 💯

@pippolo84 pippolo84 added area/operator Impacts the cilium-operator component release-note/misc This PR makes changes that have no direct user impact. labels Nov 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 3, 2023
@pippolo84
Copy link
Member

/test

@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 6, 2023

Failed tests:

  • FAIL: TestConformance/HTTPRouteTimeoutBackendRequest/1_request_to_'/backend-timeout?delay=1s'_should_receive_a_504 (30.11s)
  • FAIL: TestConformance/HTTPRouteTimeoutRequest/1_request_to_'/request-timeout?delay=1s'_should_receive_a_504 (30.00s)
  • ❌ host-entity/pod-to-host/ping-ipv4-0: cilium-test/client-6b4b857d98-vmpgg (10.0.1.84) -> 10.224.0.5 (10.224.0.5:0)
    Error: Process completed with exit code 1.
  • ❌ 1/51 tests failed (0/535 actions), 14 tests skipped, 1 scenarios skipped:
    Test [no-interrupted-connections]:
    connectivity test failed: 1 tests failed
    Error: Process completed with exit code 1.
  • Kubernetes sig-network conformance test

@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 6, 2023

I don't see how the commit would break those tests. They would need to use network policies using CESs in the first place to be able to fail.

@alan-kut alan-kut force-pushed the pr-fix-ces-rate-limit branch from 7b19af2 to 9867008 Compare November 7, 2023 07:58
@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 7, 2023

/test

@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 7, 2023

Rebased, retested - all the tests passed

@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 7, 2023

Actually Cilium IPsec upgrade (ci-ipsec-upgrade) failed. I missed it before. But I don't recall it failing before the rebase.

@pippolo84
Copy link
Member

Travis seems to be stuck, I'm gonna close and reopen the PR to re-trigger it.

@pippolo84 pippolo84 closed this Nov 8, 2023
@pippolo84 pippolo84 reopened this Nov 8, 2023
@pippolo84
Copy link
Member

All required tests are green, marking ready-to-merge.

@alan-kut
Copy link
Contributor Author

alan-kut commented Nov 9, 2023

Please wait with merging. I'm performing final tests of the change.

@pippolo84 pippolo84 added the dont-merge/waiting-for-user-feedback Waiting for feedback from user before the PR should be merged. label Nov 9, 2023
@pippolo84
Copy link
Member

Please wait with merging. I'm performing final tests of the change.

Hey Alan, just out of curiosity, can you share more details about what kind of testing you are doing? It could be an inspiration to improve the testing performed in the CI.

@alan-kut alan-kut force-pushed the pr-fix-ces-rate-limit branch from 9867008 to 638ad3c Compare November 10, 2023 10:19
@alan-kut
Copy link
Contributor Author

I have two manual tests for verification right now:

  1. From Full CES Reconciliation work - go program that triggers in a loop changes in number of deployments verifying CESs are properly updated and then changes in CESs verifying CESs are properly reconciliated.
  2. I created cluster with 1k nodes, deployment with 200k pods and then updated the label in the namespace to trigger massive CEPs updates leading to massive CESs updates and veryfing rate for CESs changes is as expected.

Unfortunately it wasn't. The code was doing rate limiting before item is added to the workqueue. So if CES has 50 CEPs and we have change for all of them we may process CES once (if all 50 CEPs events happened while CES waited in the queue) but we will use 50 rate limit tokens.

I fixed the code one more time. Exponential backoff is back in the queue and now rate limit is enforced before processing next item.

Sorry for confusion

Rate limiting work queue has rate limiting only when AddRateLimited
method is used. It doesn't accept minimum delay so instead delaying
queue with separate rate limitter is used.

Signed-off-by: Alan Kutniewski <kutniewski@google.com>
@alan-kut alan-kut force-pushed the pr-fix-ces-rate-limit branch from 638ad3c to 9779d63 Compare November 10, 2023 10:34
@alan-kut
Copy link
Contributor Author

/test

Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Approving the last version of the changes! 🚀

@pippolo84 pippolo84 removed the dont-merge/waiting-for-user-feedback Waiting for feedback from user before the PR should be merged. label Nov 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 10, 2023
@squeed squeed merged commit fc02f40 into cilium:main Nov 13, 2023
skmatti pushed a commit to skmatti/cilium that referenced this pull request Jul 24, 2024
Rate limiting work queue has rate limiting only when AddRateLimited
method is used. It doesn't accept minimum delay so instead delaying
queue with separate rate limitter is used.

Change-Id: If3df7aabd54b82ce8c1738df204f9fec887f2c8e
Signed-off-by: Alan Kutniewski <kutniewski@google.com>
Upstream-PR: cilium#28963
Bug: b/309596274
Reviewed-on: https://gke-internal-review.googlesource.com/c/third_party/cilium/+/864762
Tested-by: Prow_Bot_V2 <425329972751-compute@developer.gserviceaccount.com>
Unit-Verified: Prow_Bot_V2 <425329972751-compute@developer.gserviceaccount.com>
Reviewed-by: Prow_Bot_V2 <425329972751-compute@developer.gserviceaccount.com>
Reviewed-by: Akhil Velagapudi <avelagap@google.com>
@pchaigno pchaigno added the feature/ces Impacts the Cilium Endpoint Slice logic. label Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component feature/ces Impacts the Cilium Endpoint Slice logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants