Closed
Labels: area/perf and scalability, lifecycle/automatically-closed, lifecycle/stale
Description
Bug Description
This is more of a performance issue than a bug.
We noticed that enabling the xDS cache reduces CPU usage but significantly increases propagation delay because of lock contention on the cache. We ran the following tests:
setup
Control plane:
- 1 istiod pod
- memory 26 G, cpu 10 cores
Data plane:
- 50 services and 500 pods
- Changes are simulated by pod rotation (roughly 5 pods rotated every 10 seconds) and Istio config changes (one random virtual service routing change every 5 seconds)
results
I got the following results:
- With all caches enabled (the default), proxy convergence time (pilot_proxy_convergence_time) averaged 2.9s at P90 and 4.26s at P99.
- With only the CDS cache disabled, proxy convergence time averaged 2.1s at P90 and 3s at P99.
- With both the CDS and RDS caches disabled (with this change patched), proxy convergence time averaged 0.94s at P90 and 2.55s at P99. Some CPU throttling occurred, so I expect propagation delay (especially P99) to improve further with more CPU.
- With all caches disabled, proxy convergence time averaged 0.85s at P90 and 2.39s at P99. As expected, CPU throttling was heavier than in the previous case.
I also enabled the mutex profile in a custom istiod build and saw that about 70% of the time (209s out of 300s) was spent waiting for the lruCache lock in the default case (all caches enabled).
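For anyone who wants to reproduce this kind of measurement, here is a minimal standalone sketch of collecting a Go mutex profile. It is not the istiod patch I used; `contend` and `mutexProfile` are hypothetical helpers that only illustrate the standard `runtime.SetMutexProfileFraction` and `pprof.Lookup("mutex")` calls, with a single shared `sync.Mutex` standing in for the cache lock.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contend makes several goroutines fight over one mutex, standing in for
// many concurrent xDS pushes hitting a single shared cache lock.
// It returns the final counter value (workers * iterations).
func contend(workers, iterations int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

// mutexProfile turns on mutex profiling, runs fn, and returns the
// human-readable mutex profile collected so far.
func mutexProfile(fn func()) (string, error) {
	// A rate of 1 reports every contention event; restore the old rate after.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	fn()

	var buf bytes.Buffer
	// debug=1 writes the profile as text instead of compressed protobuf.
	if err := pprof.Lookup("mutex").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := mutexProfile(func() { contend(8, 10000) })
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Println(out)
}
```

In a long-running server the same profile is usually read through the `net/http/pprof` handlers rather than dumped from `main`; the sketch prints it directly only to stay self-contained.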
proposal
For the short term, I think we should fix this bug so that users can properly opt out of RDS caching: #40719
For the longer term, is there any work planned to improve xDS caching? Some ideas:
- Cache per xDS type. Right now it seems that all xDS computation shares the same cache; separating the caches might reduce lock contention.
- Explore high-concurrency LRU caches (for example https://github.com/karlseguin/ccache) or other cache options.
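To make the first idea concrete, here is a rough sketch of a cache split by xDS type, where each type has its own map and its own lock so that CDS and RDS lookups never wait on each other. This is not Istio's actual cache API: `typedCache`, `shard`, and their methods are hypothetical names, and LRU eviction is left out to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// shard holds the entries for one xDS type behind its own lock.
type shard struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

// typedCache (hypothetical) keeps one shard per xDS type. The shards map is
// built once and only read afterwards, so it needs no lock of its own.
type typedCache struct {
	shards map[string]*shard
}

func newTypedCache(types ...string) *typedCache {
	c := &typedCache{shards: make(map[string]*shard, len(types))}
	for _, t := range types {
		c.shards[t] = &shard{entries: make(map[string][]byte)}
	}
	return c
}

// Get takes only the read lock of the shard for the requested type.
func (c *typedCache) Get(typ, key string) ([]byte, bool) {
	s, ok := c.shards[typ]
	if !ok {
		return nil, false
	}
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, found := s.entries[key]
	return v, found
}

// Add stores a value and reports false if the type has no shard.
func (c *typedCache) Add(typ, key string, value []byte) bool {
	s, ok := c.shards[typ]
	if !ok {
		return false
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[key] = value
	return true
}

// Clear drops every entry of one type without touching the other shards.
func (c *typedCache) Clear(typ string) {
	s, ok := c.shards[typ]
	if !ok {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries = make(map[string][]byte)
}

func main() {
	cache := newTypedCache("CDS", "RDS", "EDS")
	cache.Add("CDS", "proxy-a", []byte("cluster config"))
	cache.Add("RDS", "proxy-a", []byte("route config"))

	cache.Clear("RDS") // invalidating routes leaves the CDS entries in place

	_, cdsHit := cache.Get("CDS", "proxy-a")
	_, rdsHit := cache.Get("RDS", "proxy-a")
	fmt.Println("CDS hit:", cdsHit, "RDS hit:", rdsHit)
}
```

Whether this helps in practice depends on how evenly contention is spread across types. If most of the waiting comes from a single type, finer sharding by key or a concurrent LRU would be needed instead.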
Version
Istio 1.13.5
Additional Information
No response