XDS cache impacts propagation delay #40744

@yingzhuivy

Bug Description

This is more of a performance issue than a bug.

We noticed that enabling the XDS cache reduces CPU usage but significantly increases propagation delay because of cache lock contention. We ran the following tests:

setup

Control plane:

  • 1 istiod pod
  • memory 26 G, cpu 10 cores

Data plane:

  • 50 services and 500 pods
  • Changes are simulated by pod rotation (roughly 5 pods rotated every 10 seconds) and Istio config changes (one random virtual service routing change every 5 seconds)

results

I got the following results:

  • With all caches enabled (the default), the proxy convergence time (pilot_proxy_convergence_time) averaged 2.9s at P90 and 4.26s at P99.
  • With only the CDS cache disabled, the proxy convergence time averaged 2.1s at P90 and 3s at P99.
  • With both the CDS and RDS caches disabled (with this change patched), the proxy convergence time averaged 0.94s at P90 and 2.55s at P99. Some CPU throttling occurred, so I think that with more CPU the propagation delay (especially P99) should improve further.
  • With all caches disabled, the proxy convergence time averaged 0.85s at P90 and 2.39s at P99. CPU throttling was heavier than in the previous case, as expected.

I also enabled the mutex profile in a custom istiod build and saw that about 70% of the time (209s out of 300s) was spent waiting for the lruCache lock in the default case (all caches enabled):
[Screenshot: istiod mutex profile, captured 2022-08-29]

proposal

I think that in the short term we should fix this bug so that users can properly opt out of RDS caching: #40719
In the longer term, is any work planned to improve XDS caching? Some ideas:

  • Cache per XDS type. Right now it seems that all XDS computation shares the same cache; separating the caches might reduce lock contention.
  • Explore high-concurrency LRU caches (for example https://github.com/karlseguin/ccache) or other cache options.
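
To make the first idea concrete, here is a minimal sketch of a cache split per XDS type, assuming a simple mutex-guarded LRU as the building block. It is not Istio's implementation: `lru`, `perTypeCache`, and the type names passed to the constructor are hypothetical. Each type gets its own LRU and its own lock, so a CDS lookup never waits on a lock held for RDS.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is one key/value pair stored in the recency list.
type entry struct {
	key   string
	value []byte
}

// lru is a small mutex-guarded LRU. Get takes the exclusive lock as well,
// because a hit moves the entry to the front of the recency list.
type lru struct {
	mu       sync.Mutex
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func newLRU(capacity int) *lru {
	return &lru{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

func (c *lru) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

func (c *lru) Add(key string, value []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		// Evict the least recently used entry.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// perTypeCache keeps one independent LRU (and lock) per XDS type.
// The map is built once and never mutated, so picking a shard needs no lock.
type perTypeCache struct {
	byType map[string]*lru
}

func newPerTypeCache(capacity int, types ...string) *perTypeCache {
	c := &perTypeCache{byType: make(map[string]*lru, len(types))}
	for _, t := range types {
		c.byType[t] = newLRU(capacity)
	}
	return c
}

func (c *perTypeCache) Get(xdsType, key string) ([]byte, bool) {
	shard, ok := c.byType[xdsType]
	if !ok {
		return nil, false // unknown type: treated as a miss
	}
	return shard.Get(key)
}

func (c *perTypeCache) Add(xdsType, key string, value []byte) {
	if shard, ok := c.byType[xdsType]; ok {
		shard.Add(key, value)
	}
}

func main() {
	cache := newPerTypeCache(2, "CDS", "RDS", "EDS")
	cache.Add("CDS", "proxy-a", []byte("clusters"))
	cache.Add("RDS", "proxy-a", []byte("routes"))
	v, ok := cache.Get("RDS", "proxy-a")
	fmt.Println(string(v), ok)
}
```

Splitting by type only removes contention between types; many proxies requesting the same type at once would still serialize on that type's lock, which is where the second idea (a cache designed for high concurrency) would come in.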

Version

Istio 1.13.5

Additional Information

No response

Metadata

    Labels

    area/perf and scalability, lifecycle/automatically-closed, lifecycle/stale
