
Explore improved Envoy or Rust ztunnel implementation #40956

@ejj

Description


The current experimental implementation of ztunnel suffers from a number of issues we'd like to address in the long term. A brief summary:

  • xDS efficiency: The xDS configuration of Envoy relies on copying the entire config pipeline per workload. We expect this to make it unscalable for large clusters. It also makes xDS updates more expensive than they should be.
  • Multi-tenancy: At time of writing, we don't have good measurements for how well the current implementation handles multi-tenancy issues (like noisy neighbors) in the ztunnel case.
  • Resource cost: Again, more measurements are needed, but due to the xDS efficiency issues mentioned above, we expect the RAM/CPU/latency costs of ztunnel to be higher than we'd like.
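
To make the scaling concern in the first bullet concrete, here is a rough back-of-the-envelope model in Rust. It is only a sketch: the function names, the notion of a "pipeline object", and the shared-pipeline alternative are all assumptions made for illustration, not measurements or an actual Envoy or Istio API.

```rust
/// Hypothetical model: config objects held when every workload gets its
/// own full copy of the config pipeline (the situation described above).
pub fn per_workload_copies(workloads: u64, pipeline_objects: u64) -> u64 {
    workloads * pipeline_objects
}

/// Hypothetical alternative (an assumption, not an existing design):
/// one shared pipeline plus a single small entry per workload.
pub fn shared_pipeline(workloads: u64, pipeline_objects: u64) -> u64 {
    pipeline_objects + workloads
}

/// Objects that must be re-sent when one pipeline object changes.
/// With per-workload copies, every copy is touched; with a shared
/// pipeline, only the one shared object is.
pub fn objects_touched_by_update(workloads: u64, shared: bool) -> u64 {
    if shared { 1 } else { workloads }
}
```

With illustrative inputs of 1,000 workloads and a 20-object pipeline, the per-workload model holds 20,000 objects while the shared model holds 1,020. The gap grows linearly with the pipeline size.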

To address these issues, there are two paths forward we need to explore:

Evolving Envoy to better support the ztunnel

Envoy needs additional primitives that allow us to express the ztunnel more efficiently. Things like a many-to-many HBONE transport socket, more efficient ways to set/retrieve metadata from HBONE headers without internal listeners, and better multi-tenancy properties.

Build a new L4 proxy

Depending on how heavy a lift the Envoy changes we need are, it could be that we're better off just building a new purpose-built proxy. We'd likely do this in Rust, given its memory safety properties and the existing high-quality proxy implementations written in the language (linkerd).
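
For a sense of scale, the core of an L4 proxy (accept a connection, dial an upstream, copy bytes both ways) is small. The sketch below uses only the Rust standard library and is not a proposed ztunnel design: it has no HBONE, no mTLS, and no policy, and it uses one thread pair per connection purely to stay self-contained. The function names `pipe`, `forward`, and `serve` are hypothetical.

```rust
use std::io;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

/// Copy bytes from `from` to `to` until EOF, then half-close `to`
/// so the peer sees the end of the stream.
fn pipe(mut from: TcpStream, mut to: TcpStream) {
    let _ = io::copy(&mut from, &mut to);
    let _ = to.shutdown(Shutdown::Write);
}

/// Forward one accepted client connection to `upstream_addr`,
/// relaying traffic in both directions until both sides finish.
fn forward(client: TcpStream, upstream_addr: &str) -> io::Result<()> {
    let upstream = TcpStream::connect(upstream_addr)?;
    let client_rd = client.try_clone()?;
    let upstream_wr = upstream.try_clone()?;
    // client -> upstream runs on a helper thread.
    let up = thread::spawn(move || pipe(client_rd, upstream_wr));
    // upstream -> client runs on the current thread.
    pipe(upstream, client);
    let _ = up.join();
    Ok(())
}

/// Accept connections on `listener` forever and forward each one
/// to `upstream_addr` on its own thread.
pub fn serve(listener: TcpListener, upstream_addr: String) {
    for conn in listener.incoming() {
        if let Ok(client) = conn {
            let addr = upstream_addr.clone();
            thread::spawn(move || {
                let _ = forward(client, &addr);
            });
        }
    }
}
```

A caller would bind a `TcpListener` and pass it to `serve` along with the upstream address. The hard parts of a real ztunnel (tunneling, identity, multi-tenancy isolation, and efficient configuration) sit on top of this loop, which is where the engineering estimate would need to focus.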

Path Forward

Overall I expect work on this task to proceed in the following steps:

  • Gather requirements for the ztunnel.
  • Measure the current state of ztunnel against those requirements.
  • Develop an engineering plan for an Envoy ztunnel implementation, and estimate the engineering effort required.
  • Develop an engineering plan for a Rust implementation, and estimate the engineering effort required.
  • Build one or both.

Affected product area (please put an X in all that apply)

[x] Ambient
[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Additional context

Metadata


Assignees

Labels

area/ambient: Issues related to ambient mesh

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
