Our journey switching from twemproxy to Envoy Redis Proxy (and some perf issues) #19436

Description

Some background: we have 20 Redis instances serving our API plus various other backend jobs (collectively, our "applications"). In front of this we had 10 twemproxy instances proxying traffic based on a consistent hash of the key. All of this runs within Kubernetes. Our applications access the proxy via an internal Kubernetes Service with a label selector. Our Redis instances are set up as a StatefulSet, with each pod in the set having a custom instance name, from redis-shard-0 through redis-shard-19.
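For reference, the stable shard names come from the StatefulSet's pod ordinals; a minimal sketch of that shape is below (image, labels, and the headless Service wiring are illustrative, and how the short redis-shard-N names resolve to the pods depends on the cluster's DNS/Service setup):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-shard            # pods are named redis-shard-0 ... redis-shard-19
spec:
  serviceName: redis-shard     # headless Service backing the StatefulSet
  replicas: 20
  selector:
    matchLabels:
      app: redis-shard
  template:
    metadata:
      labels:
        app: redis-shard
    spec:
      containers:
        - name: redis
          image: redis:6.2     # illustrative; use whatever Redis image/version you run
          ports:
            - containerPort: 6379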

twemproxy configuration

default:
  listen: 0.0.0.0:6379
  hash: fnv1a_64
  hash_tag: "{}"
  distribution: ketama
  auto_eject_hosts: false
  timeout: 400
  redis: true
  preconnect: true
  servers:
    - redis-shard-0:6379:1 redis-shard-0 
    - redis-shard-1:6379:1 redis-shard-1 
    - redis-shard-2:6379:1 redis-shard-2 
    - redis-shard-3:6379:1 redis-shard-3 
    - redis-shard-4:6379:1 redis-shard-4 
    - redis-shard-5:6379:1 redis-shard-5 
    - redis-shard-6:6379:1 redis-shard-6 
    - redis-shard-7:6379:1 redis-shard-7 
    - redis-shard-8:6379:1 redis-shard-8 
    - redis-shard-9:6379:1 redis-shard-9 
    - redis-shard-10:6379:1 redis-shard-10 
    - redis-shard-11:6379:1 redis-shard-11 
    - redis-shard-12:6379:1 redis-shard-12 
    - redis-shard-13:6379:1 redis-shard-13 
    - redis-shard-14:6379:1 redis-shard-14 
    - redis-shard-15:6379:1 redis-shard-15 
    - redis-shard-16:6379:1 redis-shard-16 
    - redis-shard-17:6379:1 redis-shard-17 
    - redis-shard-18:6379:1 redis-shard-18 
    - redis-shard-19:6379:1 redis-shard-19 

Envoy Redis Proxy

My first attempt was to set up a basic Envoy configuration with the Redis Proxy filter, also using 10 instances, figuring that someone had already worked out an appropriate scale and I might as well use the same. The setup looked like this (using the latest release, v1.21):

admin:
  access_log:
    name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
node:
  id: redis-envoy
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 6379 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: egress_redis
          settings:
            op_timeout: 1s
            enable_hashtagging: true
            enable_redirection: false
            enable_command_stats: false
          prefix_routes:
            catch_all_route:
              cluster: redis-cluster
  clusters:
    - name: redis-cluster
      connect_timeout: 0.5s
      type: STRICT_DNS
      dns_lookup_family: V4_ONLY
      lb_policy: RING_HASH
      load_assignment:
        cluster_name: redis-cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-0, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-1, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-2, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-3, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-4, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-5, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-6, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-7, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-8, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-9, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-10, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-11, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-12, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-13, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-14, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-15, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-16, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-17, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-18, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-19, port_value: 6379 }
cluster_manager:
  outlier_detection:
    event_log_path: /dev/stdout
layered_runtime:
  layers:
    - name: admin_layer
      admin_layer: {}

The ketama key distribution in twemproxy maps to the RING_HASH lb_policy in Envoy, and the other settings are quite similar, though I did increase the timeout somewhat to be conservative. I switched this over (changing the internal Service's selector from our twemproxy label to the redis-envoy label) during a low-traffic period between Christmas and New Year's, and everything seemed stable.
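The cutover itself is just that selector change on the internal Service; roughly like this (names and labels are illustrative, not our exact manifests):

apiVersion: v1
kind: Service
metadata:
  name: redis-proxy            # the Service our applications connect to
spec:
  selector:
    app: redis-envoy           # was the twemproxy label before the switch
  ports:
    - name: redis
      port: 6379
      targetPort: 6379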

Problem

When our traffic ramped back up in early January, we started experiencing latency issues. Redis queries that had been taking single-digit milliseconds were now timing out after 1 second. I switched back to twemproxy and researched our configuration a little more.

After a few days, I noticed that buffering isn't enabled by default, so I added the following to our redis proxy settings:

            max_buffer_size_before_flush: 1024
            buffer_flush_timeout: 0.003s
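For clarity, those two lines sit alongside the other connection pool options under the filter's settings block, which now reads:

          settings:
            op_timeout: 1s
            enable_hashtagging: true
            enable_redirection: false
            enable_command_stats: false
            max_buffer_size_before_flush: 1024
            buffer_flush_timeout: 0.003s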

This helped a little, but we still saw a spike in latency, so I switched back to twemproxy again.

Our p95 latency for a PUT operation on Redis:
[Screenshot, 2022-01-05: p95 latency for Redis PUT operations]
On the left you can see it exceeded 1 second, and then at the right you can see there was still a spike even after adding the buffering settings.

What I suspect was the issue here is that the average number of connected clients per Redis instance jumped from 11 (10 twemproxy instances + 1 monitoring) to 161 (10 Envoy instances x 16 worker threads each + 1 monitoring), and that Envoy experienced additional contention across those 160 outbound connections.
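If the per-worker connection pools are indeed the issue, one knob I haven't tuned yet is Envoy's --concurrency flag, which caps the number of worker threads (it defaults to the number of hardware threads) and therefore the number of connections each proxy opens to each Redis instance. A sketch of how it could be passed in the pod spec (container name, image tag, and the value of 4 are illustrative):

      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.21.0
          args:
            - --config-path
            - /etc/envoy/envoy.yaml
            - --concurrency
            - "4"              # e.g. match the CPU request rather than the node's hardware thread count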

Rather than running 10 low-spec instances sized identically to twemproxy's (1 CPU / 2 GB RAM), I switched to 3 higher-spec instances with 4 CPUs each, but still 2 GB RAM. After this, things stabilized, though there are still a few areas of concern.
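In Kubernetes terms, the new proxy pods are sized roughly like this (a sketch; exact requests/limits are whatever we tune per environment):

          resources:
            requests:
              cpu: "4"
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 2Gi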

Some graphs, questions

[Screenshot, 2022-01-05: connections per Redis instance, twemproxy vs Envoy]
twemproxy seems to maintain only a single connection to each Redis instance and proxies all commands through that connection. Envoy maintains 16 connections (one per worker thread) to each Redis instance.

[Screenshot, 2022-01-05: commands executed on Redis, before and after the switch]
Even after achieving stability with Envoy by increasing its resources, the number of commands executed on Redis has increased substantially despite the same workload. The left of the graph is on twemproxy; the right is after switching over to Envoy. Why would the same workload multiply the number of executed commands? Does Redis Proxy open a new connection per command and then issue a QUIT afterwards, whereas twemproxy maintains a persistent connection?
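One thing I may try in order to narrow this down (not yet tested in our cluster) is flipping enable_command_stats on and comparing Envoy's per-command counters on the admin endpoint (port 9901 above) against what the applications believe they are sending:

          settings:
            op_timeout: 1s
            enable_hashtagging: true
            enable_redirection: false
            enable_command_stats: true   # emits per-command counters and latency, at some extra cost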

[Screenshot, 2022-01-05: Redis operations per second]
The graph of Redis operations per second largely mirrors the number of commands. Here, too, we see a multiple of the rate we had on twemproxy.

[Screenshot, 2022-01-06: p95 latency after the Envoy resizing]
Our p95 latency is now roughly similar to what it was a week ago when we ran twemproxy, though Envoy is still a few milliseconds slower. I suspect that increasing the CPU allocation from 4 to 8 will get us much closer to twemproxy's performance.

Summary and questions

Overall, switching was relatively painless despite a couple of challenges. The differences in running twemproxy vs Envoy added a few surprises, which could be addressed via a migration guide. I'm happy to draft something up -- let me know if that'd be useful.

So some outstanding questions based on the above:

  1. Is there anything in the configuration which is either incorrect or could be improved?
  2. Does Envoy have a specific recommendation in terms of CPU sizing? Note that we don't specify a --concurrency setting. Should we?
  3. Why are we seeing a tripling of the number of commands despite a constant workload?
  4. Anything else you might suggest to improve overall performance?
