Description
Some background: we have 20 Redis instances serving our API plus various other backend jobs (collectively, our "applications"). In front of this, we had 10 twemproxy instances proxying traffic based on a consistent hash of the key. All of this runs within Kubernetes. Our applications access the proxy via a Kubernetes internal service based on a label selector. Our Redis instances are set up as a StatefulSet, with each pod in the set having a custom instance name, from redis-shard-0 through redis-shard-19.
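For context, the service in front of the proxies is nothing unusual; here is a minimal sketch of the selector-based routing (names and labels are illustrative, not our exact manifests):

# Illustrative sketch only, not our actual manifest. The internal Service picks
# the proxy pods by label, so swapping proxies is just a matter of changing the selector.
apiVersion: v1
kind: Service
metadata:
  name: redis-proxy            # hypothetical name
spec:
  selector:
    app: twemproxy             # later changed to our redis-envoy label
  ports:
    - port: 6379
      targetPort: 6379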
twemproxy configuration
default:
  listen: 0.0.0.0:6379
  hash: fnv1a_64
  hash_tag: "{}"
  distribution: ketama
  auto_eject_hosts: false
  timeout: 400
  redis: true
  preconnect: true
  servers:
    - redis-shard-0:6379:1 redis-shard-0
    - redis-shard-1:6379:1 redis-shard-1
    - redis-shard-2:6379:1 redis-shard-2
    - redis-shard-3:6379:1 redis-shard-3
    - redis-shard-4:6379:1 redis-shard-4
    - redis-shard-5:6379:1 redis-shard-5
    - redis-shard-6:6379:1 redis-shard-6
    - redis-shard-7:6379:1 redis-shard-7
    - redis-shard-8:6379:1 redis-shard-8
    - redis-shard-9:6379:1 redis-shard-9
    - redis-shard-10:6379:1 redis-shard-10
    - redis-shard-11:6379:1 redis-shard-11
    - redis-shard-12:6379:1 redis-shard-12
    - redis-shard-13:6379:1 redis-shard-13
    - redis-shard-14:6379:1 redis-shard-14
    - redis-shard-15:6379:1 redis-shard-15
    - redis-shard-16:6379:1 redis-shard-16
    - redis-shard-17:6379:1 redis-shard-17
    - redis-shard-18:6379:1 redis-shard-18
    - redis-shard-19:6379:1 redis-shard-19
Envoy Redis Proxy
My first attempt was to set up a basic Envoy configuration with the Redis Proxy filter and run 10 instances as well, reasoning that an appropriate scale had already been worked out and I might as well use the same. The setup looked like this (using latest, v1.21):
admin:
  access_log:
    - name: envoy.access_loggers.file
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
        path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
node:
  id: redis-envoy
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 6379 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.redis_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
                stat_prefix: egress_redis
                settings:
                  op_timeout: 1s
                  enable_hashtagging: true
                  enable_redirection: false
                  enable_command_stats: false
                prefix_routes:
                  catch_all_route:
                    cluster: redis-cluster
  clusters:
    - name: redis-cluster
      connect_timeout: 0.5s
      type: STRICT_DNS
      dns_lookup_family: V4_ONLY
      lb_policy: RING_HASH
      load_assignment:
        cluster_name: redis
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-0, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-1, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-2, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-3, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-4, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-5, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-6, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-7, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-8, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-9, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-10, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-11, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-12, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-13, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-14, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-15, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-16, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-17, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-18, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-19, port_value: 6379 }
cluster_manager:
  outlier_detection:
    event_log_path: /dev/stdout
layered_runtime:
  layers:
    - name: admin_layer
      admin_layer: {}
The ketama key distribution in twemproxy maps to the RING_HASH lb_policy in Envoy, and the other settings are quite similar, though I did increase the timeout somewhat to be conservative. I switched this over (changing the internal service's selector from our twemproxy label to the redis-envoy label) during a low-traffic period between Christmas and New Year's, and everything seemed stable.
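One knob I have not yet touched is the ring itself. Envoy exposes a ring_hash_lb_config on the cluster that might bring the distribution closer to ketama's behaviour; the values below are guesses on my part rather than something I have validated:

# Untested sketch: tune the RING_HASH ring on the redis-cluster definition.
clusters:
  - name: redis-cluster
    lb_policy: RING_HASH
    ring_hash_lb_config:
      minimum_ring_size: 1024    # guess; a larger ring should spread keys more evenly
      hash_function: XX_HASH     # Envoy's default; MURMUR_HASH_2 is the other option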
Problem
When our traffic ramped back up in early January, we started experiencing latency issues. Redis queries which had been taking single-digit milliseconds were now timing out after 1 second. I switched back to twemproxy and researched our configuration a little more.
After a few days, I noticed that the buffering settings aren't enabled by default, so I added the following to our Redis proxy settings:
max_buffer_size_before_flush: 1024
buffer_flush_timeout: 0.003s
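For context, these two fields sit alongside the other connection pool options under the filter's settings block; a minimal excerpt of how I slotted them in (other values as in the config above):

settings:
  op_timeout: 1s
  enable_hashtagging: true
  enable_redirection: false
  enable_command_stats: false
  max_buffer_size_before_flush: 1024
  buffer_flush_timeout: 0.003s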
This helped a little, but we still saw a latency spike, so I switched back to twemproxy again.
Our p95 latency for a PUT operation on Redis:
On the left you can see it exceeded 1 second; on the right you can see there was still a spike even after adding the buffering settings.
What I suspect was the issue is that the average number of connected clients per Redis instance jumped from 11 (10 twemproxy + 1 monitoring) to 161 (10 instances x 16 Envoy worker threads + 1 monitoring), and that Envoy experienced additional contention across those 160 outbound connections.
Rather than running 10 low-spec instances sized identically to twemproxy's (1 CPU / 2 GB RAM), I switched to 3 higher-spec instances with 4 CPUs each, but still 2 GB RAM. After this, things stabilized, though there are still a few areas of concern.
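For reference, a sketch of the pod sizing described above (the requests/limits split is my own convention, not prescriptive):

# Sketch of the resource settings for the 3 Envoy pods after resizing.
resources:
  requests:
    cpu: "4"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 2Gi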
Some graphs, questions
twemproxy seems to maintain only a single connection to each Redis instance and proxies all commands through that connection. Envoy maintains 16 connections to each Redis instance.
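If those 16 connections are one per worker thread, capping the thread count with the --concurrency flag should shrink the per-shard connection count accordingly. A sketch of what I have in mind (the image tag and the value of 2 are placeholders, not something I have tested):

# Hypothetical container spec: limit Envoy to 2 worker threads so each proxy
# holds 2 connections per Redis shard instead of 16.
containers:
  - name: envoy
    image: envoyproxy/envoy:v1.21.0        # placeholder tag
    args: ["-c", "/etc/envoy/envoy.yaml", "--concurrency", "2"]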
Even after achieving stability with Envoy by increasing its resources, the number of commands executed on Redis has increased substantially despite the same workload. The left of the graph is on twemproxy; the right is after switching over to Envoy. Why would the same workload multiply the number of executed commands? Does Redis Proxy open a new connection per command and then issue a QUIT afterwards, whereas twemproxy maintains a persistent connection?
The graph of Redis operations/second largely mirrors the number of commands. Here, too, we see a multiple of what we had on twemproxy.
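One way I may try to narrow down where the extra commands come from is to flip on command stats (currently disabled in our config) and compare per-command counters before and after; the stat naming below is my reading of the docs rather than something I have verified:

# Sketch: flip on per-command stats in the redis_proxy filter settings
# (other settings unchanged). With this on, the admin /stats endpoint should
# expose per-command counters, something like redis.egress_redis.command.<command>.total.
settings:
  enable_command_stats: true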
Our p95 latency is now roughly what it was a week ago on twemproxy, though Envoy is still a few milliseconds slower. I suspect that increasing the CPU allocation from 4 to 8 will get us much closer to twemproxy's performance.
Summary and questions
Overall, switching was relatively painless despite a couple of challenges. The differences between running twemproxy and running Envoy held a few surprises, which could be addressed via a migration guide. I'm happy to draft something up -- let me know if that'd be useful.
So some outstanding questions based on the above:
- Is there anything in the configuration which is either incorrect or could be improved?
- Does Envoy have a specific recommendation in terms of CPU sizing? Note that we don't specify a --concurrency setting. Should we?
- Why are we seeing a tripling of the number of commands despite a constant workload?
- Anything else you might suggest to improve overall performance?