-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Description
Describe the bug
During phased-restart
, while running with fork_worker
option, worker 0 stops to receive ping message. While the puma process is running, constantly we are doing timeout_workers
, which checks if workers are responding. Because worker 0, is not receiving "p" message (ping), it is not updating @last_checkin
value (WorkerHandle class), which leads to throwing Terminating timed out worker...
error after exceeding timeout (default: 60), which kills it and starts the loop once again. We end up in the state where puma
constantly restarting its workers.
Puma config:
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count
port ENV.fetch("PORT") { 3000 }
environment ENV.fetch("RAILS_ENV") { "development" }
preload_app! false
fork_worker 0
To Reproduce
It is quite easy to reproduce this using "hello world" rack application. The only thing you need to do, just run the application using a lot of workers, in my example 32 was more than enough to trigger the behavior.
Spin puma process:
bundle exec puma -C config/puma.rb -e development --threads 4:16 --workers 32 --tag test --pidfile tmp/puma.pid --state tmp/puma.state --dir . --control-url auto hello.ru
Trigger phased-restart:
pumactl -s tmp/puma.state phased-restart
Logs with additional puts
Line 454 in cec07ac
w.ping!(result.sub(/^\d+/,'').chomp) |
in
when "p"
add p "worker: #{w.index}, pid: #{w.pid} received p msg!" if w.index == 0
[lib/puma/cluster.rb:448]
Line 516 in a2bcda4
if !w.term? && w.ping_timeout <= Time.now |
add
p "timeout_workers: #{w.ping_timeout}, pid: #{w.pid}, booted: #{w.booted?}, index: #{w.index}, term: #{w.term?}" if w.index == 0
[lib/puma/cluster.rb:518]
Log looks like this:
It is easy to see, once phased restart started ping message is not sent to worker 0, and ping_timeout
value for worker 0 is not being updated, even though it is healthy and responding.
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Worker 31 (PID: 9858) booted in 0.07s, phase: 0
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:12 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:17 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
[9814] - Starting phased worker restart, phase: 1
[9814] + Changing to .
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Stopping 9827 for phased upgrade...
[9814] - TERM sent to 9827...
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: true"
[9814] - Worker 0 (PID: 9878) booted in 0.02s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9828 for phased upgrade...
[9814] - TERM sent to 9828...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 1 (PID: 9891) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9829 for phased upgrade...
[9814] - TERM sent to 9829...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 2 (PID: 9892) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9830 for phased upgrade...
[9814] - TERM sent to 9830...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 3 (PID: 9894) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9831 for phased upgrade...
[9814] - TERM sent to 9831...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 4 (PID: 9895) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
**[9814] - Stopping 9832 for phased upgrade...**
[9814] - TERM sent to 9832...
...
Expected behavior
Once worker 0 is restarted we should send a ping "p" message to its thread as well, and not wait until all the workers is up and running.
Desktop:
- OS: Mac
- Puma Version: 5.5.2