Skip to content

fork_worker worker termination during phased-restart #2786

@vitiokss

Description

@vitiokss

Describe the bug

During phased-restart, while running with fork_worker option, worker 0 stops to receive ping message. While the puma process is running, constantly we are doing timeout_workers, which checks if workers are responding. Because worker 0, is not receiving "p" message (ping), it is not updating @last_checkin value (WorkerHandle class), which leads to throwing Terminating timed out worker... error after exceeding timeout (default: 60), which kills it and starts the loop once again. We end up in the state where puma constantly restarting its workers.

Puma config:

threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count

port        ENV.fetch("PORT") { 3000 }

environment ENV.fetch("RAILS_ENV") { "development" }

preload_app! false

fork_worker 0

To Reproduce
It is quite easy to reproduce this using "hello world" rack application. The only thing you need to do, just run the application using a lot of workers, in my example 32 was more than enough to trigger the behavior.

Spin puma process:
bundle exec puma -C config/puma.rb -e development --threads 4:16 --workers 32 --tag test --pidfile tmp/puma.pid --state tmp/puma.state --dir . --control-url auto hello.ru

Trigger phased-restart:
pumactl -s tmp/puma.state phased-restart

Logs with additional puts

w.ping!(result.sub(/^\d+/,'').chomp)

in when "p" add p "worker: #{w.index}, pid: #{w.pid} received p msg!" if w.index == 0 [lib/puma/cluster.rb:448]

if !w.term? && w.ping_timeout <= Time.now

add p "timeout_workers: #{w.ping_timeout}, pid: #{w.pid}, booted: #{w.booted?}, index: #{w.index}, term: #{w.term?}" if w.index == 0 [lib/puma/cluster.rb:518]

Log looks like this:
It is easy to see, once phased restart started ping message is not sent to worker 0, and ping_timeout value for worker 0 is not being updated, even though it is healthy and responding.

"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Worker 31 (PID: 9858) booted in 0.07s, phase: 0
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:12 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:17 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
[9814] - Starting phased worker restart, phase: 1
[9814] + Changing to .
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Stopping 9827 for phased upgrade...
[9814] - TERM sent to 9827...
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: true"
[9814] - Worker 0 (PID: 9878) booted in 0.02s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9828 for phased upgrade...
[9814] - TERM sent to 9828...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 1 (PID: 9891) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9829 for phased upgrade...
[9814] - TERM sent to 9829...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 2 (PID: 9892) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9830 for phased upgrade...
[9814] - TERM sent to 9830...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 3 (PID: 9894) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9831 for phased upgrade...
[9814] - TERM sent to 9831...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 4 (PID: 9895) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
**[9814] - Stopping 9832 for phased upgrade...**
[9814] - TERM sent to 9832...
...

Expected behavior
Once worker 0 is restarted we should send a ping "p" message to its thread as well, and not wait until all the workers is up and running.

Desktop:

  • OS: Mac
  • Puma Version: 5.5.2

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions