chore: increase timeout for rabbitmq probe #9030

jennifer-richards · 2025-06-19T18:43:48Z

Liveness probes for the RabbitMQ pod occasionally fail in production, sometimes causing the pod to be terminated and replaced. Apart from the interruptions caused by the roll-over, there are no signs of problems with the service. Notably, the celery worker keeps processing jobs without apparent interruption, which indicates that the message queue is operating. RabbitMQ itself does not report any errors, and its memory and CPU usage are unremarkable. There are some indications that the k8s node might be busy at the time of the mq pod restart (synthetics checks had slow responses at around the same time).

My suspicion is that once in a while, perhaps under heavy load, the rabbitmq-diagnostics ping command we use for the probe takes too long to execute. We're using a short (5s) timeout on the liveness probe. This PR sets a 30-second timeout on the ping command, which was previously using its default infinite timeout, and raises the k8s livenessProbe timeout to 35 seconds to allow time for the command to start and exit.
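
For readers who want to see how the two timeouts relate, here is a rough sketch of a probe stanza with the values described above. It is an illustration only, not the actual diff: the exact command arguments and the surrounding manifest layout are assumptions, and any other probe settings in the real manifest are left out.

```yaml
# Illustrative sketch only; not copied from the changed manifest.
livenessProbe:
  exec:
    # --timeout is given in seconds and bounds the ping itself
    # (the argument form used here is an assumption).
    command: ["rabbitmq-diagnostics", "-q", "ping", "--timeout", "30"]
  # Kubernetes-side limit for the whole exec, kept 5s above the command
  # timeout so the command has time to start and exit.
  timeoutSeconds: 35
```

Keeping the k8s timeout slightly longer than the command's own timeout means a slow ping should fail via rabbitmq-diagnostics' own exit rather than by the kubelet cutting the exec off.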

chore: increase timeout for rabbitmq probe

74330ce

jennifer-richards requested review from rjsparks and NGPixel June 19, 2025 18:43

NGPixel approved these changes Jun 19, 2025

rjsparks merged commit e93a56b into ietf-tools:main Jun 20, 2025
2 checks passed

jennifer-richards deleted the waiting-for-rabbits branch June 20, 2025 19:47

github-actions bot locked as resolved and limited conversation to collaborators Jun 24, 2025
