
chore: increase timeout for rabbitmq probe #9030


Merged: 1 commit merged into ietf-tools:main on Jun 20, 2025

Conversation

@jennifer-richards (Member) commented Jun 19, 2025

Liveness probes for the RabbitMQ pod are failing occasionally in production, sometimes leading to the pod being terminated and replaced. Other than interruptions caused by the roll-over, there are no signs of problems with the service. Notably, the celery worker is processing jobs without apparent interruption, which indicates that the message queue is operating. RabbitMQ itself does not report any errors, and its memory/CPU usage is unremarkable. There are some indications that the k8s node might be busy at the time of the mq pod restart (synthetics checks had slow responses at around the same time).

My suspicion is that once in a while, perhaps during heavy load, the rabbitmq-diagnostics ping command we use is taking too long to execute, and we are using a short (5s) timeout on the liveness probe. This PR bumps the timeout on the ping command itself, which was previously using its default infinite timeout, to 30 seconds. The k8s livenessProbe timeout is set to 35 seconds to allow time for the command to start and exit.
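For illustration, here is a minimal sketch of what the resulting probe configuration could look like in the pod spec. The field names follow the Kubernetes livenessProbe API; the probe interval and failure threshold are assumed values for illustration (they are not specified in this PR), and the `--timeout` flag on rabbitmq-diagnostics is assumed to take a value in seconds.

```yaml
# Hedged sketch only, not the actual diff from this PR.
livenessProbe:
  exec:
    command:
      - rabbitmq-diagnostics
      - ping
      - --timeout
      - "30"             # bound the ping itself to 30s instead of its default
  timeoutSeconds: 35     # k8s-side limit, slightly longer so the command can start and exit
  periodSeconds: 30      # assumed probe interval (not taken from the PR)
  failureThreshold: 3    # assumed; the pod is restarted only after repeated failures
```

Keeping the command-level timeout (30s) shorter than the k8s-side timeoutSeconds (35s) means the ping fails on its own terms before Kubernetes gives up on the exec call, which makes the probe result reflect RabbitMQ's responsiveness rather than the node's scheduling latency.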

@rjsparks merged commit e93a56b into ietf-tools:main on Jun 20, 2025
2 checks passed
@jennifer-richards deleted the waiting-for-rabbits branch on June 20, 2025 at 19:47
@github-actions bot locked as resolved and limited conversation to collaborators on Jun 24, 2025