Description
Hi,
We're seeing Concourse jobs stuck in the pending state for a very long time, along with lots of slow queries on our Postgres server. Resources are also slow to trigger, sometimes taking hours, even when check_every is not set (defaulting to 1m). When things get really bad, the list of jobs on the left in the GUI is not updated.
The first time this happened, we got some relief by reducing the amount of logs being retained via build_logs_to_retain for a job on a 3m timer.
The problem recently struck again and we found another job that needed the build_logs_to_retain treatment. However, once that was addressed, we still see the pending state and slow triggers. We also see jobs start to run, but it takes minutes for their tasks to get going.
Clues in database:
- Several instances of the query REFRESH MATERIALIZED VIEW CONCURRENTLY latest_completed_builds_per_job are often running in parallel. They nominally take a few seconds when things seem fine, but often minutes or over an hour when we are experiencing the aforementioned issues.
- Load average on the database is often 30-40, with 8 cores at hand.
- Usually seeing 40-50 active connections into the database from Concourse, sometimes over 100 reported.
- Table concourseinfrastructureprod.builds went from 20000+ down to 4000 when we most recently removed/refactored a job with excessive build logs retained.
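For anyone who wants to check their own database for the same pattern, here is a rough Python sketch of how the parallel refreshes could be counted. It is an illustration under assumptions, not a definitive diagnostic: ACTIVITY_SQL only uses pg_stat_activity columns that exist in Postgres 9.5, while summarize_refreshes, its slow_after threshold, and the sample rows are hypothetical names and values made up for this example. It also assumes the database client hands back the runtime interval as a datetime.timedelta (psycopg2 does this, as far as I know).

```python
from datetime import timedelta

# Run this with any Postgres client; it lists active statements, longest first.
# Columns pid, state, query_start and query are present in Postgres 9.5.
ACTIVITY_SQL = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;
"""

# The statement text reported in this issue.
REFRESH_PREFIX = (
    "REFRESH MATERIALIZED VIEW CONCURRENTLY latest_completed_builds_per_job"
)


def summarize_refreshes(rows, slow_after=timedelta(minutes=1)):
    """Summarize concurrent refreshes of the view (hypothetical helper).

    rows: iterable of (pid, runtime, query) tuples as returned by
    ACTIVITY_SQL, with runtime given as a timedelta.
    slow_after: assumed threshold above which a refresh counts as slow.
    """
    prefix = REFRESH_PREFIX.upper()
    refreshes = [r for r in rows if r[2].strip().upper().startswith(prefix)]
    slow = [r for r in refreshes if r[1] >= slow_after]
    longest = max((r[1] for r in refreshes), default=timedelta(0))
    return {"running": len(refreshes), "slow": len(slow), "longest": longest}


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the input.
    sample = [
        (101, timedelta(seconds=4), REFRESH_PREFIX),
        (102, timedelta(minutes=12), REFRESH_PREFIX),
        (103, timedelta(seconds=1), "SELECT 1"),
    ]
    print(summarize_refreshes(sample))
```

Polling this every minute or so and logging the result would show whether the number of overlapping refreshes grows at the same time the jobs get stuck in pending.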
Other details:
- Concourse version: 3.8.0
- Deployment type: BOSH
- Infrastructure/IaaS: Openstack
- Browser (if applicable): Chrome
- Database: Postgres 9.5, 8 core CPU, no significant I/O wait
- 5 workers
- Concourse web and worker VMs all sized reasonably to our knowledge (low cpu/load)
- Did this used to work? - Yes
Please help the Concourse fans on my team keep using Concourse, so that mgmt doesn't force us onto Jenkins. :)
Thanks!
Aaron