Potential memory/goroutines leaks #8638

@gaelL

Summary

We upgraded from 6.7.6 to 7.8.3.
Our Concourse web node serves many pipelines, teams, and dedicated team workers.

Since the upgrade, memory usage on the web node has grown steadily until the process is killed by the OOM killer.

The Prometheus metrics show the number of goroutines increasing as well.

image

image

We also noticed something interesting:

image

Steps to reproduce

Not sure

Expected results

Actual results

Additional context

This Concourse server runs only one web component. The instance has been running since Concourse version 4 and has quite a lot of teams and pipelines.
On this Concourse, several workers are attached to specific teams (there are no global/default workers).
The database is around 8 GB; we have quite a lot of resource versions because this cluster is quite old.

The database CPU usage is quite high (the database instance has 3 vCPUs and 4 GB RAM).
image

Slow db logs

2022-11-24 01:34:55 UTC [6129]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 LOG:  00000: process 6129 acquired ExclusiveLock on tuple (40,125) of relation 1733315 of database 1730368 after 1028.441 ms
2022-11-24 01:34:55 UTC [6129]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 LOCATION:  ProcSleep, proc.c:1495
2022-11-24 01:34:55 UTC [6129]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 STATEMENT:
            UPDATE resource_config_scopes
            SET last_check_start_time = now(), last_check_build_id = $1, last_check_build_plan = $2
            WHERE id = $3

2022-11-24 01:34:55 UTC [6150]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 LOG:  00000: process 6150 acquired ShareLock on transaction 2106307433 after 1119.776 ms
2022-11-24 01:34:55 UTC [6150]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 CONTEXT:  while updating tuple (241,88) in relation "resource_config_scopes"
2022-11-24 01:34:55 UTC [6150]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 LOCATION:  ProcSleep, proc.c:1495
2022-11-24 01:34:55 UTC [6150]: user=concourse,db=concourse,app=[unknown],client=51.159.74.196 STATEMENT:
            UPDATE resource_config_scopes
            SET last_check_start_time = now(), last_check_build_id = $1, last_check_build_plan = $2
            WHERE id = $3
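Lines of the form "process N acquired ... after M ms" are what PostgreSQL writes when `log_lock_waits` is enabled and a lock wait exceeds `deadlock_timeout`. To see how often and how long these waits occur, the log can be summarized with a small script. The following is only a sketch: `lockWait` and `parseLockWaits` are hypothetical names, and the regular expression is an assumption based on the two log entries above, not on a full survey of PostgreSQL's message formats.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lockWaitRe matches messages such as
// "process 6129 acquired ExclusiveLock on tuple (40,125) ... after 1028.441 ms".
// The pattern is an assumption derived from the log excerpt in this issue.
var lockWaitRe = regexp.MustCompile(`acquired (\w+) on .* after ([0-9.]+) ms`)

// lockWait is a hypothetical record for one logged lock wait.
type lockWait struct {
	Mode   string  // lock mode, e.g. "ExclusiveLock" or "ShareLock"
	Millis float64 // time spent waiting, in milliseconds
}

// parseLockWaits extracts the lock mode and wait time from each matching line
// and silently skips lines that do not match.
func parseLockWaits(lines []string) []lockWait {
	var waits []lockWait
	for _, line := range lines {
		m := lockWaitRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		ms, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue
		}
		waits = append(waits, lockWait{Mode: m[1], Millis: ms})
	}
	return waits
}

func main() {
	logLines := []string{
		"LOG:  00000: process 6129 acquired ExclusiveLock on tuple (40,125) of relation 1733315 of database 1730368 after 1028.441 ms",
		"LOG:  00000: process 6150 acquired ShareLock on transaction 2106307433 after 1119.776 ms",
	}
	for _, w := range parseLockWaits(logLines) {
		fmt.Printf("%-14s waited %.3f ms\n", w.Mode, w.Millis)
	}
}
```

Both waits in the excerpt are on `resource_config_scopes` rows, so grouping the parsed waits by statement would be a natural next step.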

Traces

Some traces I managed to get with curl http://localhost:$CONCOURSE_DEBUG_BIND_PORT/debug/pprof/goroutine?debug=1

goroutine profile: total 353596
349824 @ 0x43ca56 0x44d99e 0x44d975 0x46a585 0x478945 0x941e2e 0x941e0e 0x941cf7 0x941b26 0xc4e96d 0xe966ef 0xe8a89e 0x46e781
#   0x46a584    sync.runtime_SemacquireMutex+0x24                           runtime/sema.go:77
#   0x478944    sync.(*Mutex).lockSlow+0x164                                sync/mutex.go:171
#   0x941e2d    sync.(*Mutex).Lock+0x6d                                 sync/mutex.go:90
#   0x941e0d    github.com/concourse/concourse/atc/db/lock.(*lock).Acquire+0x4d             github.com/concourse/concourse/atc/db/lock/lock.go:217
#   0x941cf6    github.com/concourse/concourse/atc/db/lock.(*lockFactory).Acquire+0x156         github.com/concourse/concourse/atc/db/lock/lock.go:175
#   0x941b25    github.com/concourse/concourse/atc/db/lock.lockFactories.Acquire+0xa5           github.com/concourse/concourse/atc/db/lock/lock.go:161
#   0xc4e96c    github.com/concourse/concourse/atc/db.(*inMemoryCheckBuild).AcquireTrackingLock+0xec    github.com/concourse/concourse/atc/db/build_in_memory_check.go:406
#   0xe966ee    github.com/concourse/concourse/atc/engine.(*engineBuild).Run+0x16e          github.com/concourse/concourse/atc/engine/engine.go:115
#   0xe8a89d    github.com/concourse/concourse/atc/builds.(*Tracker).trackBuild.func1+0x31d     github.com/concourse/concourse/atc/builds/tracker.go:109

820 @ 0x43ca56 0x44c8dc 0xe96de5 0xe8a89e 0x46e781
#   0xe96de4    github.com/concourse/concourse/atc/engine.(*engineBuild).Run+0x864      github.com/concourse/concourse/atc/engine/engine.go:218
#   0xe8a89d    github.com/concourse/concourse/atc/builds.(*Tracker).trackBuild.func1+0x31d github.com/concourse/concourse/atc/builds/tracker.go:109
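When comparing several dumps over time, it helps to reduce each one to "count per stack" pairs. Below is a minimal sketch that parses the text served by `/debug/pprof/goroutine?debug=1`. The names `stackGroup` and `parseGoroutineProfile` are hypothetical, and the parser assumes the layout shown above: a `<count> @ <pcs...>` header followed by `#` frame lines.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// stackGroup is one entry of a debug=1 goroutine profile: the number of
// goroutines sharing a stack, plus the function names listed under it.
type stackGroup struct {
	Count  int
	Frames []string
}

// parseGoroutineProfile returns the stack groups found in text, sorted by
// descending goroutine count. Lines it does not recognize are ignored.
func parseGoroutineProfile(text string) []stackGroup {
	var groups []stackGroup
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		switch {
		case len(fields) >= 2 && fields[1] == "@":
			// Header line: "<count> @ <pc> <pc> ..."
			n, err := strconv.Atoi(fields[0])
			if err != nil {
				continue
			}
			groups = append(groups, stackGroup{Count: n})
		case len(fields) >= 3 && fields[0] == "#" && len(groups) > 0:
			// Frame line: "# <pc> <function>+<offset> <file:line>"
			g := &groups[len(groups)-1]
			g.Frames = append(g.Frames, fields[2])
		}
	}
	sort.SliceStable(groups, func(i, j int) bool {
		return groups[i].Count > groups[j].Count
	})
	return groups
}

func main() {
	// Shortened excerpt of the dump above.
	sample := `goroutine profile: total 353596
349824 @ 0x43ca56 0x44d99e
#   0x46a584    sync.runtime_SemacquireMutex+0x24    runtime/sema.go:77
#   0x478944    sync.(*Mutex).lockSlow+0x164    sync/mutex.go:171
820 @ 0x43ca56 0x44c8dc
#   0xe96de4    github.com/concourse/concourse/atc/engine.(*engineBuild).Run+0x864    github.com/concourse/concourse/atc/engine/engine.go:218
`
	for _, g := range parseGoroutineProfile(sample) {
		top := "(no frames)"
		if len(g.Frames) > 0 {
			top = g.Frames[0]
		}
		fmt.Printf("%7d  %s\n", g.Count, top)
	}
}
```

In the dump above, 349824 of the 353596 goroutines share a single stack: they are blocked in `sync.(*Mutex).Lock`, called from `lock.(*lock).Acquire` via `(*inMemoryCheckBuild).AcquireTrackingLock`.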

image

Workaround

We found a workaround: increasing the component runner interval from 10s to 30s by setting
CONCOURSE_COMPONENT_RUNNER_INTERVAL: 30s

This topic was also raised on Discord: https://discord.com/channels/219899946617274369/413770960089382922/1045614009317077022

Triaging info

  • Concourse version: 7.8.3
  • Browser (if applicable):
  • Did this used to work?
