refactor: More granular locking in WorkerCache #9118

analytically · 2025-03-14T10:35:56Z

~~The previous implementation suffered from several race conditions~~:

~~In upsertWorker(), stale worker data could be cached after a worker was deleted~~
~~Container counts could be tracked for non-existent workers~~
~~Multiple goroutines could trigger redundant DB refreshes~~

This PR refactors the current implementation by:

Splitting the single mutex into separate locks for data and refresh operations
Using atomic operations for refresh state tracking
Adding proper worker existence validation
Implementing safer refresh coordination
Reducing lock contention by keeping DB operations outside critical sections

taylorsilva

I like how this cleans up the locking. Feels easier to read. I don't think there was any race condition previously though. You've added more granular locking and made it easier to read the code. Maybe you could explain how these race conditions were occurring as I couldn't reason it out looking at the old code.

Multiple goroutines could trigger redundant DB refreshes

Not sure how in the previous implementation both go routines could cause db refreshes. The default refresh interval is also hard-coded to 1 minute.

Looking at the original implementation, if two go routines called refreshWorkerData() at the same time:

go routine 1	go routine 2
gets the lock	waits for lock
`needsRefresh()` returns true	waits for lock
refreshes data	waits for lock
updates refresh time	waits for lock
releases lock	gets lock
	`needsRefresh()` returns false

Is there a different way that could play out, even for N number of go routines? I don't see how they previously could have queried the db back to back. Once the next go routine gets the lock they'd find out no refresh is required.

This file also doesn't have extensive tests coverage. There's literally one test covering it: https://github.com/concourse/concourse/blob/master/atc/db/worker_cache_test.go
I'm not suggesting to add tests, just pointing out that this code doesn't have a big test harness like most other areas in atc/db/ do. We also do run tests with the -race flag which catches race conditions if they're present. I used it a lot when I made the notify listener for pgx. I don't recall seeing the tests for this file fail from race conditions being detected.

atc/db/worker_cache.go

The previous implementation suffered from several race conditions: 1. In upsertWorker(), stale worker data could be cached after a worker was deleted 2. Container counts could be tracked for non-existent workers 3. Multiple goroutines could trigger redundant DB refreshes This commit resolves these issues by: - Splitting the single mutex into separate locks for data and refresh operations - Using atomic operations for refresh state tracking - Adding proper worker existence validation - Implementing safer refresh coordination - Reducing lock contention by keeping DB operations outside critical sections Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>

Signed-off-by: Mathias Bogaert <254434+analytically@users.noreply.github.com>

* maps was not imported which resulted in an error when calling maps.Copy() * initialized maps while initializing the struct * Fixed values passed into CompareAndSwap() as they weren't updated after the underlying type was changed from an int to a bool * nit: removed some verbose commenting. Felt like AI commenting up the code. Signed-off-by: Taylor Silva <dev@taydev.net>

analytically requested a review from a team as a code owner March 14, 2025 10:35

taylorsilva requested changes Apr 4, 2025

View reviewed changes

atc/db/worker_cache.go Outdated Show resolved Hide resolved

atc/db/worker_cache.go Outdated Show resolved Hide resolved

atc/db/worker_cache.go Outdated Show resolved Hide resolved

atc/db/worker_cache.go Outdated Show resolved Hide resolved

taylorsilva added the misc label Apr 4, 2025

taylorsilva changed the title ~~Fix race conditions in WorkerCache implementation, improve docs~~ refactor: More granular locking in WorkerCache Apr 4, 2025

taylorsilva added this to Pull Requests Apr 30, 2025

taylorsilva moved this to Todo in Pull Requests Apr 30, 2025

taylorsilva moved this from Todo to In Progress in Pull Requests Apr 30, 2025

taylorsilva force-pushed the workercache branch from f9ce983 to 1964b55 Compare April 30, 2025 16:23

analytically and others added 2 commits April 30, 2025 12:24

Update atc/db/worker_cache.go

e785028

Signed-off-by: Mathias Bogaert <254434+analytically@users.noreply.github.com>

taylorsilva force-pushed the workercache branch from 1964b55 to fdb80e6 Compare April 30, 2025 16:25

taylorsilva approved these changes Apr 30, 2025

View reviewed changes

taylorsilva merged commit e715419 into concourse:master Apr 30, 2025
12 checks passed

github-project-automation bot moved this from In Progress to Done in Pull Requests Apr 30, 2025

analytically deleted the workercache branch April 30, 2025 19:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

refactor: More granular locking in WorkerCache #9118

refactor: More granular locking in WorkerCache #9118

Uh oh!

analytically commented Mar 14, 2025 •

edited by taylorsilva

Loading

Uh oh!

taylorsilva left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

refactor: More granular locking in WorkerCache #9118

refactor: More granular locking in WorkerCache #9118

Uh oh!

Conversation

analytically commented Mar 14, 2025 • edited by taylorsilva Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taylorsilva left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

analytically commented Mar 14, 2025 •

edited by taylorsilva

Loading

taylorsilva left a comment •

edited

Loading