Skip to content

Conversation

analytically
Copy link
Contributor

@analytically analytically commented Mar 14, 2025

The previous implementation suffered from several race conditions:

  1. In upsertWorker(), stale worker data could be cached after a worker was deleted
  2. Container counts could be tracked for non-existent workers
  3. Multiple goroutines could trigger redundant DB refreshes

This PR refactors the current implementation by:

  • Splitting the single mutex into separate locks for data and refresh operations
  • Using atomic operations for refresh state tracking
  • Adding proper worker existence validation
  • Implementing safer refresh coordination
  • Reducing lock contention by keeping DB operations outside critical sections

@analytically analytically requested a review from a team as a code owner March 14, 2025 10:35
Copy link
Member

@taylorsilva taylorsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this cleans up the locking. Feels easier to read. I don't think there was any race condition previously though. You've added more granular locking and made it easier to read the code. Maybe you could explain how these race conditions were occurring as I couldn't reason it out looking at the old code.

Multiple goroutines could trigger redundant DB refreshes

Not sure how in the previous implementation both go routines could cause db refreshes. The default refresh interval is also hard-coded to 1 minute.

Looking at the original implementation, if two go routines called refreshWorkerData() at the same time:

go routine 1 go routine 2
gets the lock waits for lock
needsRefresh() returns true waits for lock
refreshes data waits for lock
updates refresh time waits for lock
releases lock gets lock
needsRefresh() returns false

Is there a different way that could play out, even for N number of go routines? I don't see how they previously could have queried the db back to back. Once the next go routine gets the lock they'd find out no refresh is required.

This file also doesn't have extensive tests coverage. There's literally one test covering it: https://github.com/concourse/concourse/blob/master/atc/db/worker_cache_test.go
I'm not suggesting to add tests, just pointing out that this code doesn't have a big test harness like most other areas in atc/db/ do. We also do run tests with the -race flag which catches race conditions if they're present. I used it a lot when I made the notify listener for pgx. I don't recall seeing the tests for this file fail from race conditions being detected.

@taylorsilva taylorsilva changed the title Fix race conditions in WorkerCache implementation, improve docs refactor: More granular locking in WorkerCache Apr 4, 2025
@taylorsilva taylorsilva moved this to Todo in Pull Requests Apr 30, 2025
@taylorsilva taylorsilva moved this from Todo to In Progress in Pull Requests Apr 30, 2025
The previous implementation suffered from several race conditions:
1. In upsertWorker(), stale worker data could be cached after a worker was deleted
2. Container counts could be tracked for non-existent workers
3. Multiple goroutines could trigger redundant DB refreshes

This commit resolves these issues by:
- Splitting the single mutex into separate locks for data and refresh operations
- Using atomic operations for refresh state tracking
- Adding proper worker existence validation
- Implementing safer refresh coordination
- Reducing lock contention by keeping DB operations outside critical sections

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
analytically and others added 2 commits April 30, 2025 12:24
Signed-off-by: Mathias Bogaert <254434+analytically@users.noreply.github.com>
* maps was not imported which resulted in an error when calling
  maps.Copy()
* initialized maps while initializing the struct
* Fixed values passed into CompareAndSwap() as they weren't updated
  after the underlying type was changed from an int to a bool
* nit: removed some verbose commenting. Felt like AI commenting up the
  code.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva merged commit e715419 into concourse:master Apr 30, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Pull Requests Apr 30, 2025
@analytically analytically deleted the workercache branch April 30, 2025 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants