refactor: More granular locking in WorkerCache #9118
Conversation
I like how this cleans up the locking. Feels easier to read. I don't think there was any race condition previously though. You've added more granular locking and made it easier to read the code. Maybe you could explain how these race conditions were occurring as I couldn't reason it out looking at the old code.
> Multiple goroutines could trigger redundant DB refreshes

I'm not sure how, in the previous implementation, both goroutines could cause DB refreshes. The default refresh interval is also hard-coded to 1 minute.
Looking at the original implementation, if two goroutines called refreshWorkerData()
at the same time:

| goroutine 1 | goroutine 2 |
|---|---|
| gets the lock | waits for lock |
| needsRefresh() returns true | waits for lock |
| refreshes data | waits for lock |
| updates refresh time | waits for lock |
| releases lock | gets lock |
| | needsRefresh() returns false |
Is there a different way that could play out, even for N goroutines? I don't see how they could have previously queried the DB back to back: once the next goroutine gets the lock, it would find that no refresh is required.
This file also doesn't have extensive test coverage. There's literally one test covering it: https://github.com/concourse/concourse/blob/master/atc/db/worker_cache_test.go
I'm not suggesting adding tests, just pointing out that this code doesn't have a big test harness like most other areas in atc/db/
do. We also run tests with the -race
flag, which catches race conditions if they're present. I used it a lot when I made the notify listener for pgx. I don't recall seeing the tests for this file fail due to detected race conditions.
The previous implementation suffered from several race conditions:

1. In upsertWorker(), stale worker data could be cached after a worker was deleted
2. Container counts could be tracked for non-existent workers
3. Multiple goroutines could trigger redundant DB refreshes

This commit resolves these issues by:

- Splitting the single mutex into separate locks for data and refresh operations
- Using atomic operations for refresh state tracking
- Adding proper worker existence validation
- Implementing safer refresh coordination
- Reducing lock contention by keeping DB operations outside critical sections

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Force-pushed from f9ce983 to 1964b55
Signed-off-by: Mathias Bogaert <254434+analytically@users.noreply.github.com>
* maps was not imported, which resulted in an error when calling maps.Copy()
* Initialized maps while initializing the struct
* Fixed values passed into CompareAndSwap(), as they weren't updated after the underlying type was changed from an int to a bool
* nit: removed some verbose commenting; it felt like AI commenting up the code

Signed-off-by: Taylor Silva <dev@taydev.net>
Force-pushed from 1964b55 to fdb80e6
The previous implementation suffered from several race conditions:

1. In upsertWorker(), stale worker data could be cached after a worker was deleted
2. Container counts could be tracked for non-existent workers
3. Multiple goroutines could trigger redundant DB refreshes

This PR refactors the current implementation by: