Skip to content

Fixed data race in parallel Submit and Stop calls #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

korotin
Copy link
Collaborator

@korotin korotin commented Mar 28, 2025

Hey @alitto !

I discovered that calling Submit after Close could lead to data race in WaitGroup since you cannot call WaitGroup.Add while WaitGroup.Wait is running.

To fix this I replaced WaitGroup with custom synchronization primitive called Stopper (yep, naming isn't my suit) which replicates WaitGroup behavior but properly handles AddWait race.

Can't say that I'm happy enough with my solution, so I'd be glad to hear your thoughts!

@korotin korotin requested a review from alitto March 28, 2025 16:07
@korotin korotin self-assigned this Mar 28, 2025
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 88.40580% with 8 lines in your changes missing coverage. Please review.

Project coverage is 94.02%. Comparing base (1ad365a) to head (49e6118).

Files with missing lines Patch % Lines
internal/stopper/stopper.go 86.66% 7 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
- Coverage   94.62%   94.02%   -0.61%     
==========================================
  Files          11       12       +1     
  Lines         725      786      +61     
==========================================
+ Hits          686      739      +53     
- Misses         35       42       +7     
- Partials        4        5       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@alitto
Copy link
Owner

alitto commented Mar 29, 2025

Hey @korotin!

This is a great find, thanks a lot!
Indeed, the stop logic is not synchronized with task submission so it could happen that a new worker is launched (and thus workersWaitGroup.Add(1) called) while waiting for the wait group.

I think we can fix this by reusing the existing mutex on the pool struct rather than using a different one, which could introduce some additional latencies (in my experience, calls to lock a mutex are pretty expensive so it's best to avoid it as much as possible).
I started to explore a solution in this PR #105 but i'll need to work on it for some time before merging it.

@korotin
Copy link
Collaborator Author

korotin commented Mar 29, 2025

My first attempt was to reuse pool's mutex but that didn't work well. We cannot guard waitgroup's calls with mutex because it may lead to deadlocks.

Anyway, I think it's best to close this one and continue discussion in your MR :)

@korotin korotin closed this Mar 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants