Skip to content

Conversation

vasild
Copy link
Contributor

@vasild vasild commented Sep 27, 2022

Silence false positives from TSAN about unsynchronized calls to BaseIndex::~BaseIndex() and BaseIndex::SetBestBlockIndex(). They are synchronized, but beyond the comprehension of TSAN - by SyncWithValidationInterfaceQueue(), called from BaseIndex::BlockUntilSyncedToCurrentChain().

Fixes #25365

@vasild
Copy link
Contributor Author

vasild commented Sep 27, 2022

A call to SyncWithValidationInterfaceQueue() also fixes it. I am +0.1 on stopping the scheduler thread and flushing the queue - looks more robust to me.

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. This may fix the bug, but it could be better.

@ryanofsky
Copy link
Contributor

A call to SyncWithValidationInterfaceQueue() also fixes it. I am +0.1 on stopping the scheduler thread and flushing the queue - looks more robust to me.

Hmm, I would think the opposite SyncWithValidationInterfaceQueue seems like the more robust approach to me. It forces all events that may have been posted before the index is stopped to be processed before the index destructor is called.

Stopping the scheduler seems more indirect and more error prone because it make additional assumptions that validationinterface uses the scheduler internally and that it is safe to stop the scheduler at all because nothing else depends on it.

Also agree with Marco's suggestion that it would be good to stick to consistent approach across tests, which also favors SyncWithValidationInterfaceQueue

@vasild vasild force-pushed the fix_coinstatsindex_initial_sync branch from eaec74f to 40ef8e0 Compare September 29, 2022 08:19
@vasild
Copy link
Contributor Author

vasild commented Sep 29, 2022

eaec74f3b9...40ef8e064d: @ryanofsky, @MarcoFalke fair enough! Switched to SyncWithValidationInterfaceQueue() :-)

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -79,6 +79,10 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
// Shutdown sequence (c.f. Shutdown() in init.cpp)
coin_stats_index.Stop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated: I think the two Stop(), here and in fa176e2 can be removed, as they are done by the destructor anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that all calls to Stop() are done just before the objects are destroyed (except the call from the destructor itself). Thus Stop() can be removed and its (short) body moved inside BaseIndex::~BaseIndex(). If this is the case then this deserves a separate PR. Am I missing something?

Copy link
Contributor

@ryanofsky ryanofsky Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calls to Stop need to happen before the BaseIndex destructor runs because they are responsible for joining the sync thread, which can call virtual methods. By the time the BaseIndex destructor runs, whatever subclass inherited from BaseIndex is partially destroyed and the virtual method calls won't work.

The call to Stop() in the destructor was never valid and I have a commit c499c87 in #24230 which removes it and asserts that it was called previously.

@vasild vasild force-pushed the fix_coinstatsindex_initial_sync branch from 40ef8e0 to 3ae1fe4 Compare September 29, 2022 09:52
@vasild
Copy link
Contributor Author

vasild commented Sep 29, 2022

40ef8e064d...3ae1fe4498: take suggestion to remove a comment

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3ae1fe4, but I think f08c9fb from #21726 added a real race that this workaround doesn't really address, and I made some suggestions below.

Comment on lines 82 to 80
// Make sure coin_stats_index is not used by the scheduler thread after it
// has been destroyed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "test: fix race in shutdown code in coinstatsindex_initial_sync" (3ae1fe4)

I think it would be better to move the SyncWithValidationInterfaceQueue() call up one line before the Stop() call, and not an introduce an artificial pause between stopping the index and destroying the index. There should be no need for an extra pause between stopping and destroying. Adding the pause in the wrong place in the test could potentially mask real bugs, so better to add it in the right place one line up.

I also think the code comment "Make sure coin_stats_index is not used by the scheduler thread" and PR description "fix race in shutdown code" are a little misleading, because this is not really fixing a race condition. This is working around TSAN false positive errors that happen because TSAN doesn't understand our BaseIndex::BlockUntilSyncedToCurrentChain and BaseIndex::BlockConnected functions.

TSAN alerts about a race condition because it doesn't see the test thread which the destroys the object explicitly waiting for the scheduler thread which calls BlockConnected. But this is because, as an optimization, our BlockUntilSyncedToCurrentChain function skips calling SyncWithValidationInterfaceQueue when m_best_block_index matches the chain tip. This is a valid optimization as long as the BaseIndex::BlockConnected implementation does not do any reads or writes to *this after it updates m_best_block_index. This was the case before f08c9fb from #21726 did add a real race bug by making AllowPrune() and GetName() calls after updating m_best_block_index. I think a fix for this could look like:

--- a/src/index/base.cpp
+++ b/src/index/base.cpp
@@ -414,10 +414,18 @@ IndexSummary BaseIndex::GetSummary() const
 void BaseIndex::SetBestBlockIndex(const CBlockIndex* block) {
     assert(!node::fPruneMode || AllowPrune());
 
-    m_best_block_index = block;
     if (AllowPrune() && block) {
         node::PruneLockInfo prune_lock;
         prune_lock.height_first = block->nHeight;
         WITH_LOCK(::cs_main, m_chainstate->m_blockman.UpdatePruneLock(GetName(), prune_lock));
     }
+    // For thread safety and compatibility with m_best_block_index optimization
+    // in the BlockUntilSyncedToCurrentChain(), update m_best_block_index as
+    // the last step in this function, and avoid updating state or accessing
+    // *this after m_best_block_index is set. If state is updated after
+    // m_best_block_index, external threads won't be able to rely on
+    // BlockUntilSyncedToCurrentChain() to be in sync with that state. And if
+    // *this is accessed, external threads won't be able to use
+    // BlockUntilSyncedToCurrentChain() to safely destroy the index object.
+    m_best_block_index = block;
 }

For this commit, I think a more accurate comment explaining the SyncWithValidationInterfaceQueue(); call would say something like "To avoid false positive TSAN errors, explicitly signal the test thread from the validationinterface thread so TSAN does not think there is a race between index shutdown code and BlockConnected notification code. The BlockUntilSyncedToCurrentChain() call above is sufficient to avoid these races in reality, but TSAN can't detect this due to an optimization in that function."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working around TSAN false positive errors

No, this is a real issue (use-after-free), see steps to reproduce without tsan (for example with valgrind): #25365 (comment)

Copy link
Contributor

@ryanofsky ryanofsky Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working around TSAN false positive errors

No, this is a real issue (use-after-free), see steps to reproduce without tsan (for example with valgrind): #25365 (comment)

Yes I did see that comment, but it was not reproducing an existing race condition, it was adding a new race condition by calling this->GetName() after the m_best_block_index = block assignment. If the m_best_block_index = block; assignment was done last, there would be no race.

My comment above may be a little muddled because I noticed the real race condition caused by f08c9fb from #21726 in the middle of writing it. I think we should fix the real race condition in non-test code making a change to SetBestBlockIndex like the one I suggested above. After we do that, the the current change 3ae1fe4 in this PR will just be silencing TSAN errors, not avoiding a real race condition anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was not reproducing an existing race condition, it was adding a new race condition by calling this->GetName() after the m_best_block_index = block assignment

Ok, but see also #25365 (comment). I could reproduce use-after-free in Valgrind even without adding a call to GetName(), just a sleep suffices.

I got the impression that when BaseIndex::SetBestBlockIndex() executes this is already freed, but I do not remember if I checked to access a member before the m_best_block_index assignment.

Moving the assignment to the bottom may fix it. Anyway, this whole story looks overly complicated to me. It shouldn't require multiple people discussing whether this is a bug or not and how to fix it exactly! I opened #26210 to explore an alternative.

Copy link
Contributor

@ryanofsky ryanofsky Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but see also #25365 (comment). I could reproduce use-after-free in Valgrind even without adding a call to GetName(), just a sleep suffices.

Right. Your reproduction puts the sleep right before the UpdatePruneLock(GetName(), ...) call, so it does not need to add a new call to GetName() and does reveal an existing race condition in the test. Marco's reproduction puts that sleep after that and adds another GetName() call, so it is really adding a new race condition.

I got the impression that when BaseIndex::SetBestBlockIndex() executes this is already freed, but I do not remember if I checked to access a member before the m_best_block_index assignment.

I don't think this is possible because BlockUntilSyncedToCurrentChain would not have returned before the m_best_block_index assignment was done.

Moving the assignment to the bottom may fix it. Anyway, this whole story looks overly complicated to me. It shouldn't require multiple people discussing whether this is a bug or not and how to fix it exactly! I opened #26210 to explore an alternative.

Yes I agree moving the assignment to the bottom is not the best long term solution for races generally. I just think moving the assignment is the most direct fix for the race condition introduced in #21726. Also I think it makes code more logical and seems like a reliability improvement. I opened a new PR #26215 which moves the assignment, and I think it would be the way to address the real race condition, and turn the TSAN true positives into false positives. After that, this PR would be a good way to fix remaining false positives (again with the suggestion that I think it would be better for test shutdown sequence to be 1-flush 2-stop 3-destroy not 1-stop 2-flush 3-destroy so test behavior is closer to bitcoind shutdown behavior).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I this is a false positive, I wonder what the steps are to reproduce. After all tsan should be deterministic, assuming manual sleeps are added at the right places?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I this is a false positive, I wonder what the steps are to reproduce. After all tsan should be deterministic, assuming manual sleeps are added at the right places?

If the goal is to reproduce TSAN false positive errors that remain after #26215 fixes the race condition introduced by #21726, maybe a good way would be to put a 2 second sleep at the end of SetBestBlockIndex after it assigns m_best_block_index and 1 second sleep in the test before it calls BlockUntilSyncedToCurrentChain. This way the TSAN should see the scheduler thread reading the vtable ptr, and the test thread writing it, and no apparent synchronization between the two threads because BlockUntilSyncedToCurrentChain will return early without ever calling SyncWithValidationInterfaceQueue

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Sep 30, 2022
Since commit f08c9fb from PR
bitcoin#21726, the
`BlockUntilSyncedToCurrentChain` behavior has been less reliable and also
introduced a race condition in the coinstatsindex_initial_sync unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index.  But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
bitcoin#25365 (comment)

This commit fixes the `coinstatsindex_initial_sync` race condition, even though
it will require an additional change to silence TSAN false positives,
bitcoin#26188, after it is fixed. So this
partially addresses but does not resolve the bug reporting TSAN errors
bitcoin#25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>
@vasild vasild force-pushed the fix_coinstatsindex_initial_sync branch from 3ae1fe4 to d2395c8 Compare October 4, 2022 10:40
@vasild
Copy link
Contributor Author

vasild commented Oct 4, 2022

3ae1fe4498...d2395c83c4: rebase + move SyncWithValidationInterfaceQueue() before coin_stats_index.Stop() as suggested above.

I think this should be good to go, unrelated to other PRs, as a minimal change that fixes the problem, to be included in 24.x.

@vasild
Copy link
Contributor Author

vasild commented Oct 4, 2022

d2395c83c4..c4566600be: use same shutdown order also in txindex_tests.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK c456660. But it would be better if #26215 were merged before this PR, because unlike this PR, #26215 is a direct fix for the race condition introduced in #21726. After #26215, this PR would be easier to understand as a followup that addresses TSAN false positive errors and makes test shutdown sequences consistent.

I don't think it would be good to merge this PR before #26215, because it would only be fixing the race condition indirectly instead of directly, and it would be adding misleading comments about the purpose of the SyncWithValidationInterfaceQueue call following the BlockUntilSyncedToCurrentChain call. The new SyncWithValidationInterfaceQueue call shouldn't be necessary for anything other than making TSAN happy.

Comment on lines 82 to 80
// Make sure coin_stats_index is not used by the scheduler thread after it
// has been destroyed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I this is a false positive, I wonder what the steps are to reproduce. After all tsan should be deterministic, assuming manual sleeps are added at the right places?

If the goal is to reproduce TSAN false positive errors that remain after #26215 fixes the race condition introduced by #21726, maybe a good way would be to put a 2 second sleep at the end of SetBestBlockIndex after it assigns m_best_block_index and 1 second sleep in the test before it calls BlockUntilSyncedToCurrentChain. This way the TSAN should see the scheduler thread reading the vtable ptr, and the test thread writing it, and no apparent synchronization between the two threads because BlockUntilSyncedToCurrentChain will return early without ever calling SyncWithValidationInterfaceQueue

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Oct 5, 2022
Since commit f08c9fb from PR
bitcoin#21726, index
`BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has
also been a race condition in the `coinstatsindex_initial_sync` unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index. But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
bitcoin#25365 (comment)

This commit fixes the `coinstatsindex_initial_sync` race condition, even though
it will require an additional change to silence TSAN false positives,
bitcoin#26188, after it is fixed. So this
partially addresses but does not resolve the bug reporting TSAN errors
bitcoin#25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>
fanquake added a commit to bitcoin-core/gui that referenced this pull request Oct 10, 2022
…edToCurrentChain reliability

8891949 index: Improve BaseIndex::BlockUntilSyncedToCurrentChain reliability (Ryan Ofsky)

Pull request description:

  Since commit f08c9fb from PR bitcoin/bitcoin#21726, index  `BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has also been a race condition in the `coinstatsindex_initial_sync` unit test.

  It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the last connected block to be fully processed, than to be able to return before prune locks are set, so this switches the order of `m_best_block_index = block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more reliable.

  Also since commit f08c9fb, there has been a race condition in the `coinstatsindex_initial_sync` test. Before that commit, the atomic index best block pointer `m_best_block_index` was updated as the last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain` could safely be used in tests to wait for the last `BlockConnected` notification to be finished before stopping and destroying the index.  But after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer sufficient, and there is a race between the test shutdown code which destroys the index object and the new code introduced in that commit calling `AllowPrune()` and `GetName()` on the index object. Reproducibility instructions for this are in bitcoin/bitcoin#25365 (comment)

  This commit fixes the `coinstatsindex_initial_sync` race condition, even though it will require an additional change to silence TSAN false positives, bitcoin/bitcoin#26188, after it is fixed. So this partially addresses but does not resolve the bug reporting TSAN errors bitcoin/bitcoin#25365.

  There is no known race condition outside of test code currently, because the bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not `BlockUntilSyncedToCurrentChain` to safely shut down.

  Co-authored-by: vasild
  Co-authored-by: MarcoFalke

ACKs for top commit:
  mzumsande:
    re-ACK 8891949

Tree-SHA512: 52e29e3772a0c92873c54e5ffb31dd66a909b68a2031b7585713cd1d976811289c98bd9bb41679a8689062f03be4f97bb8368696e789caa4607c2fd8b1fe289b
@fanquake
Copy link
Member

#26215 has been merged. @vasild can you rebase/follow up with @ryanofsky's recent comments here.

@vasild vasild force-pushed the fix_coinstatsindex_initial_sync branch from c456660 to 26bef29 Compare October 10, 2022 13:29
@vasild
Copy link
Contributor Author

vasild commented Oct 10, 2022

c4566600be...26bef2957f: rebase and adjust after #26215

Note - I cannot reproduce this with the steps from #26188 (comment).

@maflcko
Copy link
Member

maflcko commented Oct 10, 2022

Update title/OP?

@vasild vasild changed the title test: fix race in shutdown code in coinstatsindex_initial_sync test: silence TSAN false positive in coinstatsindex_initial_sync Oct 10, 2022
@vasild
Copy link
Contributor Author

vasild commented Oct 10, 2022

Update title/OP?

Done, sorry.

@maflcko
Copy link
Member

maflcko commented Oct 10, 2022

I can't reproduce this either

@maflcko maflcko removed this from the 24.0 milestone Oct 10, 2022
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Oct 10, 2022
…entChain reliability

8891949 index: Improve BaseIndex::BlockUntilSyncedToCurrentChain reliability (Ryan Ofsky)

Pull request description:

  Since commit f08c9fb from PR bitcoin#21726, index  `BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has also been a race condition in the `coinstatsindex_initial_sync` unit test.

  It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the last connected block to be fully processed, than to be able to return before prune locks are set, so this switches the order of `m_best_block_index = block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more reliable.

  Also since commit f08c9fb, there has been a race condition in the `coinstatsindex_initial_sync` test. Before that commit, the atomic index best block pointer `m_best_block_index` was updated as the last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain` could safely be used in tests to wait for the last `BlockConnected` notification to be finished before stopping and destroying the index.  But after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer sufficient, and there is a race between the test shutdown code which destroys the index object and the new code introduced in that commit calling `AllowPrune()` and `GetName()` on the index object. Reproducibility instructions for this are in bitcoin#25365 (comment)

  This commit fixes the `coinstatsindex_initial_sync` race condition, even though it will require an additional change to silence TSAN false positives, bitcoin#26188, after it is fixed. So this partially addresses but does not resolve the bug reporting TSAN errors bitcoin#25365.

  There is no known race condition outside of test code currently, because the bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not `BlockUntilSyncedToCurrentChain` to safely shut down.

  Co-authored-by: vasild
  Co-authored-by: MarcoFalke

ACKs for top commit:
  mzumsande:
    re-ACK 8891949

Tree-SHA512: 52e29e3772a0c92873c54e5ffb31dd66a909b68a2031b7585713cd1d976811289c98bd9bb41679a8689062f03be4f97bb8368696e789caa4607c2fd8b1fe289b
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 26bef29. I do think this change improves the tests, even if we can't reproduce the original TSAN errors anymore. The PR fixes a misleading comment in the txindex_initial_sync test, and makes shutdown order in that test more closely match bitcoind shutdown order. It also makes the coinstatsindex_initial_sync test consistent with the txindex_initial_sync test.

If we are sure there are no TSAN false positive errors here, a potentially simpler alternative to this PR is to remove the SyncWithValidationInterfaceQueue calls and comments from both tests. It's possible TSAN is taking atomic reads and writes into account and can infer that the test thread is waiting for the notification thread because it is checking the atomic m_best_block_index variable. That could explain why sleeps suggested in #26188 (comment) don't cause errors, even if the sleeps do cause BlockUntilSyncedToCurrentChain to return early without calling SyncWithValidationInterfaceQueue.

fanquake pushed a commit to fanquake/bitcoin that referenced this pull request Oct 11, 2022
Since commit f08c9fb from PR
bitcoin#21726, index
`BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has
also been a race condition in the `coinstatsindex_initial_sync` unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index. But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
bitcoin#25365 (comment)

This commit fixes the `coinstatsindex_initial_sync` race condition, even though
it will require an additional change to silence TSAN false positives,
bitcoin#26188, after it is fixed. So this
partially addresses but does not resolve the bug reporting TSAN errors
bitcoin#25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>

Github-Pull: bitcoin#26215
Rebased-From: 8891949
@vasild vasild force-pushed the fix_coinstatsindex_initial_sync branch from 26bef29 to 861cb3f Compare October 11, 2022 07:48
@vasild
Copy link
Contributor Author

vasild commented Oct 11, 2022

26bef2957f...861cb3fadc: reword the comments

... remove the SyncWithValidationInterfaceQueue calls and comments from both tests ...

I am also ok with this, since it is a test-only and a false positive, to leave it "unfixed" until it happens again in CI, if ever. @MarcoFalke, what do you think?

@maflcko
Copy link
Member

maflcko commented Oct 11, 2022

review ACK 861cb3f

I think anything is fine here. Either keep both or delete both.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 861cb3f. Just comment change since last review.

Looks like this is ready to merge and issue #25365 is resolved.

@fanquake fanquake merged commit 422efca into bitcoin:master Oct 13, 2022
@vasild vasild deleted the fix_coinstatsindex_initial_sync branch October 13, 2022 08:53
adam2k pushed a commit to adam2k/bitcoin that referenced this pull request Oct 19, 2022
Since commit f08c9fb from PR
bitcoin#21726, index
`BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has
also been a race condition in the `coinstatsindex_initial_sync` unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index. But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
bitcoin#25365 (comment)

This commit fixes the `coinstatsindex_initial_sync` race condition, even though
it will require an additional change to silence TSAN false positives,
bitcoin#26188, after it is fixed. So this
partially addresses but does not resolve the bug reporting TSAN errors
bitcoin#25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Oct 23, 2022
…dex_initial_sync

861cb3f test: move SyncWithValidationInterfaceQueue() before Stop() in txindex_tests (Vasil Dimov)
6526dc3 test: silence TSAN false positive in coinstatsindex_initial_sync (Vasil Dimov)

Pull request description:

  Silence false positives from TSAN about unsynchronized calls to `BaseIndex::~BaseIndex()` and `BaseIndex::SetBestBlockIndex()`. They are synchronized, but beyond the comprehension of TSAN - by `SyncWithValidationInterfaceQueue()`, called from `BaseIndex::BlockUntilSyncedToCurrentChain()`.

  Fixes bitcoin#25365

ACKs for top commit:
  MarcoFalke:
    review ACK 861cb3f
  ryanofsky:
    Code review ACK 861cb3f. Just comment change since last review.

Tree-SHA512: 8c30fdf2fd11d54e9adfa68a67185ab820bd7bd9f7f3ad6456e7e6d219fa9cf6d34b41e98e723eae86cb0c1baef7f3fc57b1b011a13dc3fe3d78334b9b5596de
janus pushed a commit to BitgesellOfficial/bitgesell that referenced this pull request Jan 20, 2023
Since commit f08c9fb0c6a799e3cb75ca5f763a746471625beb from PR
bitcoin/bitcoin#21726, index
`BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has
also been a race condition in the `coinstatsindex_initial_sync` unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb0c6a799e3cb75ca5f763a746471625beb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index. But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
bitcoin/bitcoin#25365 (comment)

This commit fixes the `coinstatsindex_initial_sync` race condition, even though
it will require an additional change to silence TSAN false positives,
bitcoin/bitcoin#26188, after it is fixed. So this
partially addresses but does not resolve the bug reporting TSAN errors
bitcoin/bitcoin#25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>
aguycalled pushed a commit to nav-io/navio-core that referenced this pull request Jan 25, 2023
* doc: add info about status code 404 for some rest endpoints

* refactor: remove duplicate code from BlockAssembler

* build: Fix `capnp` package build for Android

* build: Specify native binaries explicitly when building `capnp` package

From `configure --help`:
  --with-external-capnp   use the system capnp binary (or the one specified
                          with $CAPNP) instead of compiling a new one (useful
                          for cross-compiling)

* net: remove useless call to IsReachable() from CConnman::Bind()

`CConnman::Bind()` is called without `BF_EXPLICIT` only when passed
either `0.0.0.0` or `::`. For those addresses `IsReachable()` is always
true (regardless of the `-onlynet=` setting!), meaning that the `if`
condition never evaluates to true.

`IsReachable()` is always true for the "any" IPv4 and IPv6 addresses
because `CNetAddr::GetNetwork()` returns `NET_UNROUTABLE` instead of
`NET_IPV4` or `NET_IPV6` and the network `NET_UNROUTABLE` is always
considered reachable.

It follows that `BF_EXPLICIT` is unnecessary, remove it too.

* sync: simplify MaybeCheckNotHeld() definitions by using a template

Reduce 4 of the `MaybeCheckNotHeld()` definitions to 2 by using a
template. This also makes the function usable for other
[BasicLockable](https://en.cppreference.com/w/cpp/named_req/BasicLockable)
types.

* sync: remove unused template parameter from ::UniqueLock

The template parameter `typename Base = typename Mutex::UniqueLock` is
not used, so remove it. Use internally defined type `Base` to avoid
repetitions of `Mutex::UniqueLock`.

* scripted-diff: Rename time symbols

-BEGIN VERIFY SCRIPT-

 ren() { sed -i "s:\<$1\>:$2:g" $(git grep -l "\<$1\>" ':(exclude)src/versionbits.cpp') ; }

 ren nStart                 time_start
 ren nTimeStart             time_start
 ren nTimeReadFromDiskTotal time_read_from_disk_total
 ren nTimeConnectTotal      time_connect_total
 ren nTimeFlush             time_flush
 ren nTimeChainState        time_chainstate
 ren nTimePostConnect       time_post_connect
 ren nTimeCheck             time_check
 ren nTimeForks             time_forks
 ren nTimeConnect           time_connect
 ren nTimeVerify            time_verify
 ren nTimeUndo              time_undo
 ren nTimeIndex             time_index
 ren nTimeTotal             time_total
 ren nTime1                 time_1
 ren nTime2                 time_2
 ren nTime3                 time_3
 ren nTime4                 time_4
 ren nTime5                 time_5
 ren nTime6                 time_6

 ren nBlocksTotal num_blocks_total

 # Newline after semicolon
 perl -0777 -pi -e 's/; time_connect_total/;\n        time_connect_total/g' src/validation.cpp
 perl -0777 -pi -e 's/; time_/;\n    time_/g'                               src/validation.cpp

-END VERIFY SCRIPT-

* Use steady clock for bench logging

* doc: add historical 0.20.2 release notes

* doc: add historical 0.21.2 release notes

* build: split ARM crc & crypto extension checks

We currently perform the same check twice, to put the same set of flags
in two different variables. Split the checks so we test for crc and crypto
extensions independently.

If we don't want to split, we should just delete the second AX_CHECK_COMPILE_FLAG
check, and set ARM_CRC_CXXFLAGS & ARM_CRC_CXXFLAGS at the same time.

* refactor: Do not discard `try_lock()` return value

Microsoft's C++ Standard Library uses the `[[nodiscard]]` attribute for
`try_lock()`.
See: https://github.com/microsoft/STL/blob/main/stl/inc/mutex

This change allows to drop the current suppression for the warning C4838
and helps to prevent the upcoming warning C4858.
See: microsoft/STL@539c26c

* signet/miner: reduce default interblock interval limit to 30min

Also allow the operator to change it, if desired, without having
to edit the code.

* Squashed 'src/leveldb/' changes from 22f1e4a02f..e2f10b4e47

e2f10b4e47 Merge bitcoin-core/leveldb-subtree#34: win32: fix -Wmissing-field-initializers warnings
12c52b392d win32: fix -Wmissing-field-initializers warnings

git-subtree-dir: src/leveldb
git-subtree-split: e2f10b4e47bc950a81bc96d1c6db3a8048216642

* ci: Bump vcpkg to the latest version `2022.09.27`

Dependency changes (2022.06.16.1 - 2022.09.27):
 - boost 1.79.0#0 -> 1.80.0#0
 - sqlite3 3.37.2#1 -> 3.39.2#0
 - zeromq 4.3.4#5 -> 4.3.4#6

* contrib: Fix capture_output in getcoins.py

Our required Python version 3.6.12 does not support `capture_output` as
a subprocess.run argument; this was added in python 3.7.

We can emulate it by setting stdout and stderr to subprocess.PIPE

* fuzz: Limit outpoints.size in txorphan target to avoid OOM

* refactor: move Boost datetime usage to wallet

This means we don't need datetime in a --disable-wallet build, and it
isn't included in the kernel.

* wallet: Use correct effective value when checking target

* test: Check external coin effective value is used in CoinSelection

* test: Use proper Boost macros instead of assertions

* ci: Run `bench_bitcoin.exe --sanity-check` in "Win64 native" task

Also a better name used for the script as it follows GNU's `make check`.

* doc: bump bips.md up-to-date version to v24.0

This is a trivial follow-up to bitcoin#26124.

* refactor: Drop `owns_lock()` call

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>

* refactor: move DEFAULT_TXINDEX from validation to txindex

* refactor: move DEFAULT_COINSTATSINDEX from validation to coinstatsindex

* refactor: move DEFAULT_BLOCKFILTERINDEX from val to blockfilterindex

* kernel: remove util/bytevectorhash.cpp

* ci: Move `git config` commands into script where they are used

* ci: Use same `merge_script` implementation for Windows as for all

* ci: Remove unused package

Address feedback from https://github.com/bitcoin/bitcoin/pull/24561/files#r985719812

* ci: Allow PIP_PACKAGES on centos

This was added in 7fc5e86 but I can't
see a reason why this should be forbidden.

* test: Remove unused fCheckpointsEnabled from miner_tests

The earliest checkpoint is at height 11111, so this can't possibly have
any impact on this test.

* build, msvc: Enable C4834 warning

https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/c4834

* kernel: move RunCommandParseJSON to its own file

Because libbitcoinkernel does not include this new object, this has the
side-effect of eliminating the unnecessary boost::process dependency.

* ci: Workaround Windows filesystem executable bit loss

* fuzz: add util/mempool/h.cpp

Moving the mempool code (Boost) out of util.h, results in a ~10% speedup
(for me) when compiling the fuzz tests.

* fuzz: pass max fee into ConsumeTxMemPoolEntry

* refactor: Make 64-bit shift explicit

Also this change enables MSVC warning C4334 for all codebase.

https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-3-c4334

* refactor: move run_command from util to common

Quoting ryanofsky: "util can be the library for things included in the kernel
which the kernel can depend on, and common can be the library for other code
that needs to be shared internally, but should not be part of the kernel or
shared externally."

* Remove clang-format from lint task

clang-format could be used in scripted diffs, but remained largely
unused.

* test: Pass mempool reference to AssemblerForTest

* test: Use dedicated mempool in TestPrioritisedMining

No need for a shared mempool. Also remove unused chainparams parameter.

* test: Use dedicated mempool in TestPackageSelection

No need for a shared mempool. Also remove unused chainparams parameter.

* test: Use dedicated mempool in TestBasicMining

No need for a shared mempool. Also remove unused chainparams parameter.

Can be reviewed with --ignore-all-space

* refactor: mempool: add MemPoolLimits::NoLimits()

There are quite a few places in the codebase that require us to
construct a CTxMemPool without limits on ancestors and descendants.
This helper function allows us to get rid of all that duplication.

* refactor: mempool: use CTxMempool::Limits

Simplifies function signatures by removing repetition of all the
ancestor/descendant limits,  and increases readability by being
more verbose by naming the limits, while still reducing the LoC.

* test: use NoLimits() in MempoolIndexingTest

The (100, 1000000, 1000, 1000000) limits are arbitrarily high and
don't restrict anything, they are just meant to calculate ancestors
properly. Using NoLimits() makes this intent more clear and simplifies
the code.

* docs: improve docs where MemPoolLimits is used

* Remove unused CDataStream::rdbuf method

It is unused and seems unlikely to be ever used.

* index: Improve BaseIndex::BlockUntilSyncedToCurrentChain reliability

Since commit f08c9fb from PR
bitcoin#21726, index
`BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has
also been a race condition in the `coinstatsindex_initial_sync` unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index. But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
bitcoin#25365 (comment)

This commit fixes the `coinstatsindex_initial_sync` race condition, even though
it will require an additional change to silence TSAN false positives,
bitcoin#26188, after it is fixed. So this
partially addresses but does not resolve the bug reporting TSAN errors
bitcoin#25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>

* test: Prevent UB in `minisketch_tests.cpp`

* test: Remove confusing DUMMY_P2WPKH_SCRIPT

* docs: fix m_children to be a member of CTxMemPoolEntry

* wallet: have prune error take precedence over assumedvalid

From Russ Yanofsky:

"Agree with all of Marco's points here and think this should be updated

If havePrune and hasAssumedValidChain are both true, better to show
havePrune error message.  Assumed-valid error message is vague and not
very actionable.  Would suggest "Error loading wallet. Wallet requires
blocks to be downloaded, and software does not currently support loading
wallets while blocks are being downloaded out of order though assumeutxo
snapshots. Wallet should be able to load successfully after node sync
reaches height {block_height}"

Co-authored-by: MacroFake <MarcoFalke@gmail.com>
Co-authored-by: Russ Yanofsky <russ@yanofsky.org>

* Fix comment typos in scriptpubkeyman.cpp, wallet.cpp, wallet.h

Fix comment typos:
sigature -> signature
ponter -> pointer
it's key -> its key

* sync: avoid confusing name overlap (Mutex)

Use `MutexType` instead of `Mutex` for the template parameter of
`UniqueLock` because there is already a class named `Mutex` and the
naming overlap is confusing. `MutexType` is used elsewhere in `sync.h`.

* sync: remove DebugLock alias template

Use `UniqueLock` directly. Type deduction works just fine from the first
argument to the constructor of `UniqueLock`, so there is no need to
repeat

```cpp
UniqueLock<typename std::remove_reference<typename std::remove_pointer<decltype(cs)>::type>::type>
```

five times in the `LOCK` macros. Just `UniqueLock` suffices.

* sync: rename AnnotatedMixin::UniqueLock to AnnotatedMixin::unique_lock

This avoids confusion with the global `UniqueLock` and the snake case
is consistent with `UniqueLock::reverse_lock.

* iwyu: Add zmq source files

* Update .cirrus.yml

Co-authored-by: brunoerg <brunoely.gc@gmail.com>
Co-authored-by: James O'Beirne <james.obeirne@pm.me>
Co-authored-by: Hennadii Stepanov <32963518+hebasto@users.noreply.github.com>
Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>
Co-authored-by: fanquake <fanquake@gmail.com>
Co-authored-by: Anthony Towns <aj@erisian.com.au>
Co-authored-by: willcl-ark <will8clark@gmail.com>
Co-authored-by: Aurèle Oulès <aurele@oules.com>
Co-authored-by: Sebastian Falbesoner <sebastian.falbesoner@gmail.com>
Co-authored-by: glozow <gloriajzhao@gmail.com>
Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
Co-authored-by: stickies-v <stickies-v@protonmail.com>
Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
Co-authored-by: MacroFake <MarcoFalke@gmail.com>
Co-authored-by: Russ Yanofsky <russ@yanofsky.org>
Co-authored-by: Dimitris Tsapakidis <dimitris@tsapakidis.com>
Co-authored-by: alex v <alex@encrypt-s.com>
@bitcoin bitcoin locked and limited conversation to collaborators Oct 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) in BaseIndex
5 participants