Skip to content

Conversation

paulyc
Copy link

@paulyc paulyc commented Jan 3, 2020

Fix a very longstanding bug wherein the txindex LevelDB was not being closed properly when stopping bitcoind,
which is done by leveldb::DB::~DB(), ie., you need to call delete once for every DB* you create.
This was causing the txindexing progress to have fallen behind to its last random checkpoint the next time
bitcoind was started.

Fix a very longstanding bug wherein the txindex LevelDB was not being closed properly when stopping bitcoind,
which is done by leveldb::DB::~DB(), ie., you need to call delete once for every DB* you create.
This was causing the txindexing progress to have fallen behind to its last random checkpoint the next time
bitcoind was started.
TxIndex::~TxIndex() {}
TxIndex::~TxIndex()
{
m_db.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? The pointer should be reset when the TxIndex is destructed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, it should, (shouldn't it???) (clearly both C++ and myself are too old for me to be writing it), so it would seem that, at least in some corner case, ~TxIndex isn't getting called at all. So I will close this for now and reopen it if I figure out what's going on further investigation. Thanks

Copy link
Author

@paulyc paulyc Jan 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, but wouldn't really know, if the default destructor call isn't somehow ordered, or being reordered, such that it doesn't finish before the process exits, and the explicit call here maybe allows it to finish, but I'd still have to test it more and maybe figure out a way to write an integration test that verifies all the leveldbs are properly closed (or not). Will investigate further

Copy link
Author

@paulyc paulyc Jan 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Particularly since leveldb is rather poorly designed imo to put some very important closing the database code in the destructor, since, after all, the compiler (and programmer, and OS), don't necessarily know or care that any of those objects still in memory that are about to be wiped out by the OS when the process exits, actually do important things that need to happen before the process does exit! And thus, quite often, do not bother actually destructing any of those objects, since nobody cares, (but this is exactly why they should care), about memory that leaks when the process is about to exit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly seems like the patch works, but, I can neither explain why nor prove that it does definitively, so, not sure if that quite reaches the merge bar, in light of it being such a minor change to such a minor subsystem that isn't functionally broken, rather just a little bit inconvenient having to wait a while, while it reindexes the same blocks ;) but it's been bugging me (no pun intended) for years..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the issue you are seeing? Any stacktraces or anything?

Reminds me of #13894, #14071, #15410

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, was just able to verify that this patch does not in fact mysteriously fix the issue, so it remains mysterious at the moment.

What happens is, if you interrupt bitcoind in the middle of a sync with txindex on, the next time you run it, the txindexer has suddenly seemed to have fallen many blocks behind where it was when you interrupted the first process, for whatever reason that is yet to be determined, but the chainstate is always properly synced and closed so there must be a hint in comparing how the two are used differently.

So, when the txindex thread restarts it wastes a bunch of time indexing blocks it has already indexed, which you can tell because no new data is written to the leveldb until it catches back up since the keys and values in the txindex don't change. Will file a bug report if i can't find an existing one.

I haven't seen any crashes or anything worse than wasting some time and disk i/o, so it's not a real big deal in the grand scheme of the txindexer's overall time, just slows down syncing when you do restart it until the indexer catches back up, so if you're frequently stopping and starting the sync like because you have a magnetic spinning disk and it's your only disk, it can get a little irritating. But then that's always irritating 😂 thanks for the info

@paulyc paulyc closed this Jan 3, 2020
@paulyc
Copy link
Author

paulyc commented Jan 4, 2020

Whoa i dunno what just happened. Was just adding a comment but i inadvertently hit "close conversation" not as if i know what that button does but everything disappeared. Good thing i always write everything in a text editor because I'm so used to Android killing my apps and taking my drafts with them i dont like any ux that just deletes what you were working on because you hit one button. People accidentally hit buttons all the time. Anyway back on subject:

Ah yes, was just able to verify that this patch does not in fact mysteriously fix the issue, so it remains mysterious at the moment.

What happens is, if you interrupt bitcoind in the middle of a sync with txindex on, the next time you run it, the txindexer has suddenly seemed to have fallen many blocks behind where it was when you interrupted the first process, for whatever reason that is yet to be determined, but the chainstate is always properly synced and closed so there must be a hint in comparing how the two are used differently.

So, when the txindex thread restarts it wastes a bunch of time indexing blocks it has already indexed, which you can tell because no new data is written to the leveldb until it catches back up since the keys and values in the txindex don't change. Will file a bug report if i can't find an existing one.

I haven't seen any crashes or anything worse than wasting some time and disk i/o, so it's not a real big deal in the grand scheme of the txindexer's overall time, just slows down syncing when you do restart it until the indexer catches back up, so if you're frequently stopping and starting the sync like because you have a magnetic spinning disk and it's your only disk, it can get a little irritating. But then that's always irritating 😂 thanks for the info

@maflcko
Copy link
Member

maflcko commented Jan 6, 2020

What happens is, if you interrupt bitcoind in the middle of a sync with txindex on, the next time you run it, the txindexer has suddenly seemed to have fallen many blocks behind where it was when you interrupted the first process

Txindex is a background thread, so it might fall back arbitrarily. Though it shouldn't lose progress from where it was in normal operation.

@jimpo
Copy link
Contributor

jimpo commented Jan 7, 2020

@paulyc Oh, thank you, very nice catch. This is happening because the txindex (and the other indices) are stopped in Shutdown (init.cpp) before ForceFlushStateToDisk is called. ForceFlushStateToDisk triggers the ChainStateFlushed callback on the indices which commits the indexed state to disk. So probably the best solution is to move the index stops after that. Alternately, BaseIndex::Stop can just call Commit directly. Or do both, because I don't see a downside to either approach.

@paulyc
Copy link
Author

paulyc commented Jan 7, 2020

Ah, @jimpo thanks for tracking that down as well. Must not have had enough coffee or something when I submitted this, but glad it helped find the issue, which clearly had to be a little more complex. ;) In the immortal words of Samwise Gamgee all's well as ends well, or at least it will be when a working PR is merged. Not sure if I'll have time to do it soon, but it's been there long enough, it can wait a little longer and I can test it properly, (why they do the tests...), unless someone else has time to fit it in somewhere but I wouldn't know

fanquake added a commit that referenced this pull request Jan 22, 2020
…callback.

9dd58ca init: Stop indexes on shutdown after ChainStateFlushed callback. (Jim Posen)

Pull request description:

  Replaces #17852.

  Currently, the latest index state may not be committed to disk on shutdown. The state is committed on `ChainStateFlushed` callbacks and the current init order unregisters the indexes as validation interfaces before the final `ChainStateFlushed` callback is called on them.

  Issue identified by paulyc.

  For review: an alternative or supplemental solution would be to call `Commit` at the end of `BaseIndex::Stop`. I don't see any harm in doing so and it makes the less prone to user error. However, the destructor would have to be modified to not call `Stop` because `Commit` calls a virtual method, so I figured it wasn't worth it. But I'm curious how others feel.

ACKs for top commit:
  fjahr:
    tested ACK 9dd58ca
  paulyc:
    > Code review ACK [9dd58ca](9dd58ca), but failed to test because I can't reproduce the original problem.
  kallewoof:
    Tested ACK 9dd58ca
  promag:
    Code review ACK 9dd58ca, but failed to test because I can't reproduce the original problem.

Tree-SHA512: 2918380b699833cb7eab07456d1667dbf8ebbe2d2b5988300a3cf5b6a6cfc818b6d9086e1936ffe7881f67e409306c4b91d61a08a169cfd0a301383479d4f3cb
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jan 24, 2020
…lushed callback.

9dd58ca init: Stop indexes on shutdown after ChainStateFlushed callback. (Jim Posen)

Pull request description:

  Replaces bitcoin#17852.

  Currently, the latest index state may not be committed to disk on shutdown. The state is committed on `ChainStateFlushed` callbacks and the current init order unregisters the indexes as validation interfaces before the final `ChainStateFlushed` callback is called on them.

  Issue identified by paulyc.

  For review: an alternative or supplemental solution would be to call `Commit` at the end of `BaseIndex::Stop`. I don't see any harm in doing so and it makes the less prone to user error. However, the destructor would have to be modified to not call `Stop` because `Commit` calls a virtual method, so I figured it wasn't worth it. But I'm curious how others feel.

ACKs for top commit:
  fjahr:
    tested ACK 9dd58ca
  paulyc:
    > Code review ACK [9dd58ca](bitcoin@9dd58ca), but failed to test because I can't reproduce the original problem.
  kallewoof:
    Tested ACK 9dd58ca
  promag:
    Code review ACK 9dd58ca, but failed to test because I can't reproduce the original problem.

Tree-SHA512: 2918380b699833cb7eab07456d1667dbf8ebbe2d2b5988300a3cf5b6a6cfc818b6d9086e1936ffe7881f67e409306c4b91d61a08a169cfd0a301383479d4f3cb
sidhujag pushed a commit to syscoin-core/syscoin that referenced this pull request Nov 10, 2020
…lushed callback.

9dd58ca init: Stop indexes on shutdown after ChainStateFlushed callback. (Jim Posen)

Pull request description:

  Replaces bitcoin#17852.

  Currently, the latest index state may not be committed to disk on shutdown. The state is committed on `ChainStateFlushed` callbacks and the current init order unregisters the indexes as validation interfaces before the final `ChainStateFlushed` callback is called on them.

  Issue identified by paulyc.

  For review: an alternative or supplemental solution would be to call `Commit` at the end of `BaseIndex::Stop`. I don't see any harm in doing so and it makes the less prone to user error. However, the destructor would have to be modified to not call `Stop` because `Commit` calls a virtual method, so I figured it wasn't worth it. But I'm curious how others feel.

ACKs for top commit:
  fjahr:
    tested ACK 9dd58ca
  paulyc:
    > Code review ACK [9dd58ca](bitcoin@9dd58ca), but failed to test because I can't reproduce the original problem.
  kallewoof:
    Tested ACK 9dd58ca
  promag:
    Code review ACK 9dd58ca, but failed to test because I can't reproduce the original problem.

Tree-SHA512: 2918380b699833cb7eab07456d1667dbf8ebbe2d2b5988300a3cf5b6a6cfc818b6d9086e1936ffe7881f67e409306c4b91d61a08a169cfd0a301383479d4f3cb
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants