Skip to content

Conversation

furszy
Copy link
Member

@furszy furszy commented Jan 16, 2024

Discovered while was reviewing #29112, specifically #29112 (review).

If the db handler that initiated the database transaction is destroyed,
the ongoing transaction cannot be left dangling when the db txn fails
to abort. It must be forcefully reverted; otherwise, any subsequent
db handler executing a write operation will dump the dangling,
to-be-reverted transaction data to disk.

This not only breaks the isolation property but also results in the
improper storage of incomplete information on disk, impacting
the wallet consistency.

This PR fixes the issue by resetting the db connection, automatically
rolling back the transaction (per https://www.sqlite.org/c3ref/close.html)
when the handler object is being destroyed and the txn abortion failed.

Testing Notes
Can verify the failure by reverting the fix e5217fe and running the test.
It will fail without e5217fe and pass with it.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 16, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK ryanofsky, achow101

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29112 (sqlite: Disallow writing from multiple SQLiteBatchs by achow101)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@furszy furszy force-pushed the 2024_wallet_db_dangling_txn branch from fb9e1b3 to d4ebac8 Compare January 16, 2024 20:19
@achow101 achow101 added this to the 27.0 milestone Jan 16, 2024
@bitcoin bitcoin deleted a comment from moemat12 Jan 22, 2024
@ryanofsky
Copy link
Contributor

I'm trying to understand this, but confused about why executing "ROLLBACK TRANSACTION" would be expected to fail in sqlite. It seems like the fix in e5217fe is handling failure to roll back by trying to close and reopen the database. But I don't understand why rolling back a transaction would fail to begin with, unless there was some serious error like disk corruption. Probably to make the this fix clearer and safer it would be good to check the specific error code from the failed rollback instead of just checking res == SQLITE_OK. But I'm generally just confused.

Copy link
Member Author

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand this, but confused about why executing "ROLLBACK TRANSACTION" would be expected to fail in sqlite. It seems like the fix in e5217fe is handling failure to roll back by trying to close and reopen the database. But I don't understand why rolling back a transaction would fail to begin with, unless there was some serious error like disk corruption. Probably to make the this fix clearer and safer it would be good to check the specific error code from the failed rollback instead of just checking res == SQLITE_OK. But I'm generally just confused.

The current sqlite configuration doesn't allow concurrent database transactions (see #29112). This means that if there is a failure during the WalletBatch destruction (independently on if it is due to a fatal or transient and recoverable error, like a busy db or a loss of the db connection), the created transaction will remain active indefinitely, contrary to what all workflows expect. This situation provoques that any subsequent write operation on a new database handler (WalletBatch; it doesn't necessarily need to begin a new transaction, could also be a direct write) will dump information that was intended to be rolled back to disk.

Check #29112 (review) to see the test that originated this work.

@ryanofsky
Copy link
Contributor

if there is a failure during the WalletBatch destruction (independently on if it is due to a fatal or transient and recoverable error, like a busy db or a loss of the db connection)

Yes I've started reviewing both PR's and the code changes seem reasonable, I just don't understand what motivated this PR, because "failure during the WalletBatch destruction" seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs? Or did you just notice that the error handling in the destructor wasn't very good, and decide to fix it? Or can failures to roll back actually happen in more cases than it seems, and this is a bug that could happen during normal operation without data corruption?

@achow101
Copy link
Member

seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs?

It shouldn't ever happen.

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review d4ebac8. This seems like a nice change that could make the wallet more reliable, and adds good cleanup and test coverage.

Did not ack just because I noticed a bug in SQLiteBatch::TxnBegin if m_db is null, but none of my other comments are too important, so feel free to ignore them.

re: #29253 (comment)

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

Yeah that sounds pretty close to ideal. Maybe not an urgent problem to solve, but could be nice.

@furszy
Copy link
Member Author

furszy commented Jan 25, 2024

I just don't understand what motivated this PR, because "failure during the WalletBatch destruction" seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs? Or did you just notice that the error handling in the destructor wasn't very good, and decide to fix it? Or can failures to roll back actually happen in more cases than it seems, and this is a bug that could happen during normal operation without data corruption?

Yeah, I just noted that the error handling in the destructor isn't optimal. That if this ever occurs in #29112, the introduced database mutex would lock indefinitely.
Then, I considered the same event in current master: which could lead to the writing of a dangling transaction to disk, which would corrupt the wallet and leave it at a non-recoverable point. Which is quite bad. And that motivated me to split the PR and force the txn rollback within the WalletBatch context as a last resource safety measure.

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).
At least in this scenario, where we are destroying the object that created the transaction, I believe triggering the database connection reset is acceptable as a first measure. If it fails, then yeah, I agree that unloading the wallet instead of exiting the node would be desirable.

@furszy furszy force-pushed the 2024_wallet_db_dangling_txn branch from d4ebac8 to 493cfc7 Compare January 26, 2024 13:07
Copy link
Member Author

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated per feedback, thanks @ryanofsky!

The main modification resides in the ExecHandler class, which is now designed to encapsulate all sqlite functions. This enables us to trigger different errors across the database sources; letting us improve the error-handling code and expand its test coverage. Furthermore, future encapsulation of sqlite functions within this class will contribute to minimizing the impact of any future API changes.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 493cfc7. I left more more comments about comments (feel free to ignore), but otherwise this looks good.

re: #29253 (comment)

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).

In this case do we ever think we should see SQLITE_BUSY while trying to abandon a transaction? But yes in general we should treat expected errors and unexpected errors differently, and not handle expected errors by unloading the wallet.

Comment on lines 415 to 418
// If the transaction cannot be aborted, it may be due to the database connection being "busy", indicating a potential bug or data corruption.
// In such cases, attempt recovery by closing and reopening the database. This action will automatically roll back the transaction,
// ensuring that any changes made since the transaction was opened will be reverted. This prevents dangling transaction data
// outside the context of the 'SQLiteBatch' that initiated such a transaction.
Copy link
Contributor

@ryanofsky ryanofsky Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "sqlite: guard against dangling to-be-reverted db transactions" (0bf3a73)

I think this comment might be more harmful than helpful in its current form, because it isn't communicating the most important thing, which is that this code path is never expected to be hit under normal conditions, and if it is hit, it means something is seriously wrong and there has probably been data loss, if not data corruption. By saying this failure can happen when the connection is "busy", it makes it sound like database could just be waiting for a lock. But locking alone could never trigger this, so suggesting that it could raises doubts about the rest of the code, and makes the handling here seem to be unnecessary.

Also saying "this action will automatically rollback the transaction ensuring..." seems like an overstatement. Even if it's probably true, we are in the middle of recovering from some other significant bug or I/O error, so this comment seems to be going out on a limb to make a guarantee that isn't clearly significant. It is also unclear to me what "transaction data outside the context of the 'SQLiteBatch'" would refer to if SQLiteBatch is the class responsible for creating transactions.

Considering all this, I'd probably suggest still going with my original comment from #29253 (comment) which was "// If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back."

But maybe this comment is missing some other information that you wanted to get across?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also unclear to me what "transaction data outside the context of the 'SQLiteBatch'" would refer to if SQLiteBatch is the class responsible for creating transactions.

The purpose of this sentence was to describe what could happen if we don't rollback the transaction and explain why we rollback the transaction inside the SQLiteBatch destructor.

The complete sentence, "This prevents dangling transaction data outside the context of the 'SQLiteBatch' that initiated such a transaction," refers to the SQLiteBatch instance, not the class. Each SQLiteBatch can create a single transaction at a time; no concurrent SQLiteBatch instances are allowed (#29112). Thus, the code rolls back the db txn within the SQLiteBatch object destructor. This links the db txn to the SQLiteBatch instance lifecycle, isolating data changes and preventing other SQLiteBatch instances from dumping data that does not correspond to them.

But maybe this comment is missing some other information that you wanted to get across?

I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling. Also, wanted to mention that the error could be transient (so we don't forget about this possibility) and describe why we want to keep the transaction data changes isolated within the SQLiteBatch instance that initiated the transaction.


If, after reading this response, you still have doubts and find the comment confusing, I can replace it with your comment. No problem; the idea was to provide a clearer description of the situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #29253 (comment)

Thanks for clarifying.

I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling.

That seems ok. Would maybe suggest adding to the previously suggested comment:

  • // If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back and future transactions can succeed without committing old data.

Also, wanted to mention that the error could be transient (so we don't forget about this possibility)

This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.

If you want talk about a theoretical future state, in some future version of sqlite or future version of the wallet codebase where a rollback statement could return SQLITE_BUSY, I guess you could do that but it seems speculative and confusing if you don't distinguish that scenario with what is actually happening in the current code.

and describe why we want to keep the transaction data changes isolated within the SQLiteBatch instance that initiated the transaction.

I think that's mostly clear from the context but if you want to say more about it, the best place to put it would be on line 410 before the TxnAbort() call is made, to explain why the TxnAbort() call is being made and why it's important. Trying to get into why the rollback is important after the rolllback fails seems like going down a tangent and not putting information where it is most helpful.

Copy link
Member Author

@furszy furszy Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, applied your suggestion. Thanks!

This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.

I was also thinking about an I/O hardware malfunctioning error, which "could" be transient; sqlite has return codes for it. But it's probably better to abort the program if something like this ever happens.

@furszy furszy force-pushed the 2024_wallet_db_dangling_txn branch from 493cfc7 to 756bed0 Compare January 30, 2024 13:26
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 756bed0. New update consolidating the exec classes looks good.

Comment on lines 415 to 418
// If the transaction cannot be aborted, it may be due to the database connection being "busy", indicating a potential bug or data corruption.
// In such cases, attempt recovery by closing and reopening the database. This action will automatically roll back the transaction,
// ensuring that any changes made since the transaction was opened will be reverted. This prevents dangling transaction data
// outside the context of the 'SQLiteBatch' that initiated such a transaction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #29253 (comment)

Thanks for clarifying.

I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling.

That seems ok. Would maybe suggest adding to the previously suggested comment:

  • // If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back and future transactions can succeed without committing old data.

Also, wanted to mention that the error could be transient (so we don't forget about this possibility)

This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.

If you want talk about a theoretical future state, in some future version of sqlite or future version of the wallet codebase where a rollback statement could return SQLITE_BUSY, I guess you could do that but it seems speculative and confusing if you don't distinguish that scenario with what is actually happening in the current code.

and describe why we want to keep the transaction data changes isolated within the SQLiteBatch instance that initiated the transaction.

I think that's mostly clear from the context but if you want to say more about it, the best place to put it would be on line 410 before the TxnAbort() call is made, to explain why the TxnAbort() call is being made and why it's important. Trying to get into why the rollback is important after the rolllback fails seems like going down a tangent and not putting information where it is most helpful.

By encapsulating sqlite3_exec into its own standalone method
and introducing the 'SQliteExecHandler' class, we enable the
ability to test db statements execution failures within the
unit test framework.

This is used in the following-up commit to exercise a deadlock
and improve our wallet db error handling code.

Moreover, the future encapsulation of other sqlite functions
within this class will contribute to minimize the impact of
any future API changes.
Util function to clean up code and let us
verify, in the following-up commit, that dangling,
to-be-reverted db transactions cannot occur anymore.
If the handler that initiated the database transaction is destroyed,
the ongoing transaction cannot be left dangling when the db txn fails
to abort. It must be forcefully reversed; otherwise, any subsequent
db handler executing a write operation will dump the dangling,
to-be-reverted transaction data to disk.

This not only breaks the database isolation property but also results
in the improper storage of incomplete information on disk, impacting
the wallet consistency.
@furszy furszy force-pushed the 2024_wallet_db_dangling_txn branch from 756bed0 to b298242 Compare January 30, 2024 20:31
Copy link
Member Author

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated per feedback, thanks ryanofsky!
Docstrings were improved, and a bug related to a null SQLiteExecHandler has been fixed.
Small diff.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK b298242. Just fix for exec result codes and comment update since last review.

Overall this change should help wallet recover and not cause more problems if a rollback fails, and it adds some nice test coverage.

@@ -61,6 +71,8 @@ class SQLiteBatch : public DatabaseBatch
explicit SQLiteBatch(SQLiteDatabase& database);
~SQLiteBatch() override { Close(); }

void SetExecHandler(std::unique_ptr<SQliteExecHandler>&& handler) { m_exec_handler = std::move(handler); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "sqlite: add ability to interrupt statements" (dca874e)

It's a little better to pass plain unique_ptr instead of unique_ptr&&, because former guarantees unique_ptr argument will be moved-from and null afterward, while latter means the move may or may not happen depending on internals of the function. In general you want explicit && references for template programming and perfect forwarding, but should use values and & references otherwise.. Can see https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rf-conventional for guidelines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok noted, thanks!

@achow101
Copy link
Member

achow101 commented Jan 31, 2024

It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).
At least in this scenario, where we are destroying the object that created the transaction, I believe triggering the database connection reset is acceptable as a first measure.

If we do have recoverable errors, then we should be handling them, not hitting them with the hammer of closing and reopening the database.


This PR feels like it's the wrong approach to dealing with this particular database failure. It shouldn't be possible for a transaction commit or abort to currently fail in way that isn't catastrophic. In that case, we should be shutting the wallet rather than reopening the database connection and pretending that nothing happened. This could potentially lead to a state where the wallet's in-memory state does not match the database, which is really bad and could lead to loss of funds.

@ryanofsky
Copy link
Contributor

This PR feels like it's the wrong approach to dealing with this particular database failure. It shouldn't be possible for a transaction commit or abort to currently fail in way that isn't catastrophic. In that case, we should be shutting the wallet rather than reopening the database connection and pretending that nothing happened. This could potentially lead to a state where the wallet's in-memory state does not match the database, which is really bad and could lead to loss of funds.

It seems like if there is failure to rollback a transaction one way, with a rollback statement, this PR is just retrying again a different way, by closing and reopening the database. In theory the code could always roll back transactions by closing and reopening the database, so the new behavior in this PR doesn't seem worse than current behavior.

One way the PR seems better than the current code is that the current code just logs "SQLiteBatch: Batch closed and failed to abort transaction" if the rollback fails, which is not even an obvious error message, and it continues like nothing happened. The new code at least raises an exception if the rollback fails.

All of this seems very theoretical to me since I have no idea why a rollback statement would ever fail. But I think this PR seems reasonable, adds some test coverage, and probably is a small improvement even if better approaches are also possible.

@achow101
Copy link
Member

ACK b298242

@achow101 achow101 merged commit a01da41 into bitcoin:master Jan 31, 2024
@furszy furszy deleted the 2024_wallet_db_dangling_txn branch February 1, 2024 12:48
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024
By encapsulating sqlite3_exec into its own standalone method
and introducing the 'SQliteExecHandler' class, we enable the
ability to test db statements execution failures within the
unit test framework.

This is used in the following-up commit to exercise a deadlock
and improve our wallet db error handling code.

Moreover, the future encapsulation of other sqlite functions
within this class will contribute to minimize the impact of
any future API changes.

Github-Pull: bitcoin#29253
Rebased-From: dca874e
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024
Util function to clean up code and let us
verify, in the following-up commit, that dangling,
to-be-reverted db transactions cannot occur anymore.

Github-Pull: bitcoin#29253
Rebased-From: 472d2ca
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024
If the handler that initiated the database transaction is destroyed,
the ongoing transaction cannot be left dangling when the db txn fails
to abort. It must be forcefully reversed; otherwise, any subsequent
db handler executing a write operation will dump the dangling,
to-be-reverted transaction data to disk.

This not only breaks the database isolation property but also results
in the improper storage of incomplete information on disk, impacting
the wallet consistency.

Github-Pull: bitcoin#29253
Rebased-From: fc0e747
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024
@bitcoin bitcoin locked and limited conversation to collaborators Jan 31, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants