wallet: guard against dangling to-be-reverted db transactions #29253

furszy · 2024-01-16T00:41:40Z

Discovered while was reviewing #29112, specifically #29112 (review).

If the db handler that initiated the database transaction is destroyed,
the ongoing transaction cannot be left dangling when the db txn fails
to abort. It must be forcefully reverted; otherwise, any subsequent
db handler executing a write operation will dump the dangling,
to-be-reverted transaction data to disk.

This not only breaks the isolation property but also results in the
improper storage of incomplete information on disk, impacting
the wallet consistency.

This PR fixes the issue by resetting the db connection, automatically
rolling back the transaction (per https://www.sqlite.org/c3ref/close.html)
when the handler object is being destroyed and the txn abortion failed.

Testing Notes
Can verify the failure by reverting the fix e5217fe and running the test.
It will fail without e5217fe and pass with it.

DrahtBot · 2024-01-16T00:41:43Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	ryanofsky, achow101

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#29112 (sqlite: Disallow writing from multiple SQLiteBatchs by achow101)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

src/wallet/sqlite.cpp

ryanofsky · 2024-01-23T15:09:48Z

I'm trying to understand this, but confused about why executing "ROLLBACK TRANSACTION" would be expected to fail in sqlite. It seems like the fix in e5217fe is handling failure to roll back by trying to close and reopen the database. But I don't understand why rolling back a transaction would fail to begin with, unless there was some serious error like disk corruption. Probably to make the this fix clearer and safer it would be good to check the specific error code from the failed rollback instead of just checking res == SQLITE_OK. But I'm generally just confused.

furszy

I'm trying to understand this, but confused about why executing "ROLLBACK TRANSACTION" would be expected to fail in sqlite. It seems like the fix in e5217fe is handling failure to roll back by trying to close and reopen the database. But I don't understand why rolling back a transaction would fail to begin with, unless there was some serious error like disk corruption. Probably to make the this fix clearer and safer it would be good to check the specific error code from the failed rollback instead of just checking res == SQLITE_OK. But I'm generally just confused.

The current sqlite configuration doesn't allow concurrent database transactions (see #29112). This means that if there is a failure during the WalletBatch destruction (independently on if it is due to a fatal or transient and recoverable error, like a busy db or a loss of the db connection), the created transaction will remain active indefinitely, contrary to what all workflows expect. This situation provoques that any subsequent write operation on a new database handler (WalletBatch; it doesn't necessarily need to begin a new transaction, could also be a direct write) will dump information that was intended to be rolled back to disk.

Check #29112 (review) to see the test that originated this work.

ryanofsky · 2024-01-25T16:09:58Z

if there is a failure during the WalletBatch destruction (independently on if it is due to a fatal or transient and recoverable error, like a busy db or a loss of the db connection)

Yes I've started reviewing both PR's and the code changes seem reasonable, I just don't understand what motivated this PR, because "failure during the WalletBatch destruction" seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs? Or did you just notice that the error handling in the destructor wasn't very good, and decide to fix it? Or can failures to roll back actually happen in more cases than it seems, and this is a bug that could happen during normal operation without data corruption?

achow101 · 2024-01-25T16:33:31Z

seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs?

It shouldn't ever happen.

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

ryanofsky

Code review d4ebac8. This seems like a nice change that could make the wallet more reliable, and adds good cleanup and test coverage.

Did not ack just because I noticed a bug in SQLiteBatch::TxnBegin if m_db is null, but none of my other comments are too important, so feel free to ignore them.

re: #29253 (comment)

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

Yeah that sounds pretty close to ideal. Maybe not an urgent problem to solve, but could be nice.

src/wallet/sqlite.cpp

src/wallet/sqlite.h

src/wallet/sqlite.cpp

src/wallet/test/db_tests.cpp

furszy · 2024-01-25T22:47:39Z

I just don't understand what motivated this PR, because "failure during the WalletBatch destruction" seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs? Or did you just notice that the error handling in the destructor wasn't very good, and decide to fix it? Or can failures to roll back actually happen in more cases than it seems, and this is a bug that could happen during normal operation without data corruption?

Yeah, I just noted that the error handling in the destructor isn't optimal. That if this ever occurs in #29112, the introduced database mutex would lock indefinitely.
Then, I considered the same event in current master: which could lead to the writing of a dangling transaction to disk, which would corrupt the wallet and leave it at a non-recoverable point. Which is quite bad. And that motivated me to split the PR and force the txn rollback within the WalletBatch context as a last resource safety measure.

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).
At least in this scenario, where we are destroying the object that created the transaction, I believe triggering the database connection reset is acceptable as a first measure. If it fails, then yeah, I agree that unloading the wallet instead of exiting the node would be desirable.

furszy

Updated per feedback, thanks @ryanofsky!

The main modification resides in the ExecHandler class, which is now designed to encapsulate all sqlite functions. This enables us to trigger different errors across the database sources; letting us improve the error-handling code and expand its test coverage. Furthermore, future encapsulation of sqlite functions within this class will contribute to minimizing the impact of any future API changes.

src/wallet/sqlite.cpp

src/wallet/sqlite.h

src/wallet/test/db_tests.cpp

ryanofsky

Code review ACK 493cfc7. I left more more comments about comments (feel free to ignore), but otherwise this looks good.

re: #29253 (comment)

In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?

It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).

In this case do we ever think we should see SQLITE_BUSY while trying to abandon a transaction? But yes in general we should treat expected errors and unexpected errors differently, and not handle expected errors by unloading the wallet.

src/wallet/sqlite.h

ryanofsky · 2024-01-29T18:28:52Z

src/wallet/sqlite.cpp

+            // If the transaction cannot be aborted, it may be due to the database connection being "busy", indicating a potential bug or data corruption.
+            // In such cases, attempt recovery by closing and reopening the database. This action will automatically roll back the transaction,
+            // ensuring that any changes made since the transaction was opened will be reverted. This prevents dangling transaction data
+            // outside the context of the 'SQLiteBatch' that initiated such a transaction.


In commit "sqlite: guard against dangling to-be-reverted db transactions" (0bf3a73)

I think this comment might be more harmful than helpful in its current form, because it isn't communicating the most important thing, which is that this code path is never expected to be hit under normal conditions, and if it is hit, it means something is seriously wrong and there has probably been data loss, if not data corruption. By saying this failure can happen when the connection is "busy", it makes it sound like database could just be waiting for a lock. But locking alone could never trigger this, so suggesting that it could raises doubts about the rest of the code, and makes the handling here seem to be unnecessary.

Also saying "this action will automatically rollback the transaction ensuring..." seems like an overstatement. Even if it's probably true, we are in the middle of recovering from some other significant bug or I/O error, so this comment seems to be going out on a limb to make a guarantee that isn't clearly significant. It is also unclear to me what "transaction data outside the context of the 'SQLiteBatch'" would refer to if SQLiteBatch is the class responsible for creating transactions.

Considering all this, I'd probably suggest still going with my original comment from #29253 (comment) which was "// If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back."

But maybe this comment is missing some other information that you wanted to get across?

It is also unclear to me what "transaction data outside the context of the 'SQLiteBatch'" would refer to if SQLiteBatch is the class responsible for creating transactions.

The purpose of this sentence was to describe what could happen if we don't rollback the transaction and explain why we rollback the transaction inside the SQLiteBatch destructor.

The complete sentence, "This prevents dangling transaction data outside the context of the 'SQLiteBatch' that initiated such a transaction," refers to the SQLiteBatch instance, not the class. Each SQLiteBatch can create a single transaction at a time; no concurrent SQLiteBatch instances are allowed (#29112). Thus, the code rolls back the db txn within the SQLiteBatch object destructor. This links the db txn to the SQLiteBatch instance lifecycle, isolating data changes and preventing other SQLiteBatch instances from dumping data that does not correspond to them.

But maybe this comment is missing some other information that you wanted to get across?

I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling. Also, wanted to mention that the error could be transient (so we don't forget about this possibility) and describe why we want to keep the transaction data changes isolated within the SQLiteBatch instance that initiated the transaction.

If, after reading this response, you still have doubts and find the comment confusing, I can replace it with your comment. No problem; the idea was to provide a clearer description of the situation.

re: #29253 (comment)

Thanks for clarifying.

I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling.

That seems ok. Would maybe suggest adding to the previously suggested comment:

// If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back and future transactions can succeed without committing old data.

Also, wanted to mention that the error could be transient (so we don't forget about this possibility)

This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.

If you want talk about a theoretical future state, in some future version of sqlite or future version of the wallet codebase where a rollback statement could return SQLITE_BUSY, I guess you could do that but it seems speculative and confusing if you don't distinguish that scenario with what is actually happening in the current code.

and describe why we want to keep the transaction data changes isolated within the SQLiteBatch instance that initiated the transaction.

I think that's mostly clear from the context but if you want to say more about it, the best place to put it would be on line 410 before the TxnAbort() call is made, to explain why the TxnAbort() call is being made and why it's important. Trying to get into why the rollback is important after the rolllback fails seems like going down a tangent and not putting information where it is most helpful.

All good, applied your suggestion. Thanks!

This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.

I was also thinking about an I/O hardware malfunctioning error, which "could" be transient; sqlite has return codes for it. But it's probably better to abort the program if something like this ever happens.

src/wallet/test/db_tests.cpp

ryanofsky

Code review ACK 756bed0. New update consolidating the exec classes looks good.

src/wallet/sqlite.h

src/wallet/sqlite.cpp

ryanofsky · 2024-01-30T19:44:52Z

src/wallet/sqlite.cpp

+            // If the transaction cannot be aborted, it may be due to the database connection being "busy", indicating a potential bug or data corruption.
+            // In such cases, attempt recovery by closing and reopening the database. This action will automatically roll back the transaction,
+            // ensuring that any changes made since the transaction was opened will be reverted. This prevents dangling transaction data
+            // outside the context of the 'SQLiteBatch' that initiated such a transaction.


re: #29253 (comment)

Thanks for clarifying.

I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling.

That seems ok. Would maybe suggest adding to the previously suggested comment:

// If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back and future transactions can succeed without committing old data.

Also, wanted to mention that the error could be transient (so we don't forget about this possibility)

This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.

If you want talk about a theoretical future state, in some future version of sqlite or future version of the wallet codebase where a rollback statement could return SQLITE_BUSY, I guess you could do that but it seems speculative and confusing if you don't distinguish that scenario with what is actually happening in the current code.

and describe why we want to keep the transaction data changes isolated within the SQLiteBatch instance that initiated the transaction.

I think that's mostly clear from the context but if you want to say more about it, the best place to put it would be on line 410 before the TxnAbort() call is made, to explain why the TxnAbort() call is being made and why it's important. Trying to get into why the rollback is important after the rolllback fails seems like going down a tangent and not putting information where it is most helpful.

By encapsulating sqlite3_exec into its own standalone method and introducing the 'SQliteExecHandler' class, we enable the ability to test db statements execution failures within the unit test framework. This is used in the following-up commit to exercise a deadlock and improve our wallet db error handling code. Moreover, the future encapsulation of other sqlite functions within this class will contribute to minimize the impact of any future API changes.

Util function to clean up code and let us verify, in the following-up commit, that dangling, to-be-reverted db transactions cannot occur anymore.

If the handler that initiated the database transaction is destroyed, the ongoing transaction cannot be left dangling when the db txn fails to abort. It must be forcefully reversed; otherwise, any subsequent db handler executing a write operation will dump the dangling, to-be-reverted transaction data to disk. This not only breaks the database isolation property but also results in the improper storage of incomplete information on disk, impacting the wallet consistency.

furszy

Updated per feedback, thanks ryanofsky!
Docstrings were improved, and a bug related to a null SQLiteExecHandler has been fixed.
Small diff.

ryanofsky

Code review ACK b298242. Just fix for exec result codes and comment update since last review.

Overall this change should help wallet recover and not cause more problems if a rollback fails, and it adds some nice test coverage.

ryanofsky · 2024-01-31T16:11:15Z

src/wallet/sqlite.h

@@ -61,6 +71,8 @@ class SQLiteBatch : public DatabaseBatch
    explicit SQLiteBatch(SQLiteDatabase& database);
    ~SQLiteBatch() override { Close(); }

+    void SetExecHandler(std::unique_ptr<SQliteExecHandler>&& handler) { m_exec_handler = std::move(handler); }


In commit "sqlite: add ability to interrupt statements" (dca874e)

It's a little better to pass plain unique_ptr instead of unique_ptr&&, because former guarantees unique_ptr argument will be moved-from and null afterward, while latter means the move may or may not happen depending on internals of the function. In general you want explicit && references for template programming and perfect forwarding, but should use values and & references otherwise.. Can see https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rf-conventional for guidelines.

ok noted, thanks!

achow101 · 2024-01-31T18:55:24Z

It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).
At least in this scenario, where we are destroying the object that created the transaction, I believe triggering the database connection reset is acceptable as a first measure.

If we do have recoverable errors, then we should be handling them, not hitting them with the hammer of closing and reopening the database.

This PR feels like it's the wrong approach to dealing with this particular database failure. It shouldn't be possible for a transaction commit or abort to currently fail in way that isn't catastrophic. In that case, we should be shutting the wallet rather than reopening the database connection and pretending that nothing happened. This could potentially lead to a state where the wallet's in-memory state does not match the database, which is really bad and could lead to loss of funds.

ryanofsky · 2024-01-31T19:24:34Z

This PR feels like it's the wrong approach to dealing with this particular database failure. It shouldn't be possible for a transaction commit or abort to currently fail in way that isn't catastrophic. In that case, we should be shutting the wallet rather than reopening the database connection and pretending that nothing happened. This could potentially lead to a state where the wallet's in-memory state does not match the database, which is really bad and could lead to loss of funds.

It seems like if there is failure to rollback a transaction one way, with a rollback statement, this PR is just retrying again a different way, by closing and reopening the database. In theory the code could always roll back transactions by closing and reopening the database, so the new behavior in this PR doesn't seem worse than current behavior.

One way the PR seems better than the current code is that the current code just logs "SQLiteBatch: Batch closed and failed to abort transaction" if the rollback fails, which is not even an obvious error message, and it continues like nothing happened. The new code at least raises an exception if the rollback fails.

All of this seems very theoretical to me since I have no idea why a rollback statement would ever fail. But I think this PR seems reasonable, adds some test coverage, and probably is a small improvement even if better approaches are also possible.

achow101 · 2024-01-31T20:02:16Z

ACK b298242

Github-Pull: bitcoin#29253 Rebased-From: fdf9f66

By encapsulating sqlite3_exec into its own standalone method and introducing the 'SQliteExecHandler' class, we enable the ability to test db statements execution failures within the unit test framework. This is used in the following-up commit to exercise a deadlock and improve our wallet db error handling code. Moreover, the future encapsulation of other sqlite functions within this class will contribute to minimize the impact of any future API changes. Github-Pull: bitcoin#29253 Rebased-From: dca874e

Util function to clean up code and let us verify, in the following-up commit, that dangling, to-be-reverted db transactions cannot occur anymore. Github-Pull: bitcoin#29253 Rebased-From: 472d2ca

If the handler that initiated the database transaction is destroyed, the ongoing transaction cannot be left dangling when the db txn fails to abort. It must be forcefully reversed; otherwise, any subsequent db handler executing a write operation will dump the dangling, to-be-reverted transaction data to disk. This not only breaks the database isolation property but also results in the improper storage of incomplete information on disk, impacting the wallet consistency. Github-Pull: bitcoin#29253 Rebased-From: fc0e747

Github-Pull: bitcoin#29253 Rebased-From: b298242

test: wallet db, exercise deadlock after write failure

fdf9f66

DrahtBot added the Wallet label Jan 16, 2024

furszy mentioned this pull request Jan 16, 2024

sqlite: Disallow writing from multiple SQLiteBatchs #29112

Merged

DrahtBot mentioned this pull request Jan 16, 2024

Remove the legacy wallet and BDB dependency #28710

Merged

achow101 reviewed Jan 16, 2024

View reviewed changes

src/wallet/sqlite.cpp Show resolved Hide resolved

furszy force-pushed the 2024_wallet_db_dangling_txn branch from fb9e1b3 to d4ebac8 Compare January 16, 2024 20:19

achow101 added this to the 27.0 milestone Jan 16, 2024

bitcoin deleted a comment from moemat12 Jan 22, 2024

furszy commented Jan 25, 2024

View reviewed changes

ryanofsky reviewed Jan 25, 2024

View reviewed changes

furszy force-pushed the 2024_wallet_db_dangling_txn branch from d4ebac8 to 493cfc7 Compare January 26, 2024 13:07

furszy commented Jan 26, 2024

View reviewed changes

ryanofsky approved these changes Jan 29, 2024

View reviewed changes

furszy force-pushed the 2024_wallet_db_dangling_txn branch from 493cfc7 to 756bed0 Compare January 30, 2024 13:26

ryanofsky approved these changes Jan 30, 2024

View reviewed changes

furszy added 4 commits January 30, 2024 17:26

sqlite: introduce HasActiveTxn method

472d2ca

Util function to clean up code and let us verify, in the following-up commit, that dangling, to-be-reverted db transactions cannot occur anymore.

test: sqlite, add coverage for dangling to-be-reverted db txns

b298242

furszy force-pushed the 2024_wallet_db_dangling_txn branch from 756bed0 to b298242 Compare January 30, 2024 20:31

furszy commented Jan 30, 2024

View reviewed changes

DrahtBot added the CI failed label Jan 30, 2024

ryanofsky approved these changes Jan 31, 2024

View reviewed changes

achow101 merged commit a01da41 into bitcoin:master Jan 31, 2024

furszy deleted the 2024_wallet_db_dangling_txn branch February 1, 2024 12:48

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024

test: wallet db, exercise deadlock after write failure

967571d

Github-Pull: bitcoin#29253 Rebased-From: fdf9f66

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024

test: sqlite, add coverage for dangling to-be-reverted db txns

09d1192

Github-Pull: bitcoin#29253 Rebased-From: b298242

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Mar 14, 2024

test: sqlite, add coverage for dangling to-be-reverted db txns

bb3bb89

Github-Pull: bitcoin#29253 Rebased-From: b298242

bitcoin locked and limited conversation to collaborators Jan 31, 2025

wallet: guard against dangling to-be-reverted db transactions #29253

wallet: guard against dangling to-be-reverted db transactions #29253

Uh oh!

Conversation

furszy commented Jan 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented Jan 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage

Reviews

Conflicts

Uh oh!

Uh oh!

ryanofsky commented Jan 23, 2024

Uh oh!

furszy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryanofsky commented Jan 25, 2024

Uh oh!

achow101 commented Jan 25, 2024

Uh oh!

ryanofsky left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

furszy commented Jan 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

furszy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryanofsky left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ryanofsky Jan 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

furszy Jan 30, 2024

Choose a reason for hiding this comment

Uh oh!

ryanofsky Jan 30, 2024

Choose a reason for hiding this comment

Uh oh!

furszy Jan 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ryanofsky left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ryanofsky Jan 30, 2024

Choose a reason for hiding this comment

Uh oh!

furszy left a comment

Choose a reason for hiding this comment

Uh oh!

ryanofsky left a comment

Choose a reason for hiding this comment

furszy commented Jan 16, 2024 •

edited

Loading

DrahtBot commented Jan 16, 2024 •

edited

Loading

furszy left a comment •

edited

Loading

furszy commented Jan 25, 2024 •

edited

Loading

furszy left a comment •

edited

Loading

ryanofsky Jan 29, 2024 •

edited

Loading

furszy Jan 30, 2024 •

edited

Loading

achow101 commented Jan 31, 2024 •

edited

Loading