-
Notifications
You must be signed in to change notification settings - Fork 37.7k
wallet: guard against dangling to-be-reverted db transactions #29253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
fb9e1b3
to
d4ebac8
Compare
I'm trying to understand this, but confused about why executing "ROLLBACK TRANSACTION" would be expected to fail in sqlite. It seems like the fix in e5217fe is handling failure to roll back by trying to close and reopen the database. But I don't understand why rolling back a transaction would fail to begin with, unless there was some serious error like disk corruption. Probably to make the this fix clearer and safer it would be good to check the specific error code from the failed rollback instead of just checking res == SQLITE_OK. But I'm generally just confused. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand this, but confused about why executing "ROLLBACK TRANSACTION" would be expected to fail in sqlite. It seems like the fix in e5217fe is handling failure to roll back by trying to close and reopen the database. But I don't understand why rolling back a transaction would fail to begin with, unless there was some serious error like disk corruption. Probably to make the this fix clearer and safer it would be good to check the specific error code from the failed rollback instead of just checking res == SQLITE_OK. But I'm generally just confused.
The current sqlite configuration doesn't allow concurrent database transactions (see #29112). This means that if there is a failure during the WalletBatch destruction (independently on if it is due to a fatal or transient and recoverable error, like a busy db or a loss of the db connection), the created transaction will remain active indefinitely, contrary to what all workflows expect. This situation provoques that any subsequent write operation on a new database handler (WalletBatch; it doesn't necessarily need to begin a new transaction, could also be a direct write) will dump information that was intended to be rolled back to disk.
Check #29112 (review) to see the test that originated this work.
Yes I've started reviewing both PR's and the code changes seem reasonable, I just don't understand what motivated this PR, because "failure during the WalletBatch destruction" seems like something I would never expect to happen unless someone modified the code or manually corrupted the data on disk. So I'm wondering did we accidentally cause a breakage like that during development of these PRs? Or did you just notice that the error handling in the destructor wasn't very good, and decide to fix it? Or can failures to roll back actually happen in more cases than it seems, and this is a bug that could happen during normal operation without data corruption? |
It shouldn't ever happen. In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review d4ebac8. This seems like a nice change that could make the wallet more reliable, and adds good cleanup and test coverage.
Did not ack just because I noticed a bug in SQLiteBatch::TxnBegin if m_db
is null, but none of my other comments are too important, so feel free to ignore them.
re: #29253 (comment)
In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?
Yeah that sounds pretty close to ideal. Maybe not an urgent problem to solve, but could be nice.
Yeah, I just noted that the error handling in the destructor isn't optimal. That if this ever occurs in #29112, the introduced database mutex would lock indefinitely.
It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns). |
d4ebac8
to
493cfc7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated per feedback, thanks @ryanofsky!
The main modification resides in the ExecHandler
class, which is now designed to encapsulate all sqlite functions. This enables us to trigger different errors across the database sources; letting us improve the error-handling code and expand its test coverage. Furthermore, future encapsulation of sqlite functions within this class will contribute to minimizing the impact of any future API changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 493cfc7. I left more more comments about comments (feel free to ignore), but otherwise this looks good.
re: #29253 (comment)
In general, we don't have very robust error handling for when the database fails to read or write something. I think this should actually be more like an assert, but one that doesn't kill the whole software, maybe just unloads the wallet?
It wouldn't be ideal if the error is still recoverable (e.g. SQLITE_BUSY if we ever introduce concurrent txns).
In this case do we ever think we should see SQLITE_BUSY while trying to abandon a transaction? But yes in general we should treat expected errors and unexpected errors differently, and not handle expected errors by unloading the wallet.
src/wallet/sqlite.cpp
Outdated
// If the transaction cannot be aborted, it may be due to the database connection being "busy", indicating a potential bug or data corruption. | ||
// In such cases, attempt recovery by closing and reopening the database. This action will automatically roll back the transaction, | ||
// ensuring that any changes made since the transaction was opened will be reverted. This prevents dangling transaction data | ||
// outside the context of the 'SQLiteBatch' that initiated such a transaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "sqlite: guard against dangling to-be-reverted db transactions" (0bf3a73)
I think this comment might be more harmful than helpful in its current form, because it isn't communicating the most important thing, which is that this code path is never expected to be hit under normal conditions, and if it is hit, it means something is seriously wrong and there has probably been data loss, if not data corruption. By saying this failure can happen when the connection is "busy", it makes it sound like database could just be waiting for a lock. But locking alone could never trigger this, so suggesting that it could raises doubts about the rest of the code, and makes the handling here seem to be unnecessary.
Also saying "this action will automatically rollback the transaction ensuring..." seems like an overstatement. Even if it's probably true, we are in the middle of recovering from some other significant bug or I/O error, so this comment seems to be going out on a limb to make a guarantee that isn't clearly significant. It is also unclear to me what "transaction data outside the context of the 'SQLiteBatch'" would refer to if SQLiteBatch is the class responsible for creating transactions.
Considering all this, I'd probably suggest still going with my original comment from #29253 (comment) which was "// If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back."
But maybe this comment is missing some other information that you wanted to get across?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also unclear to me what "transaction data outside the context of the 'SQLiteBatch'" would refer to if SQLiteBatch is the class responsible for creating transactions.
The purpose of this sentence was to describe what could happen if we don't rollback the transaction and explain why we rollback the transaction inside the SQLiteBatch
destructor.
The complete sentence, "This prevents dangling transaction data outside the context of the 'SQLiteBatch' that initiated such a transaction," refers to the SQLiteBatch
instance, not the class. Each SQLiteBatch
can create a single transaction at a time; no concurrent SQLiteBatch
instances are allowed (#29112). Thus, the code rolls back the db txn within the SQLiteBatch
object destructor. This links the db txn to the SQLiteBatch
instance lifecycle, isolating data changes and preventing other SQLiteBatch
instances from dumping data that does not correspond to them.
But maybe this comment is missing some other information that you wanted to get across?
I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling. Also, wanted to mention that the error could be transient (so we don't forget about this possibility) and describe why we want to keep the transaction data changes isolated within the SQLiteBatch
instance that initiated the transaction.
If, after reading this response, you still have doubts and find the comment confusing, I can replace it with your comment. No problem; the idea was to provide a clearer description of the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #29253 (comment)
Thanks for clarifying.
I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling.
That seems ok. Would maybe suggest adding to the previously suggested comment:
- // If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back and future transactions can succeed without committing old data.
Also, wanted to mention that the error could be transient (so we don't forget about this possibility)
This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.
If you want talk about a theoretical future state, in some future version of sqlite or future version of the wallet codebase where a rollback statement could return SQLITE_BUSY, I guess you could do that but it seems speculative and confusing if you don't distinguish that scenario with what is actually happening in the current code.
and describe why we want to keep the transaction data changes isolated within the
SQLiteBatch
instance that initiated the transaction.
I think that's mostly clear from the context but if you want to say more about it, the best place to put it would be on line 410 before the TxnAbort() call is made, to explain why the TxnAbort() call is being made and why it's important. Trying to get into why the rollback is important after the rolllback fails seems like going down a tangent and not putting information where it is most helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good, applied your suggestion. Thanks!
This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.
I was also thinking about an I/O hardware malfunctioning error, which "could" be transient; sqlite has return codes for it. But it's probably better to abort the program if something like this ever happens.
493cfc7
to
756bed0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 756bed0. New update consolidating the exec classes looks good.
src/wallet/sqlite.cpp
Outdated
// If the transaction cannot be aborted, it may be due to the database connection being "busy", indicating a potential bug or data corruption. | ||
// In such cases, attempt recovery by closing and reopening the database. This action will automatically roll back the transaction, | ||
// ensuring that any changes made since the transaction was opened will be reverted. This prevents dangling transaction data | ||
// outside the context of the 'SQLiteBatch' that initiated such a transaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #29253 (comment)
Thanks for clarifying.
I wanted to emphasize the importance of rolling back the transaction by explaining the consequences of leaving the transaction dangling.
That seems ok. Would maybe suggest adding to the previously suggested comment:
- // If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction was opened will be rolled back and future transactions can succeed without committing old data.
Also, wanted to mention that the error could be transient (so we don't forget about this possibility)
This is exactly what I think the comment should not be saying because the error cannot be transient. And it's confusing to suggest that it could be transient, because it either makes the rest of the code seem broken, or it makes this code seem unnecessary.
If you want talk about a theoretical future state, in some future version of sqlite or future version of the wallet codebase where a rollback statement could return SQLITE_BUSY, I guess you could do that but it seems speculative and confusing if you don't distinguish that scenario with what is actually happening in the current code.
and describe why we want to keep the transaction data changes isolated within the
SQLiteBatch
instance that initiated the transaction.
I think that's mostly clear from the context but if you want to say more about it, the best place to put it would be on line 410 before the TxnAbort() call is made, to explain why the TxnAbort() call is being made and why it's important. Trying to get into why the rollback is important after the rolllback fails seems like going down a tangent and not putting information where it is most helpful.
By encapsulating sqlite3_exec into its own standalone method and introducing the 'SQliteExecHandler' class, we enable the ability to test db statements execution failures within the unit test framework. This is used in the following-up commit to exercise a deadlock and improve our wallet db error handling code. Moreover, the future encapsulation of other sqlite functions within this class will contribute to minimize the impact of any future API changes.
Util function to clean up code and let us verify, in the following-up commit, that dangling, to-be-reverted db transactions cannot occur anymore.
If the handler that initiated the database transaction is destroyed, the ongoing transaction cannot be left dangling when the db txn fails to abort. It must be forcefully reversed; otherwise, any subsequent db handler executing a write operation will dump the dangling, to-be-reverted transaction data to disk. This not only breaks the database isolation property but also results in the improper storage of incomplete information on disk, impacting the wallet consistency.
756bed0
to
b298242
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated per feedback, thanks ryanofsky!
Docstrings were improved, and a bug related to a null SQLiteExecHandler
has been fixed.
Small diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK b298242. Just fix for exec result codes and comment update since last review.
Overall this change should help wallet recover and not cause more problems if a rollback fails, and it adds some nice test coverage.
@@ -61,6 +71,8 @@ class SQLiteBatch : public DatabaseBatch | |||
explicit SQLiteBatch(SQLiteDatabase& database); | |||
~SQLiteBatch() override { Close(); } | |||
|
|||
void SetExecHandler(std::unique_ptr<SQliteExecHandler>&& handler) { m_exec_handler = std::move(handler); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "sqlite: add ability to interrupt statements" (dca874e)
It's a little better to pass plain unique_ptr
instead of unique_ptr&&
, because former guarantees unique_ptr argument will be moved-from and null afterward, while latter means the move may or may not happen depending on internals of the function. In general you want explicit && references for template programming and perfect forwarding, but should use values and & references otherwise.. Can see https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rf-conventional for guidelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok noted, thanks!
If we do have recoverable errors, then we should be handling them, not hitting them with the hammer of closing and reopening the database. This PR feels like it's the wrong approach to dealing with this particular database failure. It shouldn't be possible for a transaction commit or abort to currently fail in way that isn't catastrophic. In that case, we should be shutting the wallet rather than reopening the database connection and pretending that nothing happened. This could potentially lead to a state where the wallet's in-memory state does not match the database, which is really bad and could lead to loss of funds. |
It seems like if there is failure to rollback a transaction one way, with a rollback statement, this PR is just retrying again a different way, by closing and reopening the database. In theory the code could always roll back transactions by closing and reopening the database, so the new behavior in this PR doesn't seem worse than current behavior. One way the PR seems better than the current code is that the current code just logs "SQLiteBatch: Batch closed and failed to abort transaction" if the rollback fails, which is not even an obvious error message, and it continues like nothing happened. The new code at least raises an exception if the rollback fails. All of this seems very theoretical to me since I have no idea why a rollback statement would ever fail. But I think this PR seems reasonable, adds some test coverage, and probably is a small improvement even if better approaches are also possible. |
ACK b298242 |
Github-Pull: bitcoin#29253 Rebased-From: fdf9f66
By encapsulating sqlite3_exec into its own standalone method and introducing the 'SQliteExecHandler' class, we enable the ability to test db statements execution failures within the unit test framework. This is used in the following-up commit to exercise a deadlock and improve our wallet db error handling code. Moreover, the future encapsulation of other sqlite functions within this class will contribute to minimize the impact of any future API changes. Github-Pull: bitcoin#29253 Rebased-From: dca874e
Util function to clean up code and let us verify, in the following-up commit, that dangling, to-be-reverted db transactions cannot occur anymore. Github-Pull: bitcoin#29253 Rebased-From: 472d2ca
If the handler that initiated the database transaction is destroyed, the ongoing transaction cannot be left dangling when the db txn fails to abort. It must be forcefully reversed; otherwise, any subsequent db handler executing a write operation will dump the dangling, to-be-reverted transaction data to disk. This not only breaks the database isolation property but also results in the improper storage of incomplete information on disk, impacting the wallet consistency. Github-Pull: bitcoin#29253 Rebased-From: fc0e747
Github-Pull: bitcoin#29253 Rebased-From: b298242
Github-Pull: bitcoin#29253 Rebased-From: b298242
Discovered while was reviewing #29112, specifically #29112 (review).
If the db handler that initiated the database transaction is destroyed,
the ongoing transaction cannot be left dangling when the db txn fails
to abort. It must be forcefully reverted; otherwise, any subsequent
db handler executing a write operation will dump the dangling,
to-be-reverted transaction data to disk.
This not only breaks the isolation property but also results in the
improper storage of incomplete information on disk, impacting
the wallet consistency.
This PR fixes the issue by resetting the db connection, automatically
rolling back the transaction (per https://www.sqlite.org/c3ref/close.html)
when the handler object is being destroyed and the txn abortion failed.
Testing Notes
Can verify the failure by reverting the fix e5217fe and running the test.
It will fail without e5217fe and pass with it.