Fix checkpoint deadlock - re-use locks on the same table #13712

Mytherin · 2024-09-03T10:15:29Z

This PR fixes an issue where checkpoints together with concurrent queries that referenced the same table multiple times could lead to deadlocks.

The issue was as follows:

When doing a checkpoint of a table, we would grab an exclusive lock for a given table while writing the data for that table.
Grabbing an exclusive lock meant waiting for all existing readers to finish, while not allowing new readers to start. This allows existing readers to finish, while guaranteeing progress for the checkpoint thread (new readers cannot continuously block the checkpoint)

This causes problems when we have a query that refers to the same table multiple times - as we would try to grab the same lock multiple times within the same transaction. The deadlock would happen as follows:

Transaction T1 grabs a read lock on table
Checkpointer grabs an exclusive lock on table, preventing new readers from starting
Transaction T1 tries to grab another read lock on table

This would then result in a deadlock, as T1 would wait need to wait until the checkpoint thread finishes - but the checkpoint thread needs to wait until T1 finishes.

This PR resolves the issue by keeping track of all active locks grabbed by a transaction, and ensuring we never grab the same shared lock multiple times in the same transaction.

…tables

…ck the same table within the same transaction, and instead re-use the existing lock

Merge pull request duckdb/duckdb#13712 from Mytherin/checkpointdeadlock Merge pull request duckdb/duckdb#13710 from lnkuiper/tsan_suppress_jemalloc

Merge pull request duckdb/duckdb#13712 from Mytherin/checkpointdeadlock Merge pull request duckdb/duckdb#13710 from lnkuiper/tsan_suppress_jemalloc Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

…ce (#14379) This is a follow-up PR from #13712 that removes the code-paths that grabbed the table lock through a different manner, avoiding the transaction-level interface. Looking at the code this seems to only be used in the WAL - so it's unlikely this would actually cause problems, but it's better to clean up this code path entirely so it doesn't get accidentally used in the future.

This is a follow-up to #13712, using a more fine-granular locking approach. #13712 has fixed an issue where checkpoints together with concurrent queries that referenced the same table multiple times could lead to deadlocks. The locking approach in that PR used a transaction-level lock to coordinate lock acquisition for the same table. In case of a query having many table scans (for different tables), this could potentially now block all executor threads if only a single table lock could not be acquired (e.g. due to a concurrent checkpoint). One thread would be blocked trying to acquire the checkpoint lock for the given table, and many others on the active_locks_lock (even though they would only be interested in acquiring the checkpoint lock for another table that wasn't being checkpointed).

Mytherin added 3 commits September 3, 2024 10:53

Fix checkpoint deadlock when different threads are reading different …

2cd654f

…tables

Fix checkpoint deadlock in a different manner - ensure we don't re-lo…

93833df

…ck the same table within the same transaction, and instead re-use the existing lock

GetExclusiveLock with timeout is no longer necessary

3aa957b

Mytherin merged commit 74c9f4d into duckdb:main Sep 3, 2024
39 checks passed

github-actions bot mentioned this pull request Sep 9, 2024

chore: Update vendored sources to duckdb/duckdb@74c9f4df1fe5c3f39007aa38c112cb7582f91302 duckdb/duckdb-r#343

Merged

Mytherin deleted the checkpointdeadlock branch September 17, 2024 08:09

ywelsch mentioned this pull request Oct 15, 2024

Use table-level locking when acquiring shared locks #14370

Merged

Mytherin mentioned this pull request Oct 15, 2024

Table locks - always grab table locks through the transaction interface #14379

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix checkpoint deadlock - re-use locks on the same table #13712

Fix checkpoint deadlock - re-use locks on the same table #13712

Uh oh!

Mytherin commented Sep 3, 2024

Uh oh!

Uh oh!

Uh oh!

Fix checkpoint deadlock - re-use locks on the same table #13712

Fix checkpoint deadlock - re-use locks on the same table #13712

Uh oh!

Conversation

Mytherin commented Sep 3, 2024

Uh oh!

Uh oh!

Uh oh!