Skip to content

Conversation

Mytherin
Copy link
Collaborator

@Mytherin Mytherin commented Sep 3, 2024

This PR fixes an issue where checkpoints together with concurrent queries that referenced the same table multiple times could lead to deadlocks.

The issue was as follows:

When doing a checkpoint of a table, we would grab an exclusive lock for a given table while writing the data for that table.
Grabbing an exclusive lock meant waiting for all existing readers to finish, while not allowing new readers to start. This allows existing readers to finish, while guaranteeing progress for the checkpoint thread (new readers cannot continuously block the checkpoint)

This causes problems when we have a query that refers to the same table multiple times - as we would try to grab the same lock multiple times within the same transaction. The deadlock would happen as follows:

  • Transaction T1 grabs a read lock on table
  • Checkpointer grabs an exclusive lock on table, preventing new readers from starting
  • Transaction T1 tries to grab another read lock on table

This would then result in a deadlock, as T1 would wait need to wait until the checkpoint thread finishes - but the checkpoint thread needs to wait until T1 finishes.

This PR resolves the issue by keeping track of all active locks grabbed by a transaction, and ensuring we never grab the same shared lock multiple times in the same transaction.

@Mytherin Mytherin merged commit 74c9f4d into duckdb:main Sep 3, 2024
39 checks passed
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Sep 9, 2024
Merge pull request duckdb/duckdb#13712 from Mytherin/checkpointdeadlock
Merge pull request duckdb/duckdb#13710 from lnkuiper/tsan_suppress_jemalloc
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Sep 9, 2024
Merge pull request duckdb/duckdb#13712 from Mytherin/checkpointdeadlock
Merge pull request duckdb/duckdb#13710 from lnkuiper/tsan_suppress_jemalloc

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
@Mytherin Mytherin deleted the checkpointdeadlock branch September 17, 2024 08:09
Mytherin added a commit that referenced this pull request Oct 15, 2024
…ce (#14379)

This is a follow-up PR from #13712
that removes the code-paths that grabbed the table lock through a
different manner, avoiding the transaction-level interface. Looking at
the code this seems to only be used in the WAL - so it's unlikely this
would actually cause problems, but it's better to clean up this code
path entirely so it doesn't get accidentally used in the future.
Mytherin added a commit that referenced this pull request Oct 16, 2024
This is a follow-up to #13712,
using a more fine-granular locking approach.

#13712 has fixed an issue where checkpoints together with concurrent
queries that referenced the same table multiple times could lead to
deadlocks. The locking approach in that PR used a transaction-level lock
to coordinate lock acquisition for the same table.

In case of a query having many table scans (for different tables), this
could potentially now block all executor threads if only a single table
lock could not be acquired (e.g. due to a concurrent checkpoint). One
thread would be blocked trying to acquire the checkpoint lock for the
given table, and many others on the active_locks_lock (even though they
would only be interested in acquiring the checkpoint lock for another
table that wasn't being checkpointed).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant