Optimistically write data to disk when batch loading data into the system #4996

Mytherin · 2022-10-14T13:54:31Z

This PR enables DuckDB to optimistically compress and write data to disk when bulk loading.

Previously, when loading data from e.g. a large CSV or Parquet file in a single transaction, data was first loaded entirely into transaction-local storage. When committing, the data would then be flushed out of transaction-local storage and written to the actual database file on disk.

While this has the advantage that discarding the transaction-local state is free in case of a ROLLBACK or an error during loading, it comes with a big disadvantage: when loading data that is bigger than memory, we have to make multiple round-trips to disk.

Previously, this is what would happen when loading more data than fits in memory in a single transaction:

First, we load into transaction-local storage, writing data that does not fit into memory to temporary files.
Then, we read from transaction-local storage, reading data back in from those temporary files.
Finally, we compress the data and write it to the database file.

Optimistic Streaming to Disk

In this PR, we instead optimistically write transaction-local data to the database file as it is appended to tables. For every row group that is appended (~120K rows), we immediately compress the data and write it out to the database file. This allows for a full streaming load into the database file, and greatly speeds up loading when more data than fits in memory is loaded in a single transaction.
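The streaming behavior described above can be sketched as a toy model. This is not DuckDB code: `StreamingAppender`, the in-memory `db_file`, and the tiny `ROW_GROUP_SIZE` are all hypothetical names and values chosen for illustration, and `zlib` merely stands in for whatever compression the storage layer applies.

```python
import io
import zlib

ROW_GROUP_SIZE = 4  # illustrative only; the PR mentions roughly 120K rows per row group


class StreamingAppender:
    """Toy model: compress and write each row group as soon as it is complete."""

    def __init__(self, db_file, row_group_size=ROW_GROUP_SIZE):
        self.db_file = db_file
        self.row_group_size = row_group_size
        self.pending = []        # rows of the row group currently being filled
        self.flushed_groups = 0  # number of row groups already written out

    def append(self, row):
        self.pending.append(row)
        if len(self.pending) == self.row_group_size:
            self._flush()

    def _flush(self):
        # Serialize and compress the completed row group, then write it immediately.
        payload = "\n".join(map(str, self.pending)).encode("utf-8")
        self.db_file.write(zlib.compress(payload))
        self.flushed_groups += 1
        self.pending = []


db_file = io.BytesIO()  # stands in for the database file
appender = StreamingAppender(db_file)
for i in range(10):
    appender.append(i)

print(appender.flushed_groups, len(appender.pending))  # prints "2 2"
```

Only the row group currently being filled stays in memory in this sketch. What happens to the final partial row group at commit time is deliberately left out.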

If the transaction is rolled back or aborted, the blocks that were pre-emptively written to disk are marked as unused and reclaimed by the system for use in subsequent writes. This might still cause the database file to grow temporarily, and may create gaps in the database file if multiple transactions are writing at the same time and a subset of those transactions abort. That space is not lost, however: it will be re-used by the system when new data is ingested.
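A minimal sketch of the reclaim behavior, assuming a file made of fixed-size blocks tracked by a simple free list. `ToyBlockManager` and `ToyTransaction` are hypothetical names for this illustration and do not mirror the actual block manager classes.

```python
class ToyBlockManager:
    """Toy model of a database file made of fixed-size blocks with a free list."""

    def __init__(self):
        self.total_blocks = 0  # current size of the file, in blocks
        self.free_blocks = []  # block ids that are unused and can be reused

    def allocate(self):
        # Prefer a reclaimed block; only grow the file when none is available.
        if self.free_blocks:
            return self.free_blocks.pop()
        block_id = self.total_blocks
        self.total_blocks += 1
        return block_id

    def mark_unused(self, block_id):
        self.free_blocks.append(block_id)


class ToyTransaction:
    def __init__(self, manager):
        self.manager = manager
        self.written = []  # blocks optimistically written by this transaction

    def write_row_group(self):
        self.written.append(self.manager.allocate())

    def commit(self):
        self.written = []  # the blocks now belong to the table

    def rollback(self):
        # Hand every optimistically written block back for reuse.
        for block_id in self.written:
            self.manager.mark_unused(block_id)
        self.written = []


manager = ToyBlockManager()
t1, t2 = ToyTransaction(manager), ToyTransaction(manager)
for _ in range(2):  # interleaved writers: t1 gets blocks 0 and 2, t2 gets 1 and 3
    t1.write_row_group()
    t2.write_row_group()
t1.rollback()  # leaves gaps at blocks 0 and 2
t2.commit()

t3 = ToyTransaction(manager)
for _ in range(3):
    t3.write_row_group()
print(manager.total_blocks, sorted(t3.written))  # prints "5 [0, 2, 4]"
```

The file grew to four blocks while both writers were active and did not shrink after the rollback, but the next writer filled the two gaps before extending the file by a single block.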

The actual performance gain depends mostly on the speed of the storage. On my MacBook (with a very fast SSD) the performance difference is small (on the order of 10% faster). On a machine with a hard disk or slower SSD, the performance difference will be far larger.

Another benefit is that required disk space drops heavily, as we will no longer have to write uncompressed data to disk. Instead, the data will be directly compressed as it is written to the table.
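As a rough back-of-the-envelope model of why both disk traffic and disk space drop, the sketch below counts bytes moved under simplifying assumptions: everything that does not fit in memory is spilled uncompressed exactly once, and the data compresses by a fixed ratio. The function names and the input numbers (100 GB of data, 16 GB of memory, 4x compression) are hypothetical and are not measurements from this PR.

```python
def bytes_moved_old(data_size, memory_size, compression_ratio):
    """Disk traffic for the previous path, under the assumptions stated above."""
    spilled = max(0, data_size - memory_size)  # uncompressed data that did not fit in memory
    temp_write = spilled                       # step 1: spill to temporary files
    temp_read = spilled                        # step 2: read the spilled data back in
    db_write = data_size / compression_ratio   # step 3: write compressed data to the database file
    return temp_write + temp_read + db_write


def bytes_moved_new(data_size, compression_ratio):
    """Disk traffic when each row group is compressed and written exactly once."""
    return data_size / compression_ratio


GB = 1024 ** 3
old = bytes_moved_old(100 * GB, 16 * GB, 4.0)  # hypothetical inputs
new = bytes_moved_new(100 * GB, 4.0)
print(old / GB, new / GB)  # prints "193.0 25.0"
```

Under the same assumptions, the peak extra disk space in the previous path is the spilled, uncompressed portion (84 GB in this example), which the optimistic path never writes.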

SQLLogicTest: `concurrentloop`

As part of this PR, we expand the sqllogictest framework with two new loops: `concurrentloop` and `concurrentforeach`. These operate similarly to the existing `loop` and `foreach`, with one big difference: every iteration of the loop is executed in parallel using separate threads that each have their own connection to the database. Example:

statement ok
CREATE TABLE integers(i INTEGER)

concurrentloop threadid 0 20

statement ok
INSERT INTO integers SELECT * FROM range(100);

endloop

query II
SELECT COUNT(*), SUM(i) FROM integers
----
2000	99000

We have several tests for this type of behavior written in C++, but adding this functionality to the sqllogictest framework makes it significantly easier to write tests for the multiple-connection scenario, which should also enable us to write many more of them.
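To make the semantics of the example concrete, here is a toy Python model of what the test above does: one thread per loop iteration, all inserting into the same table, followed by a single check. `ToyTable` and `concurrent_loop` are hypothetical helpers for this sketch. It does not use DuckDB or the real test runner, and it models the table as a lock-protected list rather than giving each thread its own connection.

```python
import threading


class ToyTable:
    """Toy stand-in for the integers table; a lock makes appends thread-safe."""

    def __init__(self):
        self.rows = []
        self.lock = threading.Lock()

    def insert(self, values):
        with self.lock:
            self.rows.extend(values)


def concurrent_loop(start, end, body):
    # Run body(i) for every i in [start, end), each iteration in its own thread.
    threads = [threading.Thread(target=body, args=(i,)) for i in range(start, end)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


integers = ToyTable()
concurrent_loop(0, 20, lambda threadid: integers.insert(range(100)))
print(len(integers.rows), sum(integers.rows))  # prints "2000 99000"
```

The expected result follows from 20 iterations of 100 rows each, and 20 times the sum of 0 through 99 (4950), regardless of the order in which the threads run.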

Mytherin added 30 commits October 11, 2022 14:25

Extract Writer requirement from Checkpoint

124845d

Split RowGroup::Checkpoint into two functions (WriteToDisk and checkpoint)

00495ed

Initial version of optimistic write working

a95efa3

More tests

82deeb8

Correctly reclaim space after rollback of eager write to disk

30238e8

Optimistic writes and deletes

5cbd5ac

Delete, update & abort tests

e8baeed

Add tests + fixes for ALTER TYPE on optimistically written data

1a14426

Ensure temporary tables are not written to storage

63ac8b7

Also don't flush for in-memory systems

28c0aa1

Optimistic writes with indexes

5e03bc9

Merge branch 'master' into flushlocaltodisk2

887c0a6

Avoid unnecessary atomic loads

fae9d57

Add support for concurrentloop and concurrentforeach

33c4b49

SQLLogicTest rework WIP: add logger, and improve support for parallelism

e2e5f1c

Test runner: avoid exiting the program while tests are running in parallel

2992900

Concurrent append tests using sqllogic test

b7b5bf2

Concurrent batch appends

53e1f6a

Add locks to single file block manager

64797e4

Fix data-race in ungrouped COUNT(DISTINCT): set event BEFORE scheduling tasks, otherwise if all tasks finish before the event is scheduled weird things happen

1c0e74a

Fixes for single-file compilation

61189f0

In LocalStorage, flush last row group to disk as well to take advantage of existing partial blocks

b99c2b6

Clean-up includes: avoid including local storage in transaction.hpp

1b04ba6

Cyclic insertions working

010be5a

Rework the way in which the local storage flushes row groups - no longer store a pointer to a specific row group as it might become invalidated if we run e.g. an alter type

337a97f

Generate compression_types when required instead of unnecessarily caching

957f515

skip_reload for tests with temporary tables

3de1f71

Interleaved appends no longer happen

24bcba6

Move to slow tests

544fe18

Missing ICU include

e072131

Increment SQLite scanner version

77db580

Mytherin merged commit 91ed0af into duckdb:master Oct 15, 2022

Mytherin mentioned this pull request Oct 25, 2022

CSV loading start using swap #321

Closed

This was referenced Nov 10, 2022

Parallel CSV Reader #5194

Merged

OOM when reading Parquet file #3969

Closed

Mytherin deleted the flushlocaltodisk2 branch January 7, 2023 14:58

sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this pull request May 21, 2023

Materialize the huge Biostatistics tests JSON into a DuckDB table in vague hope that this will exploit some logic to better spill out to disk duckdb/duckdb#4996

17cffa1
