Conversation

Mytherin
Collaborator

This PR reduces the maximum memory usage of writing Parquet files with order-preservation enabled in parallel.

The way the order-preserving parallel Parquet writer works is that it materializes batches of data based on their batch index, then repartitions them into the desired row group sizes, and finally writes the batches out to disk (implemented in #7375).
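The pipeline described above can be sketched as follows. This is a hedged, hypothetical illustration (the `Batch` struct, `Repartition` function, and `int` rows are stand-ins, not DuckDB's actual types): batches arrive tagged with a batch index, are collected in index order, and are repartitioned into fixed-size row groups before being flushed.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Hypothetical stand-in for a materialized batch of rows.
struct Batch {
	uint64_t batch_index;
	std::vector<int> rows; // stand-in for the actual row data
};

// Repartition batches (kept sorted by batch index) into chunks of
// `row_group_size` rows each, preserving the original order.
std::deque<std::vector<int>> Repartition(const std::map<uint64_t, Batch> &batches,
                                         std::size_t row_group_size) {
	std::deque<std::vector<int>> row_groups;
	std::vector<int> current;
	for (auto &entry : batches) { // std::map iterates in batch-index order
		for (int row : entry.second.rows) {
			current.push_back(row);
			if (current.size() == row_group_size) {
				row_groups.push_back(std::move(current));
				current.clear();
			}
		}
	}
	if (!current.empty()) {
		row_groups.push_back(std::move(current)); // final partial row group
	}
	return row_groups;
}
```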

The problem is that writing is disconnected from the materialization of batches, and only a single thread can write to the file at a time. If the other threads materialize data faster than the single writer thread can flush it, memory usage keeps growing, as nothing prevents those threads from accumulating more data in their buffers. In the worst case, memory usage keeps increasing until the entire input has been materialized. This is a problem when writing Parquet files whose uncompressed contents do not fit in memory.

In this PR we address this issue by adding a backpressure mechanism to the batch copy. We use the temporary memory manager to reserve space for the materialized batches (with a maximum of 25% of the memory limit). As threads materialize data, we use the size of the respective ColumnDataCollections (obtained through a new AllocationSize method) to track how much data we have buffered. If we exceed the available memory, we block the thread from materializing further data. Halted threads may still help with repartitioning and preparing batches through ExecuteTasks, however.

The thread that is processing the minimum batch index can always continue, since the data that thread is materializing can be flushed to disk immediately. Note that we do not yet handle the case where a single batch index contains a very large amount of data gracefully, as we currently only repartition and flush after a batch index has been exhausted. However, this scenario is rare in practice and can only really happen when streaming directly from Parquet files that were written with very large row groups.
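The backpressure scheme described above can be sketched with a small budget class. This is a minimal, hypothetical illustration using a condition variable (the class name and interface are assumptions, not DuckDB's implementation): threads report the bytes they buffer, block once the shared budget is exhausted, and the thread holding the minimum batch index is exempt so the pipeline can always make progress.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical backpressure budget for materialized batch data.
class BatchMemoryBudget {
public:
	explicit BatchMemoryBudget(std::size_t budget_bytes) : budget(budget_bytes) {}

	// Called before a thread materializes `bytes` of batch data. Blocks while
	// the budget is exhausted, unless `is_min_batch_index` is true -- the
	// thread holding the minimum batch index may always proceed, since its
	// data can be flushed to disk immediately.
	void Reserve(std::size_t bytes, bool is_min_batch_index) {
		std::unique_lock<std::mutex> lock(mtx);
		cv.wait(lock, [&] { return is_min_batch_index || used < budget; });
		used += bytes;
	}

	// Called once a batch has been written out and its buffers freed.
	void Release(std::size_t bytes) {
		std::lock_guard<std::mutex> lock(mtx);
		used -= bytes;
		cv.notify_all();
	}

	std::size_t Used() const {
		std::lock_guard<std::mutex> lock(mtx);
		return used;
	}

private:
	mutable std::mutex mtx;
	std::condition_variable cv;
	std::size_t used = 0;
	const std::size_t budget;
};
```

Note that `Reserve` deliberately allows the budget to be overshot by one batch; blocking on `used + bytes <= budget` instead could deadlock when a single batch exceeds the whole budget.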

Max Threads

This PR adds a new MaxThreads function to the PhysicalSink, which can be used to limit the parallelism of a pipeline at the sink level. We use this for the order-preserving write, since processing data with many threads under a low memory budget is counter-productive: using fewer threads yields both better performance and a lower memory footprint. As a heuristic, the batch copy requires at least 4MB of available memory per column per thread. The number of threads is scaled down automatically when too little memory is available.
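The heuristic above can be sketched as a simple calculation. The function name and structure here are illustrative assumptions, not DuckDB's exact code; only the 4MB-per-column-per-thread constant comes from the description above.

```cpp
#include <algorithm>
#include <cstdint>

// Heuristic from the PR description: require ~4MB of available memory
// per column per thread.
static constexpr uint64_t MEMORY_PER_COLUMN_PER_THREAD = 4ULL * 1024 * 1024;

// Scale the requested thread count down to what the memory budget can
// sustain, but always allow at least one thread.
uint64_t MaxWriterThreads(uint64_t requested_threads, uint64_t column_count,
                          uint64_t available_memory) {
	uint64_t per_thread = column_count * MEMORY_PER_COLUMN_PER_THREAD;
	if (per_thread == 0) {
		return requested_threads;
	}
	uint64_t sustainable = std::max<uint64_t>(1, available_memory / per_thread);
	return std::min(requested_threads, sustainable);
}
```

For example, writing a 16-column file with a 256MB budget would sustain at most 4 threads under this heuristic, regardless of how many threads the scheduler offers.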

Tests & Benchmarks

Below are some tests and benchmarks:

Lineitem SF10 Parquet (2.5GB compressed, 5.4GB uncompressed, 10GB CSV)

| Memory Limit | v0.10.0 (10T) | New | 1 Thread | preserve_insertion_order=false |
|--------------|---------------|-------|----------|--------------------------------|
| 500MB | OOM | 30.5s | 57s | 8.7s |
| 1GB | OOM | 17.2s | 57s | 8.7s |
| 2GB | OOM | 12.3s | 57s | 8.7s |
| 3GB | OOM | 10.8s | 57s | 8.7s |
| 4GB | 8.5s | 9.5s | 57s | 8.7s |
| 5GB | 8.7s | 9.0s | 57s | 8.7s |

ClickBench Hits Parquet (14GB compressed, 36GB uncompressed, 80GB CSV)

| Memory Limit | v0.10.0 (10T) | New | 1 Thread | preserve_insertion_order=false |
|--------------|---------------|------|----------|--------------------------------|
| 2GB | OOM | OOM | 397s | 63s |
| 4GB | OOM | 197s | 397s | 63s |
| 8GB | OOM | 168s | 397s | 63s |
| 16GB | OOM | 143s | 397s | 63s |
| 32GB | OOM | 113s | 397s | 63s |
| 64GB | OOM | 93s | 397s | 63s |
| 128GB | 130s | 98s | 397s | 63s |

@Mytherin Mytherin merged commit 0301182 into duckdb:main Feb 20, 2024
@Mytherin Mytherin mentioned this pull request Feb 20, 2024
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 15, 2024
@Mytherin Mytherin deleted the preserveinsertionordermemory branch July 5, 2024 11:30