Description
What happens?
Inserting rows from a collection of Parquet files, INSERT INTO ... BY NAME (SELECT ... FROM read_parquet(...)), fails: after the progress bar reads 100%, the statement dies with a segmentation fault in RowVersionManager::CommitAppend.
Expected behaviour is that the statement succeeds, or fails with an error message.
DuckDB v0.9.1, the cldebug build, running under gdb:
```
(gdb) run occ.db
Starting program: /ddsf/duckdb occ.db
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
v0.9.1 401c8061c6
Enter ".help" for usage hints.
D
D SELECT COUNT(1) from read_parquet('data/*');
[New Thread 0x7ffff73ff640 (LWP 102197)]
[New Thread 0x7ffff6bfe640 (LWP 102198)]
[New Thread 0x7ffff63fd640 (LWP 102199)]
[New Thread 0x7ffff5bfc640 (LWP 102200)]
[New Thread 0x7ffff53fb640 (LWP 102201)]
[New Thread 0x7ffff4bfa640 (LWP 102202)]
[New Thread 0x7ffff43f9640 (LWP 102203)]
100% ▕████████████████████████████████████████████████████████████▏
┌────────────┐
│ count(1) │
│ int64 │
├────────────┤
│ 2613427383 │
└────────────┘
D
D create table "occ"("i" int8, "tax" int8, "lon" double, "lat" double);
D insert into "occ" by name (select "gbifid" "i", "specieskey" "tax", "decimallatitude" "lat", "decimallongitude" "lon" from read_parquet('data/*') where decimallattitude is not null and decimallongitude is not null and specieskey>0);
100% ▕████████████████████████████████████████████████████████████▏
Thread 1 "duckdb" received signal SIGSEGV, Segmentation fault.
0x0000555557f8b79a in duckdb::RowVersionManager::CommitAppend (this=0x55555b33c810, commit_id=72, row_group_start=0, count=0) at /home/gbsim/duckdb/src/storage/table/row_version_manager.cpp:124
124	info->CommitAppend(commit_id, vstart, vend);
(gdb) bt
#0 0x0000555557f8b79a in duckdb::RowVersionManager::CommitAppend (this=0x55555b33c810, commit_id=72,
row_group_start=0, count=0) at /home/gbsim/duckdb/src/storage/table/row_version_manager.cpp:124
#1 0x0000555557f83e2a in duckdb::RowGroup::CommitAppend (this=0x7fffd741c300, commit_id=72, row_group_start=0,
count=0) at /home/gbsim/duckdb/src/storage/table/row_group.cpp:641
#2 0x0000555557f87f98 in duckdb::RowGroupCollection::CommitAppend (this=0x55555af15780, commit_id=72, row_start=0,
count=2366250032) at /home/gbsim/duckdb/src/storage/table/row_group_collection.cpp:419
#3 0x0000555557ffa048 in duckdb::DataTable::CommitAppend (this=0x55555aee46b0, commit_id=72, row_start=0,
count=2366250032) at /home/gbsim/duckdb/src/storage/data_table.cpp:819
#4 0x0000555558050a6a in duckdb::CommitState::CommitEntry<true> (this=0x7fffffff7d50,
type=duckdb::UndoFlags::INSERT_TUPLE, data=0x7ffff7fb5008 "\260F\356ZUU")
at /home/gbsim/duckdb/src/transaction/commit_state.cpp:274
#5 0x000055555804c4b1 in operator() (__closure=0x7fffffff7d48, type=duckdb::UndoFlags::INSERT_TUPLE,
data=0x7ffff7fb5008 "\260F\356ZUU") at /home/gbsim/duckdb/src/transaction/undo_buffer.cpp:142
#6 0x000055555804f65e in duckdb::UndoBuffer::IterateEntries<duckdb::UndoBuffer::Commit(duckdb::UndoBuffer::IteratorState&, duckdb::optional_ptr<duckdb::WriteAheadLog>, duckdb::transaction_t)::<lambda(duckdb::UndoFlags, duckdb::data_ptr_t)> >(duckdb::UndoBuffer::IteratorState &, struct {...} &&) (this=0x55555aeda0e0, state=..., callback=...)
at /home/gbsim/duckdb/src/transaction/undo_buffer.cpp:45
#7 0x000055555804c594 in duckdb::UndoBuffer::Commit (this=0x55555aeda0e0, iterator_state=..., log=..., commit_id=72)
at /home/gbsim/duckdb/src/transaction/undo_buffer.cpp:142
#8 0x000055555804b485 in duckdb::DuckTransaction::Commit[abi:cxx11](duckdb::AttachedDatabase&, unsigned long, bool) (
this=0x55555aeda060, db=..., commit_id=72, checkpoint=true)
at /home/gbsim/duckdb/src/transaction/duck_transaction.cpp:133
#9 0x000055555804a2c4 in duckdb::DuckTransactionManager::CommitTransaction[abi:cxx11](duckdb::ClientContext&, duckdb::Transaction*) (this=0x55555ae095d0, context=..., transaction_p=0x55555aeda060)
at /home/gbsim/duckdb/src/transaction/duck_transaction_manager.cpp:215
#10 0x000055555804bbf6 in duckdb::MetaTransaction::Commit[abi:cxx11]() (this=0x55555aef0800)
at /home/gbsim/duckdb/src/transaction/meta_transaction.cpp:60
#11 0x000055555804cc19 in duckdb::TransactionContext::Commit (this=0x55555ae09880)
at /home/gbsim/duckdb/src/transaction/transaction_context.cpp:48
#12 0x0000555557ce2a1e in duckdb::ClientContext::EndQueryInternal (this=0x55555ae09690, lock=..., success=true,
invalidate_transaction=false) at /home/gbsim/duckdb/src/main/client_context.cpp:182
#13 0x0000555557ce2fe0 in duckdb::ClientContext::CleanupInternal (this=0x55555ae09690, lock=...,
result=0x55555afa19a0, invalidate_transaction=false) at /home/gbsim/duckdb/src/main/client_context.cpp:216
#14 0x0000555557ce1e6e in duckdb::ClientContext::FetchInternal (this=0x55555ae09690, lock=..., executor=...,
result=...) at /home/gbsim/duckdb/src/main/client_context.cpp:101
#15 0x0000555557ce36d6 in duckdb::ClientContext::FetchResultInternal (this=0x55555ae09690, lock=..., pending=...)
at /home/gbsim/duckdb/src/main/client_context.cpp:272
#16 0x0000555557cf9011 in duckdb::PendingQueryResult::ExecuteInternal (this=0x55555ae7f280, lock=...)
at /home/gbsim/duckdb/src/main/pending_query_result.cpp:64
#17 0x0000555557cf90d3 in duckdb::PendingQueryResult::Execute (this=0x55555ae7f280)
at /home/gbsim/duckdb/src/main/pending_query_result.cpp:71
#18 0x0000555557cf9a1a in duckdb::PreparedStatement::Execute (this=0x55555af166e0, values=...,
allow_stream_result=false) at /home/gbsim/duckdb/src/main/prepared_statement.cpp:85
#19 0x00005555576a8c8d in duckdb_shell_sqlite3_print_duckbox (pStmt=0x55555ae80f20, max_rows=40, max_width=0,
null_value=0x7fffffffce54 "", columnar=0)
at /home/gbsim/duckdb/tools/sqlite3_api_wrapper/sqlite3_api_wrapper.cpp:244
#20 0x000055555768b2fa in exec_prepared_stmt (pArg=0x7fffffffcd40, pStmt=0x55555ae80f20)
at /home/gbsim/duckdb/tools/shell/shell.c:12901
#21 0x000055555768c146 in shell_exec (pArg=0x7fffffffcd40,
zSql=0x55555ae9f120 "insert into \"occ\" by name (select \"gbifid\" \"i\", \"specieskey\" \"tax\", \"decimallatitude\" \"lat\", \"decimallongitude\" \"lon\" from read_parquet('data/*') where decimallatitude is not null and decimallongitude"..., pzErrMsg=0x7fffffffcbb8) at /home/gbsim/duckdb/tools/shell/shell.c:13236
#22 0x0000555557699af3 in runOneSqlLine (p=0x7fffffffcd40,
zSql=0x55555ae9f120 "insert into \"occ\" by name (select \"gbifid\" \"i\", \"specieskey\" \"tax\", \"decimallatitude\" \"lat\", \"decimallongitude\" \"lon\" from read_parquet('data/*') where decimallatitude is not null and decimallongitude"..., in=0x0, startline=4) at /home/gbsim/duckdb/tools/shell/shell.c:19654
#23 0x000055555769a09f in process_input (p=0x7fffffffcd40) at /home/gbsim/duckdb/tools/shell/shell.c:19772
#24 0x000055555769bf4d in main (argc=2, argv=0x7fffffffe078) at /home/gbsim/duckdb/tools/shell/shell.c:20585
```
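One detail in the trace that may be relevant (my reading, not established): frame #2 commits count=2366250032 rows, which exceeds INT32_MAX, while frames #0 and #1 see count=0. Together with the source dataset having more than 2^31 rows, this would be consistent with a row count overflowing a 32-bit integer somewhere in the append/commit path. A quick arithmetic check:

```bash
# Hedged aside (my hypothesis, not confirmed): the row count committed in
# frame #2 exceeds INT32_MAX by this many rows.
echo $(( 2366250032 - (2**31 - 1) ))   # prints 218766385
```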
To Reproduce
I have not succeeded in reproducing this with synthetic parquet data, or by selecting just one of the columns, but I can reproduce the issue as follows.
I'm using DuckDB 0.9.1 from the git repository (git checkout v0.9.1), running on an x86_64 virtual server with 32 GB RAM and 8 virtual cores.
- Download the dataset to a local directory:
  ```
  wget https://gbif-open-data-eu-central-1.s3.amazonaws.com/occurrence/2023-11-01/occurrence.parquet/{000000..002459}
  ```
  This presumes bash-like brace expansion (a quick sanity check follows this list) and downloads 2460 parquet files totaling 2613427383 (≈ 1.2 · 2^31) rows, a 210 GB download. The resulting DuckDB database, after discarding most columns, would be some 55 GB.
- Create a table:
  ```sql
  CREATE TABLE "occ"("i" INT8, "tax" INT8, "lon" DOUBLE, "lat" DOUBLE);
  ```
- (Now, `SELECT COUNT(1) FROM read_parquet('data/*');` should count 2613427383 rows.)
- The following statement triggers the segmentation fault, after reading to 100%:
  ```sql
  INSERT INTO "occ" BY NAME (
      SELECT "gbifid" "i", "specieskey" "tax", "decimallatitude" "lat", "decimallongitude" "lon" FROM read_parquet('data/*')
      WHERE "decimallatitude" IS NOT NULL AND "decimallongitude" IS NOT NULL AND "specieskey">0
  );
  ```
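The sanity check promised in the download step, assuming bash 4+ (which handles zero-padded ranges) and the file names from the wget URL above:

```bash
# Should print 2460, one name per downloaded parquet file.
echo {000000..002459} | wc -w
```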
Things I've experimented with:
- ❌ `SET memory_limit='250MB'`: the same segmentation fault occurs.
- ✔ The same INSERT statement, repeated (from a bash loop) for each of the parquet files: succeeds (sketched below).
- ✔ `SET threads TO 1`: the statement succeeds.
- ✔ Extracting only one column (the ID) to a one-column table: succeeds.
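For reference, a minimal sketch of the per-file workaround from the second ✔ item above, assuming the files were downloaded to ./data as plain numbered objects and that `duckdb` is the shell built from the v0.9.1 checkout:

```bash
# Insert each parquet file in its own statement/transaction; this
# variant completed without the segmentation fault.
for f in data/*; do
  duckdb occ.db "INSERT INTO \"occ\" BY NAME (
      SELECT \"gbifid\" \"i\", \"specieskey\" \"tax\",
             \"decimallatitude\" \"lat\", \"decimallongitude\" \"lon\"
      FROM read_parquet('$f')
      WHERE \"decimallatitude\" IS NOT NULL
        AND \"decimallongitude\" IS NOT NULL
        AND \"specieskey\" > 0);"
done
```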
OS:
Ubuntu 22.04 on an x86_64 virtual server.
DuckDB Version:
git checkout v0.9.1 and v0.9.2-dev256 2e6a379
DuckDB Client:
included DuckDB shell
Full Name:
Thomas Wissen
Affiliation:
none
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have