Skip to content

Conversation

samansmink
Copy link
Contributor

In v1.1.0, the addition of the C Extension API has caused the extension binaries to increase quite significantly:

name v1.0.0 uncompressed v1.1.0 uncompressed
json 29M 77M
spatial 55M 110M
sqlite 27M 77M
prql 18M 40M

The reason is that due to the way we build extensions, we statically link duckdb in its entirety into the extension binary, then rely on the linker being able to prune the unused parts. What this means is that its quite easy unknowingly link in significant parts of DuckDB causing the extension binary to significantly grow.

This is precisely what happened when someone (possibly me) wrote some code that produces a struct containing all C API function pointers. Since the C API touches a lot of things throughout DuckDB this prevented the linker from being able to prune much.

The solution is to generate the struct of function pointers centrally in the DatabaseInstance, then hand out pointers to those.

@Mytherin I think we have 2 options: either put this behind a lock and lazily initialize, or initialize this in DatabaseInstance::Initialize then pass const pointers, I went with the latter.

Also I added a regression test for this that tests the size of [json,parquet,tpch,tpcds]. (theshold for check failure is currently at 20%).

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 1, 2024 17:20
@samansmink samansmink marked this pull request as ready for review October 1, 2024 18:16
Copy link
Contributor

@carlopi carlopi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a note on improving the test (that currently is 1-1, while in this PR should still be significant, see https://github.com/duckdb/duckdb/actions/runs/11131102326/job/30931920185?pr=14185), and another one on the fact that GetAPIV0 is basically constexpr, so I would assume basically has no cost (and we expect you to use it)

Thanks!!

Copy link
Contributor

@carlopi carlopi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's a "request changes", due to the test that is not super informative otherwise

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 2, 2024 08:22
@samansmink samansmink marked this pull request as ready for review October 2, 2024 08:22
@Mytherin
Copy link
Collaborator

Mytherin commented Oct 3, 2024

Thanks for the PR! Great find.

If I understand correctly the problem is that CreateAPIv0 is called in a static method (GetAPI) - when elsewhere we get a function pointer to that static method (in CreateAccessStruct) - likely preventing that call from being optimized out - causing the entire C API to always be statically linked in/never pruned.

Given that this is the problem - I wonder if we can simplify this solution. Do we need to hand out pointers to the struct that resides in the DatabaseInstance, or could we also just make a copy? That would avoid the need for locking and avoids introducing additional unnecessary lifetime problems.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 4, 2024 08:02
@samansmink
Copy link
Contributor Author

@Mytherin do you mean like this? (7edb53d)

I think the convention should be that extensions copy the struct over into some global variable anyway, at least thats what the macro's in the duckdb_extensions.h do anyway. However I could see how leaving the pointers valid could results in people incorrectly assuming they are always valid and lifetime issues arising. To resolve that I have made an extra copy that ensures the pointer to the function pointer struct that the extension sees during initialization will only be valid during the initialization forcing the extension to copy it.

@Mytherin Mytherin marked this pull request as ready for review October 4, 2024 08:51
@carlopi
Copy link
Contributor

carlopi commented Oct 8, 2024

@Mytherin: I think this is ready

Results (as per the added check) are:

- checking 'tpch': old size=41925566, new_size=14761342
- checking 'json': old size=41906006, new_size=34334838
- checking 'parquet': old size=41953918, new_size= 36586230
- checking 'tpcds': old size=41962926, new_size=16051550

@samansmink
Copy link
Contributor Author

1 sec, I will try to refactor it so that we don't need to store the struct in the databaseinstance.

I think either through using a virtual method or storing the function pointer to the CreateAPIv0() in the databaseinstance that should work

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 8, 2024 13:26
@samansmink samansmink marked this pull request as ready for review October 8, 2024 13:29
@samansmink
Copy link
Contributor Author

@Mytherin Tried making a virtual DatabaseInstance::GetExtensionAPIV0() method, but that resulting in some weird crashes on loading extensions.

Went with the function pointer approach, which seems clean enough to me

@Mytherin
Copy link
Collaborator

Mytherin commented Oct 8, 2024

That's cleaner - but I wonder if this does not also work? In that case we would not need to store anything in the database instance.

const duckdb_ext_api_v0 DatabaseInstance::GetExtensionAPIV0() {
	return CreateAPIv0();
}

@samansmink
Copy link
Contributor Author

No that will get the size right back up where it was

@Mytherin
Copy link
Collaborator

Mytherin commented Oct 8, 2024

Alright, let's get this in as-is then. Thanks!

@samansmink
Copy link
Contributor Author

samansmink commented Oct 8, 2024

Let's do a final double check for the binary sizes before merge. I checked locally, but better safe than sorry

link to relevant ci job

@samansmink
Copy link
Contributor Author

sizes seem all good:

 - checking 'tpch': old size=41913902, new_size=14765854
 - checking 'json': old size=41894342, new_size=34335206
 - checking 'parquet': old size=41942070, new_size=36578406
 - checking 'tpcds': old size=41951254, new_size=16051966

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 9, 2024 19:51
@samansmink samansmink marked this pull request as ready for review October 9, 2024 19:51
@Mytherin Mytherin merged commit 52b19d5 into duckdb:main Oct 11, 2024
43 checks passed
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Oct 16, 2024
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Oct 16, 2024
Fix extension size increase (duckdb/duckdb#14185)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants