Skip to content

Conversation

peterboncz
Copy link
Contributor

@peterboncz peterboncz commented Mar 14, 2025

binary serializer asserts tags and keys to be unique and this assert was violated by the new compressed vector serialization

many apologies for assuming that this had been tested by DataChunk::Verify() -- it was not because test runs still use v0.10.2 as standard serialization compatibility.

This came out in integration testing in MotherDuck, where we have now advanced to v1.2.0 (note that this would still not trigger the new code as that would happen in v1.3.0 -- but in the MotherDuck cherry-pick of this PR we have switched it off by default but can enable it with a extension SET variable -- for testing purpose).

fix:

  • use different keys (200,201,202,203) for the properties of compressed vectors
  • remove the -optional- vector_type serialization for flat vectors (because it then would repeat for constant and dictionary vectors)

In addition, I enabled the use of compressed vectors in composed types array, list and struct.

peter and others added 4 commits March 7, 2025 22:37
assert was violated by the new compressed vector serialization
(WIP: investigating why this was not detected during testing..)

fix:
- use different keys (200,201,202,203) for the properties of
  compressed vectors
- remove the -optional- vector_type serialization for flat vectors
  (because it then would repeat for constant and dictionary vectors)
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 14, 2025 12:32
@peterboncz peterboncz marked this pull request as ready for review March 14, 2025 12:59
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I think this is still not quite correct though:

@@ -1352,7 +1351,7 @@ void Vector::Serialize(Serializer &serializer, idx_t count, bool compressed_seri
void Vector::Deserialize(Deserializer &deserializer, idx_t count) {
auto &logical_type = GetType();
const auto vtype = // older versions that only supported flat vectors did not serialize vector_type,
deserializer.ReadPropertyWithExplicitDefault<VectorType>(99, "vector_type", VectorType::FLAT_VECTOR);
deserializer.ReadPropertyWithExplicitDefault<VectorType>(200, "vector_type", VectorType::FLAT_VECTOR);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For flat vectors we are now reading 200 followed by reading 100 - this is incorrect (these should always be incremental).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange: had not run into any assert on that

but I can bring the numbering down to e.g. 90

Copy link
Contributor Author

@peterboncz peterboncz Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, renumbered from 90 on. It uses shouldSerialize(5), which would be the serialization compatibility version v1.3.0 -- which does not exist yet (therefore my bug did not do damage, pfweh).

Because I considered myself not to be in the position to add that serialization mode at the time, I did not do that. If you want me to, I could also make that change.

Copy link
Contributor

@Tishj Tishj Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also adding compatibility version v1.3.0 here
Along with some changes to the benchmark runner and the unit tester to support specifying the storage version for the db - as we default to storage compatibility version 0.10.2 I believe.

Perhaps we should move those changes to a separate PR as the DICT_FSST stuff is not quite ready to be merged yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterboncz / @Tishj : #16800 is now on main, and added 1.3.0 version, so that could now be used

Copy link
Contributor Author

@peterboncz peterboncz Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @carlopi - I think that this PR is still good with shouldSerialize(5) then (which is v1.3).
Thanks all for creating that.

(BTW - @Tishj great work on FSST_DICT! Omid A. at CWI has plans to push FSST into sink ops..)

@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 17, 2025 23:07
@peterboncz peterboncz marked this pull request as ready for review March 19, 2025 18:53
@peterboncz peterboncz changed the title make sure tags/keys do not repeat in compressed vector serialization compressed vector serialization fixes Mar 21, 2025
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 21, 2025 13:15
and one indentation level less for the code (is a win ;-)
@peterboncz peterboncz marked this pull request as ready for review March 23, 2025 15:24
Mytherin added a commit that referenced this pull request Mar 25, 2025
Three changes:
* cherry-picking #16799 already
PR'd against `v1.2-histrionicus`
* add storage_version 66 alias `v1.3.0`
* change the results of `SELECT tags['storage_version'] FROM
duckdb_databases()` so that only lower bound is returned
this should make so that result is stable across duckdb versions, and as
a side effect tests will also be valid across duckdb version
unsure if adding a table function with the whole mapping might be of
use, that would allow via SQL to reconstruct the full bounds.

This PR do not introduce any change that is actually backward
incompatible and requiring a version bump, but there are in-flight PRs
such as #15637 or
#16648 that can then make use of
this
@Mytherin Mytherin merged commit 6eb21b1 into duckdb:main Mar 26, 2025
49 of 50 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants