Mytherin
Collaborator

Follow-up from #11114

This PR adds a new PhysicalVerifyVector operator that is used for testing purposes. The operator emits exactly the rows it receives, but changes their physical representation (vector type and layout). There are four configuration settings:

  • Dictionary: Transform every vector into a dictionary vector where the underlying vector has gaps and is reversed (similar to what was introduced in the ExpressionExecutor in #11114)
original: FLAT [1, 2, 3]
modified: BASE: [NULL, 3, NULL, 2, NULL, 1]   OFFSETS: [5, 3, 1]
  • Constant: Decompose every DataChunk into single-row constant vectors
original: FLAT [1, 2, 3]
modified:
chunk #1 - CONSTANT [1]
chunk #2 - CONSTANT [2]
chunk #3 - CONSTANT [3]
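  • Sequence & Constant: Decompose every DataChunk into constant or sequence vectors based on the longest possible run (in the example below, a column that is neither constant nor a sequence within a chunk is shown as DICTIONARY)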
original:  a: [1, 1, 20, 15, 13]   b: [1, 10, 100, 101, 102]
modified: 
chunk #1 - a: CONSTANT [1, 1]          b: DICTIONARY [1, 10]
chunk #2 - a: DICTIONARY [20, 15, 13]  b: SEQUENCE [100, 101, 102]
  • Nested Shuffle: Reshuffle list vectors so that offsets are not contiguous
original: [[1, 2], [3, 4]] - BASE: [1, 2, 3, 4] LISTS: [offset: 0, length: 2][offset: 2, length: 2]
modified: [[1, 2], [3, 4]] - BASE: [3, 4, 1, 2] LISTS: [offset: 2, length: 2][offset: 0, length: 2]
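The Dictionary transformation above can be illustrated with a small standalone sketch. It does not use DuckDB's actual Vector classes: `DictionarySketch`, `ToGappedReversedDictionary` and `Materialize` are hypothetical names introduced only for illustration, and the slot layout is inferred from the `[1, 2, 3]` example rather than taken from the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a dictionary vector (not DuckDB's Vector class):
// a base array plus one offset per row pointing into that base.
struct DictionarySketch {
    std::vector<std::optional<int>> base; // std::nullopt models a NULL gap
    std::vector<std::size_t> offsets;
};

// Place the rows in reverse order with a NULL gap before each value,
// so [1, 2, 3] becomes BASE [NULL, 3, NULL, 2, NULL, 1], OFFSETS [5, 3, 1].
DictionarySketch ToGappedReversedDictionary(const std::vector<int> &flat) {
    const std::size_t n = flat.size();
    DictionarySketch result;
    result.base.assign(2 * n, std::nullopt);
    result.offsets.resize(n);
    for (std::size_t i = 0; i < n; i++) {
        const std::size_t slot = 2 * (n - 1 - i) + 1; // odd slots, back to front
        result.base[slot] = flat[i];
        result.offsets[i] = slot;
    }
    return result;
}

// Read the rows back through the offsets; this must reproduce the input.
std::vector<int> Materialize(const DictionarySketch &dict) {
    std::vector<int> out;
    out.reserve(dict.offsets.size());
    for (std::size_t offset : dict.offsets) {
        assert(offset < dict.base.size() && dict.base[offset].has_value());
        out.push_back(*dict.base[offset]);
    }
    return out;
}
```

The point of the layout is that code which ignores the offsets and reads the base array directly sees NULLs and reversed values, so it fails visibly instead of passing by accident.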
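The Nested Shuffle transformation can be sketched in the same spirit. Again this is a simplified model, not the operator's code: `ListEntrySketch`, `ListVectorSketch`, `ShuffleLists` and `ReadLists` are hypothetical names, and writing the lists back to front is just one way to obtain the non-contiguous offsets shown in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified list vector: one (offset, length) entry per row
// pointing into a shared child array.
struct ListEntrySketch {
    std::size_t offset;
    std::size_t length;
};

struct ListVectorSketch {
    std::vector<int> child;
    std::vector<ListEntrySketch> entries;
};

// Rewrite the child array with the lists stored in reverse order, so the
// logical lists stay the same but their offsets are no longer ascending.
ListVectorSketch ShuffleLists(const ListVectorSketch &input) {
    ListVectorSketch result;
    result.child.reserve(input.child.size());
    result.entries.resize(input.entries.size());
    for (std::size_t i = input.entries.size(); i-- > 0;) {
        const ListEntrySketch &entry = input.entries[i];
        assert(entry.offset + entry.length <= input.child.size());
        result.entries[i] = {result.child.size(), entry.length};
        const auto first = input.child.begin() + static_cast<std::ptrdiff_t>(entry.offset);
        result.child.insert(result.child.end(), first,
                            first + static_cast<std::ptrdiff_t>(entry.length));
    }
    return result;
}

// Materialize the logical lists by following each entry into the child array.
std::vector<std::vector<int>> ReadLists(const ListVectorSketch &lists) {
    std::vector<std::vector<int>> out;
    for (const ListEntrySketch &entry : lists.entries) {
        const auto first = lists.child.begin() + static_cast<std::ptrdiff_t>(entry.offset);
        out.emplace_back(first, first + static_cast<std::ptrdiff_t>(entry.length));
    }
    return out;
}
```

Code that assumes list `i + 1` starts where list `i` ends reads the wrong elements from the shuffled child array, which is the kind of bug this setting is meant to surface.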

Usage

The VERIFY_VECTOR setting can be used to enable a specific configuration, e.g.:

VERIFY_VECTOR=dictionary_expression make debug
VERIFY_VECTOR=dictionary_operator make debug
VERIFY_VECTOR=constant_operator make debug
VERIFY_VECTOR=sequence_operator make debug
VERIFY_VECTOR=nested_shuffle make debug

Member

Maxxen commented Mar 13, 2024

Amazing!

Contributor

carlopi commented Mar 14, 2024

This is cool!

One question: could it make sense to introduce a random/mixed mode where all the verification steps are applied together?

This looks like madness since it becomes hard to reproduce, BUT on the other hand it could help uncover even more cases by exploring more of the encoding space.

Or maybe this mixed mode could be controlled at the SQL level? The idea would be for all this logic to be exposed to duckdb, and on the SQL side there could be some global configuration that controls which verifiers are turned on at a given moment. I was thinking of the fuzzer, for example: instead of having to decide at compile time which mode to fuzz, it could randomize at runtime by issuing a SET vector_verification='dictionary'; before a given query.

This can also be a follow up, given this looks ready to be merged already.

Collaborator Author

Mytherin commented Mar 14, 2024

Making this an option with SET makes sense and is likely a better way of doing this. I thought about a random mode. I think it does make sense, but the only problem is reproducibility, as you mentioned. We could do this in the fuzzer, but we would then need some way of tagging issues as being "randomly reproducible"; otherwise the fuzzer would open issues and then close them after failing to reproduce them.
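One possible way to soften the reproducibility concern, sketched here as an illustration and not as part of this PR, is to derive the random choice from a seed that the fuzzer records with each issue. The mode names come from the Usage section; `PickVerifyMode` and `BuildSetStatement` are hypothetical helpers, and `vector_verification` is the setting name proposed in the comment above, not an existing option.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// Mode names as listed in the Usage section of this PR.
static const std::array<const char *, 5> VERIFY_MODES = {
    "dictionary_expression", "dictionary_operator", "constant_operator",
    "sequence_operator", "nested_shuffle"};

// Hypothetical helper: pick the mode for the query_index-th query of a
// fuzzer run. The choice depends only on (seed, query_index), so logging
// both values with an issue is enough to replay the same mode later.
std::string PickVerifyMode(std::uint32_t seed, std::size_t query_index) {
    std::mt19937 rng(seed);
    rng.discard(query_index); // skip ahead to this query's draw
    // Plain modulo keeps the result identical across standard libraries,
    // at the cost of a slight bias that does not matter for this purpose.
    return VERIFY_MODES[rng() % VERIFY_MODES.size()];
}

// Hypothetical: the statement a fuzzer might issue before the query,
// assuming a vector_verification setting existed.
std::string BuildSetStatement(const std::string &mode) {
    assert(!mode.empty());
    return "SET vector_verification='" + mode + "';";
}
```

With this scheme an issue would carry the seed and query index, and a reproduction run would call `PickVerifyMode` with the same pair instead of drawing a fresh random mode.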

@Mytherin Mytherin merged commit 09a8df0 into duckdb:main Mar 14, 2024
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 17, 2024
Merge pull request duckdb/duckdb#11138 from Mytherin/morevectorverification
Merge pull request duckdb/duckdb#11148 from carlopi/fix_upload_assets
@maiadegraaf maiadegraaf mentioned this pull request Mar 18, 2024
@Mytherin Mytherin deleted the morevectorverification branch June 7, 2024 12:53