-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Requiring keys provided to map
to be unique
#3760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Maybe I should change the GetHistogramFunction so it doesn't always throw an exception if the type isn't supported so I can handle it at the call site and throw a more descriptive exception? |
Some of the CI tests are failing because of the Histogram aggregate issue I mentioned, might be worth it to temporarily create an inefficient brute check for uniqueness for those types? Or just allow non-unique keys for those types for now |
@pdet What do you think is the right course of action to take for now? Should I allow non-unique keys for types that aren't supported by the Histogram aggregate yet? |
What types are missing? Only nested types? |
I'm not sure, when I comment out the template <bool IS_ORDERED = true>
AggregateFunction GetHistogramFunction(const LogicalType &type) {
switch (type.id()) {
case LogicalType::BOOLEAN:
return GetMapType<HistogramFunctor, bool, IS_ORDERED>(type);
case LogicalType::UTINYINT:
return GetMapType<HistogramFunctor, uint8_t, IS_ORDERED>(type);
case LogicalType::USMALLINT:
return GetMapType<HistogramFunctor, uint16_t, IS_ORDERED>(type);
case LogicalType::UINTEGER:
return GetMapType<HistogramFunctor, uint32_t, IS_ORDERED>(type);
case LogicalType::UBIGINT:
return GetMapType<HistogramFunctor, uint64_t, IS_ORDERED>(type);
case LogicalType::TINYINT:
return GetMapType<HistogramFunctor, int8_t, IS_ORDERED>(type);
case LogicalType::SMALLINT:
return GetMapType<HistogramFunctor, int16_t, IS_ORDERED>(type);
case LogicalType::INTEGER:
return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
case LogicalType::BIGINT:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::FLOAT:
return GetMapType<HistogramFunctor, float, IS_ORDERED>(type);
case LogicalType::DOUBLE:
return GetMapType<HistogramFunctor, double, IS_ORDERED>(type);
case LogicalType::VARCHAR:
return GetMapType<HistogramStringFunctor, string, IS_ORDERED>(type);
case LogicalType::TIMESTAMP:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::TIMESTAMP_TZ:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::TIMESTAMP_S:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::TIMESTAMP_MS:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::TIMESTAMP_NS:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::TIME:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::TIME_TZ:
return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
case LogicalType::DATE:
return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
default:
throw InternalException("Unimplemented histogram aggregate");
}
} @pdet TL;DR yes probably only nested types aren't supported |
I think it is fine if you just throw an error if it's not one of these types. |
That is also an option haha hadn't considered that one if I'm honest. |
@pdet Could you take a look at this, see if there's anything I still need to fix/change? |
src/function/scalar/map/map.cpp
Outdated
bool KeyListIsEmpty(list_entry_t *data, idx_t rows) { | ||
for (idx_t i = 0; i < rows; i++) { | ||
auto size = data[i].length; | ||
if (size != 0) { | ||
return false; | ||
} | ||
} | ||
return true; | ||
} | ||
|
||
static void CheckKeyValidity(DataChunk &args) { | ||
auto types = args.GetTypes(); | ||
if (types.empty()) { | ||
return; | ||
} | ||
auto key_type = ListType::GetChildType(types[0]).id(); | ||
auto &array = args.data[0]; | ||
auto arg_data = FlatVector::GetData<list_entry_t>(array); | ||
auto &entries = ListVector::GetEntry(array); | ||
if (key_type == LogicalTypeId::SQLNULL) { | ||
if (KeyListIsEmpty(arg_data, args.size())) { | ||
return; | ||
} | ||
// The entire key list is NULL for one (or more) of the rows: (ARRAY[NULL, NULL, NULL]) | ||
throw InvalidInputException("Map keys can not be NULL"); | ||
} | ||
|
||
VectorData list_data; | ||
auto count = ListVector::GetListSize(array); | ||
args.data[0].Orrify(args.size(), list_data); | ||
auto validity = FlatVector::Validity(entries); | ||
if (!validity.CheckAllValid(count)) { | ||
throw InvalidInputException("Map keys can not be NULL"); | ||
} | ||
} | ||
|
||
static void CheckForKeyUniqueness(DataChunk &args, ExpressionState &state) { | ||
// Create a copy of the arguments | ||
auto types = args.GetTypes(); | ||
if (types.empty() || ListType::GetChildType(types[0]).id() == LogicalType::SQLNULL) { | ||
return; | ||
} | ||
|
||
auto arg_data = FlatVector::GetData<list_entry_t>(args.data[0]); | ||
|
||
DataChunk keys; | ||
keys.Initialize(args.GetTypes()); | ||
args.Copy(keys); | ||
|
||
// Split the copy to separate the keys | ||
DataChunk remaining_columns; | ||
keys.Split(remaining_columns, 1); | ||
|
||
Vector unique_result(LogicalType::UBIGINT, args.size()); | ||
ListUniqueFunction(keys, state, unique_result); | ||
for (idx_t i = 0; i < args.size(); i++) { | ||
auto keys_length = arg_data[i].length; | ||
auto unique_keys = FlatVector::GetValue<uint64_t>(unique_result, i); | ||
if (unique_keys != keys_length) { | ||
throw InvalidInputException("Map keys have to be unique!"); | ||
} | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicking, wouldn't it be better if these are
AreKeysUnique and AreKeysValid functions that return a bool with the exception being thrown in the MapFunction?
Also, how are these functions called when you have a map type not created by SQL?
e.g., if you have a malformed pyarrow table with a map type and then consume it through duckdb, I think these functions won't be called, am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm that could definitely be the case, I will test those cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map_type = pa.map_(pa.int32(), pa.int32())
values = [
[
(3, 12),
(3, 21)
],
[
(5, 42)
]
]
arrow_table = pa.table(
{'detail': pa.array(values, map_type)}
)
rel = duckdb.from_arrow(arrow_table).fetchall()
[({'key': [3, 3], 'value': [12, 21]},), ({'key': [5], 'value': [42]},)]
Your hunch was correct, this error isn't caught
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the error is caught and tested for 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Some more comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Looking great. Some more comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Almost there. Some more pointers:
src/function/scalar/map/map.cpp
Outdated
|
||
namespace duckdb { | ||
|
||
// TODO: this doesn't recursively verify maps if maps are nested | ||
void VerifyMap(Vector &map, idx_t count, const SelectionVector &sel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split this function into a function that returns an enum class MapInvalidReason : uint8_t { MAP_VALID, MAP_NULL_KEY_LIST, MAP_NULL_KEY, MAP_NOT_UNIQUE };
and a function that throws the InvalidInputExceptions
?
in Vector::Verify
we want to throw an InternalException
if the map is not valid, and not an InvalidInputException
. We want to do something like:
D_ASSERT(VerifyMapInternal(map, count, sel) == MapInvalidReason::MAP_VALID);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
I turned VerifyMap into a static method of Vector, that just asserts that it's valid
CheckMapValidity now returns the enum
and I've added two functions that handle the errors
One in arrow_conversion.cpp and one in map.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the ones in Arrow conversion should never be hit, because the only thing we don't share is the unique requirement, but I added them for completeness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates - last comment, then it is good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks, now all that's left is to pray to the CI gods to be merciful |
Looks like they were. Great, thanks! |
This PR adds a requirement to the
map
function as discussed in #3640It uses the
list_unique
function to verify that all keys are unique, that's why it is also subject to the same limitations as list_unique, being that the type provided has to be supported by the histogram aggregrate function.That's why I've changed that exception type in
GetHistogramFunction
to InvalidInputException and made it slightly more descriptiveIf the provided keys are not unique, an InvalidInputException will be thrown and the operation is aborted.