Skip to content

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Jun 3, 2022

This PR adds a requirement to the map function as discussed in #3640

It uses the list_unique function to verify that all keys are unique, that's why it is also subject to the same limitations as list_unique, being that the type provided has to be supported by the histogram aggregrate function.

# Type is not implemented in the histogram aggregrate function
statement error
select MAP(LIST_VALUE([1],[2],[3],[4]),LIST_VALUE(10,9,8,7))

That's why I've changed that exception type in GetHistogramFunction to InvalidInputException and made it slightly more descriptive

If the provided keys are not unique, an InvalidInputException will be thrown and the operation is aborted.

@Tishj
Copy link
Contributor Author

Tishj commented Jun 3, 2022

Maybe I should change the GetHistogramFunction so it doesn't always throw an exception if the type isn't supported so I can handle it at the call site and throw a more descriptive exception?

@Tishj
Copy link
Contributor Author

Tishj commented Jun 3, 2022

Some of the CI tests are failing because of the Histogram aggregate issue I mentioned, might be worth it to temporarily create an inefficient brute check for uniqueness for those types? Or just allow non-unique keys for those types for now

@Tishj
Copy link
Contributor Author

Tishj commented Jun 7, 2022

@pdet What do you think is the right course of action to take for now?

Should I allow non-unique keys for types that aren't supported by the Histogram aggregate yet?
Or should I write a (probably inefficient) check for those cases?

@pdet
Copy link
Contributor

pdet commented Jun 7, 2022

What types are missing? Only nested types?

@Tishj
Copy link
Contributor Author

Tishj commented Jun 7, 2022

I'm not sure, when I comment out the default it tells me 21 types aren't implemented, but definitely not all of those need to be implemented.

template <bool IS_ORDERED = true>
AggregateFunction GetHistogramFunction(const LogicalType &type) {

	switch (type.id()) {
	case LogicalType::BOOLEAN:
		return GetMapType<HistogramFunctor, bool, IS_ORDERED>(type);
	case LogicalType::UTINYINT:
		return GetMapType<HistogramFunctor, uint8_t, IS_ORDERED>(type);
	case LogicalType::USMALLINT:
		return GetMapType<HistogramFunctor, uint16_t, IS_ORDERED>(type);
	case LogicalType::UINTEGER:
		return GetMapType<HistogramFunctor, uint32_t, IS_ORDERED>(type);
	case LogicalType::UBIGINT:
		return GetMapType<HistogramFunctor, uint64_t, IS_ORDERED>(type);
	case LogicalType::TINYINT:
		return GetMapType<HistogramFunctor, int8_t, IS_ORDERED>(type);
	case LogicalType::SMALLINT:
		return GetMapType<HistogramFunctor, int16_t, IS_ORDERED>(type);
	case LogicalType::INTEGER:
		return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
	case LogicalType::BIGINT:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::FLOAT:
		return GetMapType<HistogramFunctor, float, IS_ORDERED>(type);
	case LogicalType::DOUBLE:
		return GetMapType<HistogramFunctor, double, IS_ORDERED>(type);
	case LogicalType::VARCHAR:
		return GetMapType<HistogramStringFunctor, string, IS_ORDERED>(type);
	case LogicalType::TIMESTAMP:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::TIMESTAMP_TZ:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::TIMESTAMP_S:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::TIMESTAMP_MS:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::TIMESTAMP_NS:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::TIME:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::TIME_TZ:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalType::DATE:
		return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
	default:
		throw InternalException("Unimplemented histogram aggregate");
	}
}

@pdet TL;DR yes probably only nested types aren't supported

@pdet
Copy link
Contributor

pdet commented Jun 7, 2022

I think it is fine if you just throw an error if it's not one of these types.

@Tishj
Copy link
Contributor Author

Tishj commented Jun 7, 2022

That is also an option haha hadn't considered that one if I'm honest.
I'll need to change some tests that use MAP with keys that aren't of this type then 👍

@Tishj
Copy link
Contributor Author

Tishj commented Jun 9, 2022

@pdet Could you take a look at this, see if there's anything I still need to fix/change?

Comment on lines 12 to 75
bool KeyListIsEmpty(list_entry_t *data, idx_t rows) {
for (idx_t i = 0; i < rows; i++) {
auto size = data[i].length;
if (size != 0) {
return false;
}
}
return true;
}

static void CheckKeyValidity(DataChunk &args) {
auto types = args.GetTypes();
if (types.empty()) {
return;
}
auto key_type = ListType::GetChildType(types[0]).id();
auto &array = args.data[0];
auto arg_data = FlatVector::GetData<list_entry_t>(array);
auto &entries = ListVector::GetEntry(array);
if (key_type == LogicalTypeId::SQLNULL) {
if (KeyListIsEmpty(arg_data, args.size())) {
return;
}
// The entire key list is NULL for one (or more) of the rows: (ARRAY[NULL, NULL, NULL])
throw InvalidInputException("Map keys can not be NULL");
}

VectorData list_data;
auto count = ListVector::GetListSize(array);
args.data[0].Orrify(args.size(), list_data);
auto validity = FlatVector::Validity(entries);
if (!validity.CheckAllValid(count)) {
throw InvalidInputException("Map keys can not be NULL");
}
}

static void CheckForKeyUniqueness(DataChunk &args, ExpressionState &state) {
// Create a copy of the arguments
auto types = args.GetTypes();
if (types.empty() || ListType::GetChildType(types[0]).id() == LogicalType::SQLNULL) {
return;
}

auto arg_data = FlatVector::GetData<list_entry_t>(args.data[0]);

DataChunk keys;
keys.Initialize(args.GetTypes());
args.Copy(keys);

// Split the copy to separate the keys
DataChunk remaining_columns;
keys.Split(remaining_columns, 1);

Vector unique_result(LogicalType::UBIGINT, args.size());
ListUniqueFunction(keys, state, unique_result);
for (idx_t i = 0; i < args.size(); i++) {
auto keys_length = arg_data[i].length;
auto unique_keys = FlatVector::GetValue<uint64_t>(unique_result, i);
if (unique_keys != keys_length) {
throw InvalidInputException("Map keys have to be unique!");
}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking, wouldn't it be better if these are
AreKeysUnique and AreKeysValid functions that return a bool with the exception being thrown in the MapFunction?
Also, how are these functions called when you have a map type not created by SQL?
e.g., if you have a malformed pyarrow table with a map type and then consume it through duckdb, I think these functions won't be called, am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that could definitely be the case, I will test those cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map_type = pa.map_(pa.int32(), pa.int32())
values = [
	[
		(3, 12),
		(3, 21)
	],
	[
		(5, 42)
	]
]
arrow_table = pa.table(
	{'detail': pa.array(values, map_type)}
)
rel = duckdb.from_arrow(arrow_table).fetchall()
[({'key': [3, 3], 'value': [12, 21]},), ({'key': [5], 'value': [42]},)]

Your hunch was correct, this error isn't caught

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the error is caught and tested for 👍

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Some more comments:

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Looking great. Some more comments:

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Almost there. Some more pointers:


namespace duckdb {

// TODO: this doesn't recursively verify maps if maps are nested
void VerifyMap(Vector &map, idx_t count, const SelectionVector &sel) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split this function into a function that returns an enum class MapInvalidReason : uint8_t { MAP_VALID, MAP_NULL_KEY_LIST, MAP_NULL_KEY, MAP_NOT_UNIQUE }; and a function that throws the InvalidInputExceptions?

in Vector::Verify we want to throw an InternalException if the map is not valid, and not an InvalidInputException. We want to do something like:

D_ASSERT(VerifyMapInternal(map, count, sel) == MapInvalidReason::MAP_VALID);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍
I turned VerifyMap into a static method of Vector, that just asserts that it's valid
CheckMapValidity now returns the enum
and I've added two functions that handle the errors
One in arrow_conversion.cpp and one in map.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the ones in Arrow conversion should never be hit, because the only thing we don't share is the unique requirement, but I added them for completeness

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates - last comment, then it is good to go.

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Tishj
Copy link
Contributor Author

Tishj commented Jun 17, 2022

Thanks, now all that's left is to pray to the CI gods to be merciful

@Mytherin Mytherin merged commit ee02a53 into duckdb:master Jun 18, 2022
@Mytherin
Copy link
Collaborator

Looks like they were. Great, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants