Skip to content

Conversation

tiagokepe
Copy link
Contributor

Hi all,

This PR provides support for collations in the LIKE operator and scalar string functions.

For the LIKE operator, we decompose the like pattern into segments using the CreateLikeMatcher, then we apply the collation on each individual literal, a brief discussion can be found here

For some cases, I had to change the bind_function_expression.cpp and push down the collations on each function child, as a good consequence the scalar string functions support collations.

I also adapted the like optimization rules to support collations while replacing the LIKE function for optimized functions such as suffix, prefix, and contains.

Examples are available on test/sql/collate/collate_like.test and test/sql/collate/collate_scalar_string_functions.test

@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 3, 2024 19:15
@tiagokepe tiagokepe marked this pull request as ready for review June 4, 2024 10:30
@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 8, 2024 14:25
Copy link

github-actions bot commented Oct 9, 2024

This pull request is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Oct 9, 2024
@Mytherin Mytherin removed the stale label Oct 30, 2024
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I think this PR is really trying to do two distinct things, that should ideally be split up into two separate PRs:

  • Add collation support to LIKE
  • Add generic collation support to string functions

I think it would be better to handle these separately. As you can see in the LIKE code - not all string functions can handle collations by simply calling ExpressionBinder::PushCollation on the children. As such, I think this needs more work than what is done in this PR. For example, we could extend the function itself so that it signals how collations should be handled.

As for like itself - it seems that collations are only handled if the pattern and escape are constants. Constant patterns are only an optimization - this shifts it towards being requirement for correctness. For example, this does not work correctly for non-constant patterns in the current implementation:

SELECT 'HELLO' COLLATE NOCASE LIKE 'hel_%' AS result;
┌─────────┐
│ result  │
│ boolean │
├─────────┤
│ true    │
└─────────┘
SELECT 'HELLO' COLLATE NOCASE LIKE pattern AS result FROM (SELECT 'hel_%' pattern);
┌─────────┐
│ result  │
│ boolean │
├─────────┤
│ false   │
└─────────┘

I'm also unsure if the current approach works for collations in general. While it works for NOCASE (which only requires case-insensitivity), it does not seem to work for ICU collations:

SELECT 'HELLO' COLLATE de LIKE 'HEL%' AS result;
┌─────────┐
│ result  │
│ boolean │
├─────────┤
│ false   │
└─────────┘

I think much more extensive testing is required here. Making LIKE work with arbitrary (ICU) collations seems challenging and I'm not sure if pushing collations to each of the segments is sufficient here.

If the goal is only to support NOACCENT - we could also detect that specific collation, and just push an ILIKE operator which already exists and works in all scenarios:

D SELECT 'HELLO' ILIKE 'hel_%' AS result;
┌─────────┐
│ result  │
│ boolean │
├─────────┤
│ true    │
└─────────┘
D SELECT 'HELLO' ILIKE pattern AS result FROM (SELECT 'hel_%' pattern);
┌─────────┐
│ result  │
│ boolean │
├─────────┤
│ true    │
└─────────┘

Mytherin added a commit that referenced this pull request Nov 6, 2024
Supersedes #12359

This PR adds a new property to functions - `FunctionCollationHandling`.
This can be used to handle collations across different functions in an
easy way. There are three settings:

* **PROPAGATE_COLLATIONS (default)**: Propagate collations, but do not
apply them. This means that if the function returns `VARCHAR`, and any
of the input arguments are `VARCHAR`, the return type will inherit the
collations of the input arguments.
* **PUSH_COMBINABLE_COLLATIONS**: Propagates collations and applies
**combinable** collations prior to calling the function. Combinable
collations are "simple" collations like `nocase` and `noaccent`, but not
`icu` collations that are more complex.
* **IGNORE_COLLATIONS**: Ignore collations entirely.

`PUSH_COMBINABLE_COLLATIONS` is used for various functions, e.g.
`contains`, `starts_with`, `glob`, `like`, etc:

```sql
D SELECT 'hello' COLLATE NOCASE LIKE '%HEL%' AS result;
┌─────────┐
│ result  │
│ boolean │
├─────────┤
│ true    │
└─────────┘
```
@Mytherin
Copy link
Collaborator

Mytherin commented Nov 6, 2024

This has been superseded by #14717

@Mytherin Mytherin closed this Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants