-
Notifications
You must be signed in to change notification settings - Fork 2.6k
[feature] Add Damerau-Levenshtein string comparison function #7035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This sort of thing would be useful for our data-linkage package Splink, to add to the fuzzy-matching options for duckdb users. |
@@ -0,0 +1,106 @@ | |||
#include "duckdb/function/scalar/string_functions.hpp" | |||
|
|||
#include <map> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have wrapped include+using of std containers in duckdb, if you change #include <map>
with #include "duckdb/common/map.hpp"
that should allow you to just use unqualified map
in the codebase.
The same goes for std::vector
, here it's actually more important to use #include "duckdb/common/vector.hpp"
because we have our own duckdb::vector
implementation, that provides a little more memory safety.
In this case you're not returning/receiving vectors from other internal methods, so it's not crucial to use it, but I'd recommend to do so anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, apologies I had not realised that. Will do 👍
const auto inf = source_len * cost_deletion + target_len * cost_insertion + 1; | ||
// minimum edit distance from prefix of source string to prefix of target string | ||
// same object as H in LW paper (with indices offset by 1) | ||
std::vector<std::vector<idx_t>> distance(source_len + 2, std::vector<idx_t>(target_len + 2, inf)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the unqualified vector
instead
INSERT INTO strings VALUES ('identical', 'identical'), ('dientical', 'identical'), | ||
('dinetcila', 'identical'), ('abcdefghijk', 'bacdfzzeghki'), | ||
('abcd', 'bcda'), ('great', 'greta'), | ||
('abcdefghijklmnopqrstuvwxyz', 'abdcpoxwz'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to say we should add some extra tests for strings > 12 characters, because our string_t
inlines strings up to 12 characters, so memory issues don't really become apparent unless longer strings are used.
But I don't think it matters here, you're not creating strings, only reading existing ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and the changes, LGTM 👍
That's great @Tishj - thanks for the feedback and taking the time to look this over 👍 |
Question connected to memory usage: I see here while implementing Levenshtein the 2 arrays strategy has been used, and a note about why the O(N) instead of O(N^2) strategy was required. On Levenshtein is more trivial, but I think something similar could be used also here, potentially bringing to O(N * alphabet size) should be attainable, unsure at what complexity costs. Probably for a separate PR, but would write this allocate an O(N * M) distance table. Then still good to go! And thanks a lot for this contribution! And very last: if you could file a relevant documentation update, would be really cool, it's a matter of proposing a change on this file: https://github.com/duckdb/duckdb-web/blob/master/docs/sql/functions/char.md. |
Good point! I did wonder about this a little but was not sure where the trade-offs would be, and felt like perhaps it would be something perhaps for a follow-up.
Okay great, will do 👍 |
Thanks! |
This proposed PR adds a new function
damerau_levenshtein
implementing the Damerau-Levenshtein distance string metric. It is a modification of the Levenshtein string distance, adding the edit operation of 'adjacent transpositions' to the allowed set of substitutions, insertions, and deletions. Allowing transpositions increases the complexity enough that the implementation is not a trivial extension of the existing Levenshtein implementation.