Conversation

ADBond
Contributor

@ADBond commented Apr 11, 2023

This PR adds a new function damerau_levenshtein implementing the Damerau-Levenshtein string distance metric. It is a modification of the Levenshtein distance that adds 'adjacent transposition' to the allowed set of edit operations (substitution, insertion, and deletion). Allowing transpositions increases the complexity enough that the implementation is not a trivial extension of the existing Levenshtein implementation.
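For illustration, here is a standalone sketch of the Lowrance-Wagner dynamic programme this kind of implementation is based on (the same `H` table, with offset indices, that the diff below refers to). Names and structure here are illustrative, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Full Damerau-Levenshtein distance (Lowrance-Wagner algorithm).
// H[i+1][j+1] holds the edit distance between the first i characters of
// `source` and the first j characters of `target`; the extra row/column of
// `inf` sentinels makes the transposition lookups safe at the boundaries.
size_t DamerauLevenshtein(const std::string &source, const std::string &target) {
	const size_t n = source.size();
	const size_t m = target.size();
	const size_t inf = n + m + 1; // larger than any attainable distance
	std::vector<std::vector<size_t>> H(n + 2, std::vector<size_t>(m + 2, inf));
	for (size_t i = 0; i <= n; i++) {
		H[i + 1][1] = i; // deleting i characters from source
	}
	for (size_t j = 0; j <= m; j++) {
		H[1][j + 1] = j; // inserting j characters of target
	}
	std::map<char, size_t> last_row; // last row in which each source char matched
	for (size_t i = 1; i <= n; i++) {
		size_t last_col = 0; // last column in this row where the chars matched
		for (size_t j = 1; j <= m; j++) {
			const size_t i1 = last_row.count(target[j - 1]) ? last_row[target[j - 1]] : 0;
			const size_t j1 = last_col;
			const size_t cost = (source[i - 1] == target[j - 1]) ? 0 : 1;
			if (cost == 0) {
				last_col = j;
			}
			H[i + 1][j + 1] = std::min({
			    H[i][j] + cost,                             // substitution / match
			    H[i + 1][j] + 1,                            // insertion
			    H[i][j + 1] + 1,                            // deletion
			    H[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1) // transposition
			});
		}
		last_row[source[i - 1]] = i;
	}
	return H[n + 1][m + 1];
}
```

Note the transposition term reaches back to an arbitrary earlier row (the last one where the current target character matched), which is why a per-character map is needed and why this is not a trivial extension of plain Levenshtein.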

@ADBond
Contributor Author

ADBond commented Apr 11, 2023

This sort of thing would be useful for our data-linkage package Splink, to add to the fuzzy-matching options available to DuckDB users.

@@ -0,0 +1,106 @@
#include "duckdb/function/scalar/string_functions.hpp"

#include <map>
Contributor


We have wrapped the include+using of std containers in duckdb; if you replace #include <map> with #include "duckdb/common/map.hpp", you can use unqualified map in the codebase.

The same goes for std::vector. Here it's actually more important to use #include "duckdb/common/vector.hpp", because we have our own duckdb::vector implementation, which provides a little more memory safety.
In this case you're not returning/receiving vectors from other internal methods, so it's not crucial, but I'd recommend doing so anyway.
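Concretely, the suggested change would look something like this (a sketch based on the reviewer's description of duckdb's wrapper headers, not the exact diff):

```cpp
// Before: standard library containers, qualified with std::
// #include <map>
// #include <vector>

// After: duckdb's wrapper headers, which (per the reviewer) make the
// containers available unqualified inside the duckdb namespace, with
// duckdb::vector adding a little extra memory safety over std::vector
#include "duckdb/common/map.hpp"
#include "duckdb/common/vector.hpp"

// ...so inside duckdb code the table from the diff below could be declared as:
// vector<vector<idx_t>> distance(source_len + 2, vector<idx_t>(target_len + 2, inf));
```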

Contributor Author


Ah okay, apologies, I had not realised that. Will do 👍

const auto inf = source_len * cost_deletion + target_len * cost_insertion + 1;
// minimum edit distance from prefix of source string to prefix of target string
// same object as H in LW paper (with indices offset by 1)
std::vector<std::vector<idx_t>> distance(source_len + 2, std::vector<idx_t>(target_len + 2, inf));
Contributor


Please use the unqualified vector instead

INSERT INTO strings VALUES ('identical', 'identical'), ('dientical', 'identical'),
('dinetcila', 'identical'), ('abcdefghijk', 'bacdfzzeghki'),
('abcd', 'bcda'), ('great', 'greta'),
('abcdefghijklmnopqrstuvwxyz', 'abdcpoxwz'),
Contributor


I want to say we should add some extra tests for strings > 12 characters, because our string_t inlines strings up to 12 characters, so memory issues don't really become apparent unless longer strings are used.
But I don't think it matters here; you're not creating strings, only reading existing ones.

Contributor

@Tishj left a comment


Thanks for the PR and the changes, LGTM 👍

@ADBond
Contributor Author

ADBond commented Apr 13, 2023

That's great @Tishj - thanks for the feedback and taking the time to look this over 👍

@carlopi
Contributor

carlopi commented Apr 13, 2023

Question connected to memory usage: I see that when implementing Levenshtein the two-arrays strategy was used, with a note about why the O(N) strategy was required instead of O(N^2). For Levenshtein this is more straightforward, but I think something similar could be used here too; getting to O(N * alphabet size) should be attainable, though I'm unsure at what cost in code complexity. Probably one for a separate PR, but as written this allocates an O(N * M) distance table.
I would add a stress test with increasingly longer strings, to at least estimate where the breaking point is.
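For reference, the two-arrays strategy mentioned above keeps only the previous and current rows of the DP table, so plain Levenshtein needs memory linear in one string rather than O(N * M). A minimal standalone sketch (illustrative, not duckdb's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance using two rolling rows instead of the full table:
// O(|target|) memory rather than O(|source| * |target|).
size_t Levenshtein(const std::string &source, const std::string &target) {
	std::vector<size_t> prev(target.size() + 1);
	std::vector<size_t> curr(target.size() + 1);
	for (size_t j = 0; j <= target.size(); j++) {
		prev[j] = j; // distance from the empty prefix of source
	}
	for (size_t i = 1; i <= source.size(); i++) {
		curr[0] = i; // distance to the empty prefix of target
		for (size_t j = 1; j <= target.size(); j++) {
			const size_t cost = (source[i - 1] == target[j - 1]) ? 0 : 1;
			curr[j] = std::min({prev[j - 1] + cost, // substitution / match
			                    prev[j] + 1,        // deletion
			                    curr[j - 1] + 1});  // insertion
		}
		std::swap(prev, curr);
	}
	return prev[target.size()];
}
```

Transpositions complicate this rolling-row trick: the transposition term reaches back to the last row in which each character matched, not just the previous row, so a Damerau-Levenshtein version would need to retain one saved row per alphabet symbol, which is presumably where the O(N * alphabet size) estimate comes from.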

Then still good to go! And thanks a lot for this contribution!

And very last: if you could file a relevant documentation update, that would be really cool; it's a matter of proposing a change to this file: https://github.com/duckdb/duckdb-web/blob/master/docs/sql/functions/char.md.

@ADBond
Contributor Author

ADBond commented Apr 13, 2023

Question connected to memory usage: I see that when implementing Levenshtein the two-arrays strategy was used, with a note about why the O(N) strategy was required instead of O(N^2). For Levenshtein this is more straightforward, but I think something similar could be used here too; getting to O(N * alphabet size) should be attainable, though I'm unsure at what cost in code complexity. Probably one for a separate PR, but as written this allocates an O(N * M) distance table. I would add a stress test with increasingly longer strings, to at least estimate where the breaking point is.

Good point! I did wonder about this a little, but was not sure where the trade-offs would be, and felt it would perhaps be something for a follow-up.

And very last: if you could file a relevant documentation update, that would be really cool; it's a matter of proposing a change to this file: https://github.com/duckdb/duckdb-web/blob/master/docs/sql/functions/char.md.

Okay great, will do 👍

@Mytherin
Collaborator

Thanks!

@ADBond deleted the damerau-levenshtein branch April 14, 2023 09:24