Conversation

ADBond
Contributor

@ADBond commented Apr 11, 2023

This PR adds a new function damerau_levenshtein implementing the Damerau-Levenshtein string distance metric. It is a modification of the Levenshtein distance that adds 'adjacent transposition' to the allowed set of edit operations (substitution, insertion, and deletion). Allowing transpositions increases the complexity enough that the implementation is not a trivial extension of the existing Levenshtein implementation.
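For illustration, here is a standalone sketch of the Lowrance-Wagner dynamic programme this kind of implementation is based on (the same `H` table, with offset indices, that the diff below refers to). Names and structure here are illustrative, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Full Damerau-Levenshtein distance (Lowrance-Wagner algorithm).
// H[i+1][j+1] holds the edit distance between the first i characters of
// `source` and the first j characters of `target`; the extra row/column of
// `inf` sentinels makes the transposition lookups safe at the boundaries.
size_t DamerauLevenshtein(const std::string &source, const std::string &target) {
	const size_t n = source.size();
	const size_t m = target.size();
	const size_t inf = n + m + 1; // larger than any attainable distance
	std::vector<std::vector<size_t>> H(n + 2, std::vector<size_t>(m + 2, inf));
	for (size_t i = 0; i <= n; i++) {
		H[i + 1][1] = i; // deleting i characters from source
	}
	for (size_t j = 0; j <= m; j++) {
		H[1][j + 1] = j; // inserting j characters of target
	}
	std::map<char, size_t> last_row; // last row in which each source char matched
	for (size_t i = 1; i <= n; i++) {
		size_t last_col = 0; // last column in this row where the chars matched
		for (size_t j = 1; j <= m; j++) {
			const size_t i1 = last_row.count(target[j - 1]) ? last_row[target[j - 1]] : 0;
			const size_t j1 = last_col;
			const size_t cost = (source[i - 1] == target[j - 1]) ? 0 : 1;
			if (cost == 0) {
				last_col = j;
			}
			H[i + 1][j + 1] = std::min({
			    H[i][j] + cost,                             // substitution / match
			    H[i + 1][j] + 1,                            // insertion
			    H[i][j + 1] + 1,                            // deletion
			    H[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1) // transposition
			});
		}
		last_row[source[i - 1]] = i;
	}
	return H[n + 1][m + 1];
}
```

Note the transposition term reaches back to an arbitrary earlier row (the last one where the current target character matched), which is why a per-character map is needed and why this is not a trivial extension of plain Levenshtein.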

@ADBond
Contributor Author

ADBond commented Apr 11, 2023

This sort of thing would be useful for our data-linkage package Splink, to add to the fuzzy-matching options available to DuckDB users.

@@ -0,0 +1,106 @@
#include "duckdb/function/scalar/string_functions.hpp"

#include <map>
Contributor


We have wrapped the include+using of std containers in duckdb; if you replace #include <map> with #include "duckdb/common/map.hpp", you can use unqualified map in the codebase.

The same goes for std::vector. Here it's actually more important to use #include "duckdb/common/vector.hpp", because we have our own duckdb::vector implementation, which provides a little more memory safety.
In this case you're not returning/receiving vectors from other internal methods, so it's not crucial, but I'd recommend doing so anyway.
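Concretely, the suggested change would look something like this (a sketch based on the reviewer's description of duckdb's wrapper headers, not the exact diff):

```cpp
// Before: standard library containers, qualified with std::
// #include <map>
// #include <vector>

// After: duckdb's wrapper headers, which (per the reviewer) make the
// containers available unqualified inside the duckdb namespace, with
// duckdb::vector adding a little extra memory safety over std::vector
#include "duckdb/common/map.hpp"
#include "duckdb/common/vector.hpp"

// ...so inside duckdb code the table from the diff below could be declared as:
// vector<vector<idx_t>> distance(source_len + 2, vector<idx_t>(target_len + 2, inf));
```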

Contributor Author


Ah okay, apologies, I had not realised that. Will do 👍

const auto inf = source_len * cost_deletion + target_len * cost_insertion + 1;
// minimum edit distance from prefix of source string to prefix of target string
// same object as H in LW paper (with indices offset by 1)
std::vector<std::vector<idx_t>> distance(source_len + 2, std::vector<idx_t>(target_len + 2, inf));
Contributor


Please use the unqualified vector instead

INSERT INTO strings VALUES ('identical', 'identical'), ('dientical', 'identical'),
('dinetcila', 'identical'), ('abcdefghijk', 'bacdfzzeghki'),
('abcd', 'bcda'), ('great', 'greta'),
('abcdefghijklmnopqrstuvwxyz', 'abdcpoxwz'),
Contributor


I want to say we should add some extra tests for strings > 12 characters, because our string_t inlines strings up to 12 characters, so memory issues don't really become apparent unless longer strings are used.
But I don't think it matters here; you're not creating strings, only reading existing ones.

Contributor

@Tishj left a comment


Thanks for the PR and the changes, LGTM 👍

@ADBond
Contributor Author

ADBond commented Apr 13, 2023

That's great @Tishj - thanks for the feedback and taking the time to look this over 👍

@carlopi
Contributor

carlopi commented Apr 13, 2023

Question connected to memory usage: I see that when implementing Levenshtein the two-arrays strategy was used, with a note about why the O(N) strategy was required instead of O(N^2). For Levenshtein this is more straightforward, but I think something similar could be used here too; getting to O(N * alphabet size) should be attainable, though I'm unsure at what cost in code complexity. Probably one for a separate PR, but as written this allocates an O(N * M) distance table.
I would add a stress test with increasingly longer strings, to at least estimate where the breaking point is.
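For reference, the two-arrays strategy mentioned above keeps only the previous and current rows of the DP table, so plain Levenshtein needs memory linear in one string rather than O(N * M). A minimal standalone sketch (illustrative, not duckdb's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance using two rolling rows instead of the full table:
// O(|target|) memory rather than O(|source| * |target|).
size_t Levenshtein(const std::string &source, const std::string &target) {
	std::vector<size_t> prev(target.size() + 1);
	std::vector<size_t> curr(target.size() + 1);
	for (size_t j = 0; j <= target.size(); j++) {
		prev[j] = j; // distance from the empty prefix of source
	}
	for (size_t i = 1; i <= source.size(); i++) {
		curr[0] = i; // distance to the empty prefix of target
		for (size_t j = 1; j <= target.size(); j++) {
			const size_t cost = (source[i - 1] == target[j - 1]) ? 0 : 1;
			curr[j] = std::min({prev[j - 1] + cost, // substitution / match
			                    prev[j] + 1,        // deletion
			                    curr[j - 1] + 1});  // insertion
		}
		std::swap(prev, curr);
	}
	return prev[target.size()];
}
```

Transpositions complicate this rolling-row trick: the transposition term reaches back to the last row in which each character matched, not just the previous row, so a Damerau-Levenshtein version would need to retain one saved row per alphabet symbol, which is presumably where the O(N * alphabet size) estimate comes from.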

Then still good to go! And thanks a lot for this contribution!

And very last: if you could file a relevant documentation update, that would be really cool; it's a matter of proposing a change to this file: https://github.com/duckdb/duckdb-web/blob/master/docs/sql/functions/char.md.

@ADBond
Contributor Author

ADBond commented Apr 13, 2023

Question connected to memory usage: I see that when implementing Levenshtein the two-arrays strategy was used, with a note about why the O(N) strategy was required instead of O(N^2). For Levenshtein this is more straightforward, but I think something similar could be used here too; getting to O(N * alphabet size) should be attainable, though I'm unsure at what cost in code complexity. Probably one for a separate PR, but as written this allocates an O(N * M) distance table. I would add a stress test with increasingly longer strings, to at least estimate where the breaking point is.

Good point! I did wonder about this a little, but was not sure where the trade-offs would be, and felt it would perhaps be something for a follow-up.

And very last: if you could file a relevant documentation update, that would be really cool; it's a matter of proposing a change to this file: https://github.com/duckdb/duckdb-web/blob/master/docs/sql/functions/char.md.

Okay great, will do 👍

@Mytherin
Collaborator

Thanks!

@ADBond deleted the damerau-levenshtein branch April 14, 2023 09:24