Initial implementation of manual ngram-based search in MongoDB #993

ml-evs · 2024-11-06T19:07:34Z

Closes #679 -- MongoDB text indexes tokenize on whitespace and punctuation only, so partial matches within a token are not found. This PR investigates whether we can build a manual ngram index, so that an item with refcode ABCDEF is also returned when you search for only ABC, BCD, etc.

This is done by creating a separate collection, item_fts, that is used only for this kind of search. It stores the immutable_id, type and ngrams of every item, with an index over ngrams. Lookup then splits the query string into ngrams, matches them against the stored ngram arrays, and orders the results by the number of matching ngrams.
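To make the lookup concrete, here is a minimal in-memory sketch of the idea, not the code in this PR. The helper names (`ngrams`, `build_fts_entry`, `search`, `mongo_pipeline`), the choice of N = 3, the lowercasing, and indexing only a `refcode` field are all assumptions for illustration. `mongo_pipeline` shows one way the same ranking could be expressed as an aggregation over an `item_fts`-style collection; the actual query used here may differ.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Distinct character n-grams of `text`, lowercased, in order of first appearance."""
    text = text.lower()  # assumption: search should be case-insensitive
    return list(dict.fromkeys(text[i : i + n] for i in range(len(text) - n + 1)))


def build_fts_entry(item: dict, n: int = 3) -> dict:
    """Build a hypothetical item_fts document (only the refcode is indexed here)."""
    return {
        "immutable_id": item["immutable_id"],
        "type": item["type"],
        "ngrams": ngrams(item["refcode"], n),
    }


def search(entries: list[dict], query: str, n: int = 3) -> list[str]:
    """Return immutable_ids sharing at least one n-gram with `query`, best match first."""
    wanted = set(ngrams(query, n))
    scored = [(len(wanted & set(e["ngrams"])), e["immutable_id"]) for e in entries]
    hits = [(score, _id) for score, _id in scored if score > 0]
    hits.sort(key=lambda pair: -pair[0])  # stable sort: ties keep insertion order
    return [_id for _, _id in hits]


def mongo_pipeline(query: str, n: int = 3) -> list[dict]:
    """One possible aggregation pipeline that ranks by the number of shared n-grams."""
    q = ngrams(query, n)
    return [
        {"$match": {"ngrams": {"$in": q}}},  # can use the index over ngrams
        {"$addFields": {"score": {"$size": {"$setIntersection": ["$ngrams", q]}}}},
        {"$sort": {"score": -1}},
    ]
```

Note that with this scheme a query shorter than N produces no ngrams and therefore no results, which is one reason the choice of N matters.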

Will have to fiddle around to see:

- what the optimal value of N is, and whether we need to do all N+1-grams up to a fixed range
- how expensive this is for realistic deployment sizes
- whether it might be better to try an edit-distance based approach for some fields
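As a rough illustration of the first two questions, the sketch below generates ngrams for every length in a fixed range, which makes it easy to see how the number of stored ngrams per string grows with the range. The function name `ngrams_in_range` and its defaults are hypothetical, as is the lowercasing.

```python
def ngrams_in_range(text: str, n_min: int = 3, n_max: int = 4) -> list[str]:
    """Distinct lowercased character n-grams of `text` for every n in [n_min, n_max]."""
    text = text.lower()  # assumption: search should be case-insensitive
    grams: list[str] = []
    for n in range(n_min, n_max + 1):
        # A string of length L has at most L - n + 1 n-grams of length n.
        grams.extend(text[i : i + n] for i in range(len(text) - n + 1))
    return list(dict.fromkeys(grams))  # drop duplicates, keep first-seen order
```

For a six-character refcode with no repeated substrings, the range 3 to 4 stores 7 ngrams and the range 3 to 6 stores 10, so the index grows with both the string length and the width of the range.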

cypress · 2024-11-06T19:19:27Z

datalab Run #3028

Project	`datalab`
Branch Review	`ml-evs/mongo-fts-ngram`
Run status	`Passed #3028`
Run duration	`07m 50s`
Commit	`8fc03f5280 ℹ️: Merge f134257e12bec2e1868e9ca561b7e1820699c3e5 into 673a7fb50e2659db3dbb3706e343...`
Committer	`Matthew Evans`

Test results
Failures	`0`
Flaky	`0`
Pending	`0`
Skipped	`0`
Passing	`471`

codecov · 2024-11-06T19:23:09Z

Codecov Report

Attention: Patch coverage is 93.42105% with 5 lines in your changes missing coverage. Please review.

Project coverage is 70.57%. Comparing base (673a7fb) to head (f134257).
Report is 123 commits behind head on main.

Files with missing lines	Patch %	Lines
pydatalab/src/pydatalab/mongo.py	91.66%	4 Missing ⚠️
pydatalab/src/pydatalab/routes/v0_1/items.py	96.29%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #993      +/-   ##
==========================================
+ Coverage   70.18%   70.57%   +0.38%     
==========================================
  Files          63       63              
  Lines        4119     4190      +71     
==========================================
+ Hits         2891     2957      +66     
- Misses       1228     1233       +5

Files with missing lines	Coverage Δ
pydatalab/src/pydatalab/main.py	`64.82% <100.00%> (+0.24%)`	⬆️
pydatalab/src/pydatalab/routes/v0_1/items.py	`83.68% <96.29%> (+0.83%)`	⬆️
pydatalab/src/pydatalab/mongo.py	`84.34% <91.66%> (+5.24%)`	⬆️

ml-evs · 2024-11-10T15:58:34Z

I think this is ready for review now, though we should not make it the default until we can test how it scales and the quality of the search results.

ml-evs · 2025-02-07T11:39:20Z

pydatalab/src/pydatalab/main.py

@@ -206,6 +206,7 @@ def create_app(
        extension.init_app(app)

    pydatalab.mongo.create_default_indices()
+    pydatalab.mongo.create_ngram_item_index()


This should only be run by one of the API processes, or it should be guarded by a lock.
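One way to guard the call is sketched below, assuming all API processes run on the same host and can share a lock file. The names `startup_lock` and `run_once_concurrently` are hypothetical and not part of pydatalab. The lock only prevents concurrent runs: a process that starts after the lock has been released would build the index again, so the build itself would still need to be idempotent.

```python
import os
from contextlib import contextmanager


@contextmanager
def startup_lock(path: str):
    """Yield True to the single process that manages to create `path`, False otherwise."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        fd = None
    if fd is None:
        yield False
        return
    try:
        yield True
    finally:
        os.close(fd)
        os.remove(path)  # release the lock once the work is done


def run_once_concurrently(build, lock_path: str) -> bool:
    """Call `build()` only if this process holds the lock; return whether it ran."""
    with startup_lock(lock_path) as acquired:
        if acquired:
            build()
    return acquired
```

In `create_app`, the call to `pydatalab.mongo.create_ngram_item_index` could then be passed as `build`. A deployment spread over several hosts would need a lock held in a shared service, such as the database itself, rather than a local file.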

ml-evs · 2025-04-04T18:44:05Z

Superseded by #1098, for now

ml-evs added the server label Nov 6, 2024

ml-evs requested review from jdbocarsly and BenjaminCharmes as code owners November 6, 2024 19:07

ml-evs changed the title ~~[WIP] Initial noodling with manual ngram index in MongoDB~~ [WIP] Initial noodling with manual ngram-based search in MongoDB Nov 6, 2024

ml-evs mentioned this pull request Nov 7, 2024

Fix and refactor FTS field generation #998

Merged

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 79b2bcc to dbf241f Compare November 10, 2024 15:43

ml-evs changed the title ~~[WIP] Initial noodling with manual ngram-based search in MongoDB~~ Initial implementation of manual ngram-based search in MongoDB Nov 10, 2024

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 63be944 to d4969ae Compare November 10, 2024 15:58

ml-evs added the enhancement New feature or request label Nov 10, 2024

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from d4969ae to 4fc2c6e Compare November 25, 2024 13:44

ml-evs mentioned this pull request Nov 28, 2024

Secondary indices for vector search #1017

Open

ml-evs commented Feb 7, 2025

ml-evs and others added 6 commits April 4, 2025 16:40

Initial noodling with manual ngram index in MongoDB

22dfb2f

Add working tests

97f5865

Implement rudimentary ngram-based search with item updates and add tests

5d422be

Rebase

ec9adb7

Make sure find_one_and_update returns the updated doc

1c78cca

Add a temp. view to compare searches

f134257

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 5faf93c to f134257 Compare April 4, 2025 15:42

ml-evs mentioned this pull request Apr 4, 2025

Add regex-based alternative to full-text search #1098

Merged

ml-evs closed this Apr 4, 2025
