Initial implementation of manual ngram-based search in MongoDB #993

Closed
wants to merge 6 commits into from

Conversation

ml-evs
Member

@ml-evs ml-evs commented Nov 6, 2024

Closes #679. MongoDB text indexes tokenize on whitespace and punctuation only, so a partial refcode does not match. This PR investigates whether we can build a manual ngram index so that a search for ABC, BCD, etc. still returns the item with refcode ABCDEF.

This is done by creating a separate collection, item_fts, that is used only for this kind of search. It stores immutable_id, type and ngrams for every item, with an index over ngrams. Lookup then splits the query string into ngrams, matches them against the stored ngrams arrays, and orders the results by the number of matching ngrams.
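As a rough illustration of the approach described above, here is a minimal pure-Python sketch. It is not the PR's code: the helper names `ngrams` and `search`, the lowercasing step, the default N of 3, and the in-memory list standing in for the `item_fts` collection are all assumptions made for the example. In MongoDB the lookup would presumably become a match on the indexed `ngrams` array, but the actual query is not shown in this thread.

```python
from collections import Counter


def ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text`.

    Lowercasing is an assumption for this sketch, not something the PR states.
    Strings shorter than `n` are kept whole so they remain searchable.
    """
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Hypothetical stand-in for the `item_fts` collection: one document per item,
# holding only the fields mentioned in the PR description.
ITEM_FTS = [
    {"immutable_id": "1", "type": "samples", "ngrams": ngrams("ABCDEF")},
    {"immutable_id": "2", "type": "samples", "ngrams": ngrams("XYZDEF")},
    {"immutable_id": "3", "type": "cells", "ngrams": ngrams("QRSTUV")},
]


def search(query: str, n: int = 3) -> list[tuple[str, int]]:
    """Rank items by how many of the query's n-grams they contain."""
    query_ngrams = ngrams(query, n)
    scores: Counter[str] = Counter()
    for doc in ITEM_FTS:
        overlap = len(query_ngrams & doc["ngrams"])
        if overlap:
            scores[doc["immutable_id"]] = overlap
    # Highest number of matching n-grams first; ties broken by id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(search("ABC"))
print(search("CDEF"))
```

With these toy documents, the query ABC matches only the item whose refcode is ABCDEF, while CDEF ranks that item above XYZDEF because it shares two trigrams rather than one.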

Will have to fiddle around to see:

  • what the optimal value of N is, and whether we need to index all n-grams from N up to some fixed maximum length
  • how expensive this is for realistic deployment sizes
  • whether it might be better to try an edit-distance based approach for some fields
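Two of the open questions above can be explored offline before touching the database. The sketch below is only an illustration under assumptions: `ngram_range` and `levenshtein` are hypothetical helper names, and the range 3 to 5 is arbitrary. It shows how the number of stored n-grams grows when several lengths are indexed, and what a plain edit-distance comparison would look like for short fields such as refcodes.

```python
def ngram_range(text: str, n_min: int = 3, n_max: int = 5) -> set[str]:
    """Collect all character n-grams of `text` for n_min <= n <= n_max."""
    text = text.lower()
    grams: set[str] = set()
    for n in range(n_min, n_max + 1):
        # range() is empty when the text is shorter than n, so nothing is added.
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute ca -> cb
                )
            )
        previous = current
    return previous[-1]


# Index size per item for a single N versus a range of lengths.
print(len(ngram_range("ABCDEF", 3, 3)), len(ngram_range("ABCDEF", 3, 5)))
# A one-character typo in a refcode is a single edit away.
print(levenshtein("ABCDEF", "ABCDEG"))
```

For a string of length L with no repeated substrings, each extra n adds L - n + 1 entries, so indexing a range of lengths multiplies the size of the ngrams array roughly by the number of lengths used.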

@ml-evs ml-evs added the server label Nov 6, 2024
@ml-evs ml-evs changed the title [WIP] Initial noodling with manual ngram index in MongoDB [WIP] Initial noodling with manual ngram-based search in MongoDB Nov 6, 2024

cypress bot commented Nov 6, 2024

datalab Run #3028

Run Properties:
Project: datalab
Branch Review: ml-evs/mongo-fts-ngram
Run status: Passed #3028
Run duration: 07m 50s
Commit: 8fc03f5280 (Merge f134257e12bec2e1868e9ca561b7e1820699c3e5 into 673a7fb50e2659db3dbb3706e343...)
Committer: Matthew Evans

Test results
Failures: 0
Flaky: 0
Pending: 0
Skipped: 0
Passing: 471


codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 93.42105% with 5 lines in your changes missing coverage. Please review.

Project coverage is 70.57%. Comparing base (673a7fb) to head (f134257).
Report is 123 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pydatalab/src/pydatalab/mongo.py | 91.66% | 4 Missing ⚠️ |
| pydatalab/src/pydatalab/routes/v0_1/items.py | 96.29% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #993      +/-   ##
==========================================
+ Coverage   70.18%   70.57%   +0.38%     
==========================================
  Files          63       63              
  Lines        4119     4190      +71     
==========================================
+ Hits         2891     2957      +66     
- Misses       1228     1233       +5     
| Files with missing lines | Coverage Δ |
|---|---|
| pydatalab/src/pydatalab/main.py | 64.82% <100.00%> (+0.24%) ⬆️ |
| pydatalab/src/pydatalab/routes/v0_1/items.py | 83.68% <96.29%> (+0.83%) ⬆️ |
| pydatalab/src/pydatalab/mongo.py | 84.34% <91.66%> (+5.24%) ⬆️ |

@ml-evs ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 79b2bcc to dbf241f Compare November 10, 2024 15:43
@ml-evs ml-evs changed the title [WIP] Initial noodling with manual ngram-based search in MongoDB Initial implementation of manual ngram-based search in MongoDB Nov 10, 2024
@ml-evs ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 63be944 to d4969ae Compare November 10, 2024 15:58
Member Author

ml-evs commented Nov 10, 2024

I think this is ready for review now, though we should not make it the default until we can test the scaling and the quality of the search results.

@ml-evs ml-evs added the enhancement New feature or request label Nov 10, 2024
@ml-evs ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from d4969ae to 4fc2c6e Compare November 25, 2024 13:44
@@ -206,6 +206,7 @@ def create_app(
 extension.init_app(app)
 
 pydatalab.mongo.create_default_indices()
+pydatalab.mongo.create_ngram_item_index()
Member Author

This should only be run by one of the API processes, or it should be guarded by a lock.
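The review comment does not say what kind of lock is intended. One possible shape, sketched here purely as an assumption, is an atomic lock file created with `O_CREAT | O_EXCL`, so that only the first process to start performs the build. The names `single_process_lock` and `build_index_once` and the lock path are hypothetical. A lock held in the database itself may suit a multi-host deployment better.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_process_lock(path: str):
    """Yield True if this process created the lock file, False if it already exists."""
    try:
        # O_CREAT | O_EXCL fails atomically if the file is already present.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        os.write(fd, str(os.getpid()).encode())
        yield True
    finally:
        os.close(fd)
        os.remove(path)


def build_index_once(lock_path: str, build) -> bool:
    """Run `build()` only if the lock was acquired; report whether it ran."""
    with single_process_lock(lock_path) as acquired:
        if acquired:
            build()
        return acquired


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.gettempdir(), "ngram_index_demo.lock")
    # Stand-in for the real index build, which is not reproduced here.
    ran = build_index_once(lock_path, lambda: print("building ngram index"))
    print("built by this process:", ran)
```

This sketch releases the lock as soon as the build finishes, so a process that starts later would build again. Whether that is acceptable depends on how expensive the build is, which the PR leaves as an open question.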

Member Author

ml-evs commented Apr 4, 2025

Superseded by #1098, for now

@ml-evs ml-evs closed this Apr 4, 2025
Labels
enhancement New feature or request server
Development

Successfully merging this pull request may close these issues.

Free text search could be improved
2 participants