Initial implementation of manual ngram-based search in MongoDB #993
datalab

| Project | datalab |
|---|---|
| Branch Review | ml-evs/mongo-fts-ngram |
| Run duration | 07m 50s |
| Committer | Matthew Evans |
| Test results | 0, 0, 0, 0, 471 |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #993 +/- ##
==========================================
+ Coverage 70.18% 70.57% +0.38%
==========================================
Files 63 63
Lines 4119 4190 +71
==========================================
+ Hits 2891 2957 +66
- Misses 1228 1233 +5
79b2bcc to dbf241f
63be944 to d4969ae
I think this is ready for review now, though we should not make it the default until we can test how it scales and the quality of the search results.
d4969ae to 4fc2c6e
@@ -206,6 +206,7 @@ def create_app(
     extension.init_app(app)

     pydatalab.mongo.create_default_indices()
+    pydatalab.mongo.create_ngram_item_index()
This should only be run by one of the API processes, or it should be guarded by a lock.
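One way to guard the index build, sketched here under assumptions: all API processes run on the same Unix host and can share a lock file. The names `startup_lock` and `run_guarded` are hypothetical and not part of pydatalab; `task` stands in for a call such as `pydatalab.mongo.create_ngram_item_index`. A process that fails to get the lock simply skips the task.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def startup_lock(path):
    """Yield True if this process holds an exclusive lock on `path`, else False.

    Uses a non-blocking advisory flock, so it only coordinates processes
    on the same (Unix) host that agree on the same lock file.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            # Another process is already holding the lock.
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor also releases the lock if we held it.
        os.close(fd)


def run_guarded(path, task):
    """Run `task` only if the lock could be taken; return whether it ran."""
    with startup_lock(path) as acquired:
        if acquired:
            task()
        return acquired
```

The lock is released once the task finishes, so a process that starts later would run the task again. That is only acceptable if the index build is safe to repeat; a deployment spread over several hosts would need a lock held somewhere shared, such as the database itself.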
5faf93c to f134257
Superseded by #1098, for now
Closes #679 -- MongoDB text indexes tokenize using whitespace and punctuation only. This PR investigates whether we can build a manual ngram index, so that when searching for refcode `ABCDEF`, you get results if you only ask for `ABC`, `BCD`, etc.

This is done by making a separate collection called `item_fts` that is used only for this kind of search, storing `immutable_id`, `type` and `ngrams` for all items, with an index over `ngrams`. Lookup is then done by ngrammifying the query string, doing an array lookup, and ordering by the number of matches.

Will have to fiddle around to see:

- `N` is, and whether we need to do all N+1-grams up to a fixed range
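The scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the function names are hypothetical, only the `refcode` field is ngrammed, text is lowercased, and `N` defaults to 3. `search_pipeline` only builds a MongoDB aggregation as plain dicts (using `$in`, `$setIntersection` and `$size`), and `search_in_memory` mirrors the same ranking without a database.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Return the unique, lowercased character n-grams of `text`, in order."""
    text = text.lower()
    grams = (text[i : i + n] for i in range(len(text) - n + 1))
    return list(dict.fromkeys(grams))


def fts_document(item: dict, n: int = 3) -> dict:
    """Build the entry stored in the search-only collection for one item."""
    return {
        "immutable_id": item["immutable_id"],
        "type": item["type"],
        # Assumption: only the refcode is ngrammed in this sketch.
        "ngrams": ngrams(item["refcode"], n),
    }


def search_pipeline(query: str, n: int = 3) -> list[dict]:
    """Aggregation stages: match any shared ngram, then rank by overlap."""
    grams = ngrams(query, n)
    return [
        {"$match": {"ngrams": {"$in": grams}}},
        {"$addFields": {"score": {"$size": {"$setIntersection": ["$ngrams", grams]}}}},
        {"$sort": {"score": -1}},
    ]


def search_in_memory(docs: list[dict], query: str, n: int = 3) -> list[str]:
    """Same ranking as the pipeline, applied to a list of documents."""
    grams = set(ngrams(query, n))
    scored = [(len(grams & set(doc["ngrams"])), doc) for doc in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc["immutable_id"] for _, doc in hits]
```

With `N = 3`, a query shorter than three characters produces no ngrams and so matches nothing, which is one reason the choice of `N` needs experimenting with.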