Description
What happens?
A simple full-text search returns no results when the FTS index is created with a non-default stemmer (here `russian`), although the equivalent search with the LIKE operator matches rows.
To Reproduce
I used a Parquet file containing Russian text as the source and passed the "russian" stemmer parameter to the FTS PRAGMA, as explained in the documentation: https://duckdb.org/docs/extensions/full_text_search.html.
A straightforward LIKE filter in the WHERE clause works:
SELECT COUNT(*) FROM "https://huggingface.co/datasets/sberquad/resolve/refs%2Fconvert%2Fparquet/sberquad/test/0000.parquet?download=true" WHERE question LIKE 'Какие%' ;
Using FTS (extension already installed and loaded), the same search term returns nothing:
CREATE OR REPLACE TABLE data AS SELECT context, question FROM "https://huggingface.co/datasets/sberquad/resolve/refs%2Fconvert%2Fparquet/sberquad/test/0000.parquet?download=true";
CREATE OR REPLACE SEQUENCE serial START 0 MINVALUE 0;
ALTER TABLE data ADD COLUMN id BIGINT DEFAULT nextval('serial');
SELECT COUNT(*) FROM data WHERE question LIKE 'Какие%';
PRAGMA create_fts_index('data', 'id', 'context', 'question', stemmer='russian', overwrite=1);
SELECT id, question, score FROM (SELECT *, fts_main_data.match_bm25(id, 'Какие') AS score FROM data) sq WHERE score IS NOT NULL ORDER BY score DESC;
0 rows
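One possible explanation for the empty result, offered as a hypothesis rather than a confirmed diagnosis: the FTS documentation lists a default `ignore` pattern of `'(\\.|[^a-z])+'` for `create_fts_index`, which would discard every character outside `a-z` before stemming. The sketch below only approximates that single preprocessing step in Python (lowercase, blank out matches of the pattern, split on whitespace); it does not model accent stripping, stopwords, or the stemmer itself. The helper `surviving_tokens` and the pattern `CYRILLIC_IGNORE` are hypothetical names introduced for illustration.

```python
import re

# Assumption: default `ignore` pattern as listed in the FTS extension docs.
DEFAULT_IGNORE = r"(\\.|[^a-z])+"

# Hypothetical alternative that also keeps Cyrillic letters (not from the report).
CYRILLIC_IGNORE = r"(\\.|[^a-zа-яё])+"


def surviving_tokens(text, ignore=DEFAULT_IGNORE):
    """Lowercase the text, replace everything matching `ignore` with a space,
    and return the whitespace-separated tokens that remain."""
    cleaned = re.sub(ignore, " ", text.lower())
    return cleaned.split()


print(surviving_tokens("Какие"))                   # [] (nothing left to index)
print(surviving_tokens("What is DuckDB?"))         # ['what', 'is', 'duckdb']
print(surviving_tokens("Какие", CYRILLIC_IGNORE))  # ['какие']
```

If this is what happens inside the extension, both the indexed columns and the query term `'Какие'` would be reduced to nothing, which would be consistent with `match_bm25` returning NULL for every row. Passing a custom `ignore` pattern to `create_fts_index` might then be worth trying, though that is untested here.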
It does not generate a score value for other input text either.
OS:
Ubuntu 22.04.2 LTS x86_64
DuckDB Version:
v0.9.2 3c695d7
DuckDB Client:
v0.9.2 3c695d7
Full Name:
Andrea Soria Jimenez
Affiliation:
Hugging Face
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have