FTS not working with stemmer #10254

@AndreaFrancis

What happens?

A simple full-text search returns no rows when the FTS index is created with a non-default stemmer ("russian" in this case), although an equivalent LIKE query on the same column matches rows.

To Reproduce

I used a Parquet file containing Russian text as the source and passed stemmer='russian' to the FTS PRAGMA, as explained in the docs: https://duckdb.org/docs/extensions/full_text_search.html.

A straightforward LIKE filter in the WHERE clause works:
SELECT COUNT(*) FROM "https://huggingface.co/datasets/sberquad/resolve/refs%2Fconvert%2Fparquet/sberquad/test/0000.parquet?download=true" WHERE question LIKE 'Какие%' ;

Using FTS (Extension already installed and loaded):

CREATE OR REPLACE TABLE data AS SELECT context, question FROM "https://huggingface.co/datasets/sberquad/resolve/refs%2Fconvert%2Fparquet/sberquad/test/0000.parquet?download=true";
CREATE OR REPLACE SEQUENCE serial START 0 MINVALUE 0;

ALTER TABLE data ADD COLUMN id BIGINT DEFAULT nextval('serial');

SELECT COUNT(*) FROM data WHERE question LIKE 'Какие%';

PRAGMA create_fts_index('data', 'id', 'context', 'question', stemmer='russian', overwrite=1);

SELECT id, question, score FROM (SELECT *, fts_main_data.match_bm25(id, 'Какие') AS score FROM data) sq WHERE score IS NOT NULL ORDER BY score DESC;
Result: 0 rows.
The index doesn't generate a score value for other input text either.
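
One possible explanation, not confirmed anywhere in this report, is that the text is stripped before the stemmer ever sees it: the FTS documentation lists an `ignore` parameter for `create_fts_index` whose default only keeps the letters a-z, so it would match every Cyrillic character. The sketch below first approximates that step on the search term with `regexp_replace`, then rebuilds the index with an `ignore` pattern that keeps Cyrillic letters. The replacement pattern and `stopwords = 'none'` are assumptions chosen for illustration, not a verified fix.

```sql
-- Approximate the effect of an "only a-z survives" ignore pattern on the
-- search term. If every character is replaced, nothing is left to stem.
SELECT regexp_replace(lower('Какие'), '[^a-z]+', ' ', 'g') AS after_default_ignore;

-- Assumption: override `ignore` so Latin and Cyrillic letters are kept, and
-- disable the default English stopword list. Both values are illustrative.
PRAGMA create_fts_index(
    'data', 'id', 'context', 'question',
    stemmer = 'russian',
    stopwords = 'none',
    ignore = '[^a-zA-Zа-яА-ЯёЁ]+',
    overwrite = 1
);

-- Re-run the original query against the rebuilt index.
SELECT id, question, score
FROM (
    SELECT *, fts_main_data.match_bm25(id, 'Какие') AS score
    FROM data
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;
```

If the first query shows the term reduced to whitespace while the rebuilt index returns scored rows, that would point at the `ignore` pattern rather than the stemmer itself.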

OS:

Ubuntu 22.04.2 LTS x86_64

DuckDB Version:

v0.9.2 3c695d7

DuckDB Client:

v0.9.2 3c695d7

Full Name:

Andrea Soria Jimenez

Affiliation:

Hugging Face

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
