
BM25 matching scores seem to be invalid #12330

@zeppafi

Description


What happens?

The ordering of the result table based on BM25 scores seems to be incorrect. Scores can even be negative if the search query contains terms that are very common among the searched documents.

To Reproduce

INSTALL fts;
LOAD fts;

CREATE OR REPLACE TABLE documents (
    id VARCHAR,
    content VARCHAR
);

INSERT INTO documents VALUES 
    ('doc1', 'DuckDB database lorem'), 
    ('doc2', 'DuckDB database ipsum'), 
    ('doc3', 'DuckDB database ipsum');

PRAGMA create_fts_index('documents', 'id', 'content');

SELECT 
    id,
    fts_main_documents.match_bm25(id, 'DuckDB') score_DuckDB,
    fts_main_documents.match_bm25(id, 'DuckDB database') score_DuckDB_Database,
    fts_main_documents.match_bm25(id, 'DuckDB database lorem') score_DuckDB_Database_lorem,
    fts_main_documents.match_bm25(id, 'lorem') score_lorem,
    fts_main_documents.match_bm25(id, 'DuckDB database ipsum') score_DuckDB_Database_ipsum,
FROM 
    documents;

This gives the following result table:

┌─────────┬─────────────────────┬───────────────────────┬─────────────────────────────┬─────────────────────┬─────────────────────────────┐
│   id    │    score_DuckDB     │ score_DuckDB_Database │ score_DuckDB_Database_lorem │     score_lorem     │ score_DuckDB_Database_ipsum │
│ varchar │       double        │        double         │           double            │       double        │           double            │
├─────────┼─────────────────────┼───────────────────────┼─────────────────────────────┼─────────────────────┼─────────────────────────────┤
│ doc1    │ -0.8450980400142569 │   -1.6901960800285138 │         -1.4683473304121575 │ 0.22184874961635637 │         -1.6901960800285138 │
│ doc2    │ -0.8450980400142569 │   -1.6901960800285138 │         -1.6901960800285138 │                     │         -1.9120448296448702 │
│ doc3    │ -0.8450980400142569 │   -1.6901960800285138 │         -1.6901960800285138 │                     │         -1.9120448296448702 │
└─────────┴─────────────────────┴───────────────────────┴─────────────────────────────┴─────────────────────┴─────────────────────────────┘

A higher score means a better match. You can see that the scores are counterintuitive, e.g.:

  1. (Edit, improved example): The query 'DuckDB database ipsum' (score_DuckDB_Database_ipsum) matches documents doc2 and doc3 perfectly, yet the clearly worse match doc1 gets a better score.
  2. (Older, more unclear example): The search query 'DuckDB database lorem' is equal to the content of doc1, but it gets a lower score (score_DuckDB_Database_lorem: -1.4683473304121575) than the only partially matching query 'lorem' (score_lorem: 0.22184874961635637).

I've identified a potential issue in the formula used to calculate the BM25 scores, specifically in the IDF component. There appears to be a critical omission in the logarithmic term of the IDF calculation: according to the BM25 formula referenced on Wikipedia, 1 should be added inside the logarithm to keep its argument from falling below 1.
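For reference, this is the IDF variant shown on the Wikipedia BM25 page (the base of the logarithm only scales the scores):

IDF(q_i) = ln( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )

where N is the total number of documents and n(q_i) is the number of documents containing the query term q_i (num_docs and df in the code below).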

The issue occurs on line 194 of the code, where the IDF part of the formula is calculated as follows:

log(((SELECT num_docs FROM %fts_schema%.stats) - df + 0.5) / (df + 0.5))

To align with the standard BM25 formula and keep the argument of the log above 1 (so the IDF cannot go negative), the formula should be modified to:

log(((SELECT num_docs FROM %fts_schema%.stats) - df + 0.5) / (df + 0.5) + 1)

Without this adjustment, the input to log can fall within the range (0, 1), where the logarithm yields negative values, potentially leading to incorrect scoring results.
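As a concrete illustration using the example data above: the term 'DuckDB' occurs in all three documents, so num_docs = 3 and df = 3, and the current IDF is negative while the patched one is positive. (DuckDB's log() is base 10; since the term-frequency factor works out to 1 for these short documents, these values appear to match the score_DuckDB columns before and after the patch.)

-- Standalone check of the IDF term for 'DuckDB': num_docs = 3, df = 3
SELECT
    log((3 - 3 + 0.5) / (3 + 0.5))     AS idf_current,  -- ≈ -0.845, negative
    log((3 - 3 + 0.5) / (3 + 0.5) + 1) AS idf_patched;  -- ≈  0.058, positive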

I was working on a PR, but I'm not sure which implementation of BM25 is wanted. There are at least two ways to fix the IDF part:

  1. Use the "Wikipedia" version of the IDF as described above, or
  2. Extend the current version with special cases to avoid a negative IDF. See page 4 of the Arxiv paper (the paper's formula is given there as an image, not reproduced here); a rough sketch of this approach follows this list.
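Since the paper's exact formula is not reproduced here, the following is only a rough sketch of the spirit of option 2, assuming the negative IDF is simply clamped at zero with greatest() (the paper may propose a different lower bound). Only the IDF factor of the subscore expression in the macro would change:

-- Sketch of option 2: clamp the IDF at zero instead of adding +1 inside the log
greatest(log(((SELECT num_docs FROM fts_main_documents.stats) - df + 0.5) / (df + 0.5)), 0)
    * (tf * (k + 1) / (tf + k * (1 - b + b * (len / (SELECT avgdl FROM fts_main_documents.stats))))) AS subscore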

I haven't checked very carefully whether these are valid fixes for the issue, but testing with fix 1 seems to give improved ordering:

INSTALL fts;
LOAD fts;

CREATE OR REPLACE TABLE documents (
    id VARCHAR,
    content VARCHAR
);

INSERT INTO documents VALUES 
    ('doc1', 'DuckDB database lorem'), 
    ('doc2', 'DuckDB database ipsum'), 
    ('doc3', 'DuckDB database ipsum');

PRAGMA create_fts_index('documents', 'id', 'content');

--PATCH: assuming that adding +1 to log(((SELECT num_docs FROM fts_main_documents.stats) - df + 0.5) / (df + 0.5) + 1) is the fix
--                                                                                                                ---
CREATE OR REPLACE MACRO fts_main_documents.match_bm25(docname, query_string, fields := NULL, k := 1.2, b := 0.75, conjunctive := 0) AS (
            WITH tokens AS (
                SELECT DISTINCT stem(unnest(fts_main_documents.tokenize(query_string)), 'english') AS t
            ),
            fieldids AS (
                SELECT fieldid
                FROM fts_main_documents.fields
                WHERE CASE WHEN fields IS NULL THEN 1 ELSE field IN (SELECT * FROM (SELECT UNNEST(string_split(fields, ','))) AS fsq) END
            ),
            qtermids AS (
                SELECT termid
                FROM fts_main_documents.dict AS dict,
                     tokens
                WHERE dict.term = tokens.t
            ),
            qterms AS (
                SELECT termid,
                       docid
                FROM fts_main_documents.terms AS terms
                WHERE CASE WHEN fields IS NULL THEN 1 ELSE fieldid IN (SELECT * FROM fieldids) END
                  AND termid IN (SELECT qtermids.termid FROM qtermids)
            ),
            term_tf AS (
                SELECT termid,
                       docid,
                       COUNT(*) AS tf
                FROM qterms
                GROUP BY docid,
                         termid
            ),
            cdocs AS (
                SELECT docid
                FROM qterms
                GROUP BY docid
                HAVING CASE WHEN conjunctive THEN COUNT(DISTINCT termid) = (SELECT COUNT(*) FROM tokens) ELSE 1 END
            ),
            subscores AS (
                SELECT docs.docid,
                       len,
                       term_tf.termid,
                       tf,
                       df,
                       (log(((SELECT num_docs FROM fts_main_documents.stats) - df + 0.5) / (df + 0.5) + 1) * ((tf * (k + 1) / (tf + k * (1 - b + b * (len / (SELECT avgdl FROM fts_main_documents.stats))))))) AS subscore
                FROM term_tf,
                     cdocs,
                     fts_main_documents.docs AS docs,
                     fts_main_documents.dict AS dict
                WHERE term_tf.docid = cdocs.docid
                  AND term_tf.docid = docs.docid
                  AND term_tf.termid = dict.termid
            ),
            scores AS (
                SELECT docid,
                       sum(subscore) AS score
                FROM subscores
                GROUP BY docid
            )
            SELECT score
            FROM scores,
                 fts_main_documents.docs AS docs
            WHERE scores.docid = docs.docid
              AND docs.name = docname
        );

SELECT 
    id,
    fts_main_documents.match_bm25(id, 'DuckDB') score_DuckDB,
    fts_main_documents.match_bm25(id, 'DuckDB database') score_DuckDB_Database,
    fts_main_documents.match_bm25(id, 'DuckDB database lorem') score_DuckDB_Database_lorem,
    fts_main_documents.match_bm25(id, 'lorem') score_lorem,
    fts_main_documents.match_bm25(id, 'DuckDB database ipsum') score_DuckDB_Database_ipsum,
FROM 
    documents;

This gives the following result table:

┌─────────┬─────────────────────┬───────────────────────┬─────────────────────────────┬─────────────────────┬─────────────────────────────┐
│   id    │    score_DuckDB     │ score_DuckDB_Database │ score_DuckDB_Database_lorem │     score_lorem     │ score_DuckDB_Database_ipsum │
│ varchar │       double        │        double         │           double            │       double        │           double            │
├─────────┼─────────────────────┼───────────────────────┼─────────────────────────────┼─────────────────────┼─────────────────────────────┤
│ doc1    │ 0.05799194697768673 │   0.11598389395537347 │          0.5419526262276546 │ 0.42596873227228116 │         0.11598389395537347 │
│ doc2    │ 0.05799194697768673 │   0.11598389395537347 │         0.11598389395537347 │                     │          0.3201038766112983 │
│ doc3    │ 0.05799194697768673 │   0.11598389395537347 │         0.11598389395537347 │                     │          0.3201038766112983 │
└─────────┴─────────────────────┴───────────────────────┴─────────────────────────────┴─────────────────────┴─────────────────────────────┘

Analysing the result table after the patch:

  1. (Edit, improved example): The query 'DuckDB database ipsum' (score_DuckDB_Database_ipsum) matches documents doc2 and doc3 perfectly, and these documents now score higher than doc1.
  2. (Older, more unclear example): The search query 'DuckDB database lorem' is equal to the content of doc1, and it now gets a higher score (score_DuckDB_Database_lorem: 0.5419526262276546) than the only partially matching query 'lorem' (score_lorem: 0.42596873227228116).
  3. All scores are now positive.

Please let me know if further details are needed or if I can assist in any other way.

OS:

Ubuntu

DuckDB Version:

0.10.3

DuckDB Client:

CLI

Full Name:

Jaakko Routamaa

Affiliation:

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
