-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
What happens?
Result table ordering based on BM25 scores seems to be incorrect. Scores can be even negative if search query contains terms that are very common among searched documents.
To Reproduce
INSTALL fts;
LOAD fts;
CREATE OR REPLACE TABLE documents (
id VARCHAR,
content VARCHAR
);
INSERT INTO documents VALUES
('doc1', 'DuckDB database lorem'),
('doc2', 'DuckDB database ipsum'),
('doc3', 'DuckDB database ipsum');
PRAGMA create_fts_index('documents', 'id', 'content');
SELECT
id,
fts_main_documents.match_bm25(id, 'DuckDB') score_DuckDB,
fts_main_documents.match_bm25(id, 'DuckDB database') score_DuckDB_Database,
fts_main_documents.match_bm25(id, 'DuckDB database lorem') score_DuckDB_Database_lorem,
fts_main_documents.match_bm25(id, 'lorem') score_lorem,
fts_main_documents.match_bm25(id, 'DuckDB database ipsum') score_DuckDB_Database_ipsum,
FROM
documents;
Gives following result table:
┌─────────┬─────────────────────┬───────────────────────┬─────────────────────────────┬─────────────────────┬─────────────────────────────┐
│ id │ score_DuckDB │ score_DuckDB_Database │ score_DuckDB_Database_lorem │ score_lorem │ score_DuckDB_Database_ipsum │
│ varchar │ double │ double │ double │ double │ double │
├─────────┼─────────────────────┼───────────────────────┼─────────────────────────────┼─────────────────────┼─────────────────────────────┤
│ doc1 │ -0.8450980400142569 │ -1.6901960800285138 │ -1.4683473304121575 │ 0.22184874961635637 │ -1.6901960800285138 │
│ doc2 │ -0.8450980400142569 │ -1.6901960800285138 │ -1.6901960800285138 │ │ -1.9120448296448702 │
│ doc3 │ -0.8450980400142569 │ -1.6901960800285138 │ -1.6901960800285138 │ │ -1.9120448296448702 │
└─────────┴─────────────────────┴───────────────────────┴─────────────────────────────┴─────────────────────┴─────────────────────────────┘
Higher score means better match. You can notice that scores are counter intuitive. E.g.
- (Edit, Improved example): Query 'DuckDB database ipsum' (score_DuckDB_Database_ipsum) matches perfectly with documents doc2 and doc3, but clearly worse doc1 gets better score
- (older, more unclear example) search query 'DuckDB database lorem' is equal to doc1 content, but it gets lower score (score_DuckDB_Database_lorem: -0.8450980400142569) than only partially matching query 'lorem' (score_lorem: 0.22184874961635637).
I've identified a potential issue in the formula used to calculate the BM25 scores, specifically in the IDF component. It appears that there is a critical omission in the logarithmic term of the IDF calculation. According to the BM25 formula referenced on Wikipedia, the logarithmic function should include an increment by 1 to prevent values below 1 inside the logarithm.
The issue occurs in line 194 of the code, where the IDF part of the formula is calculated as follows:
log(((SELECT num_docs FROM %fts_schema%.stats) - df + 0.5) / (df + 0.5))
To align with the standard BM25 formula and ensure the log function operates within a valid range, the formula should be modified to:
log(((SELECT num_docs FROM %fts_schema%.stats) - df + 0.5) / (df + 0.5) + 1)
Without this adjustment, the log input can fall within the range [0,1], where the logarithm would yield negative values, potentially leading to incorrect scoring results.
I was working with PR, but I'm not sure which implementation of BM25 is wanted. There are at least two ways to fix IDF part:
- Use "Wikipedia" version of IDF as described above or
- Extend current version with special cases to avoid negative IDF. See page 4 from Arxiv paper. Paper gives formula:
I haven't checked very carefully if these are valid fixes to issue, but testing with fix 1 seems to give improved ordering:
INSTALL fts;
LOAD fts;
CREATE OR REPLACE TABLE documents (
id VARCHAR,
content VARCHAR
);
INSERT INTO documents VALUES
('doc1', 'DuckDB database lorem'),
('doc2', 'DuckDB database ipsum'),
('doc3', 'DuckDB database ipsum');
PRAGMA create_fts_index('documents', 'id', 'content');
--PATCH: assuming that adding +1 to log(((SELECT num_docs FROM fts_main_documents.stats) - df + 0.5) / (df + 0.5) + 1) is the fix
-- ---
CREATE OR REPLACE MACRO fts_main_documents.match_bm25(docname, query_string, fields := NULL, k := 1.2, b := 0.75, conjunctive := 0) AS (
WITH tokens AS (
SELECT DISTINCT stem(unnest(fts_main_documents.tokenize(query_string)), 'english') AS t
),
fieldids AS (
SELECT fieldid
FROM fts_main_documents.fields
WHERE CASE WHEN fields IS NULL THEN 1 ELSE field IN (SELECT * FROM (SELECT UNNEST(string_split(fields, ','))) AS fsq) END
),
qtermids AS (
SELECT termid
FROM fts_main_documents.dict AS dict,
tokens
WHERE dict.term = tokens.t
),
qterms AS (
SELECT termid,
docid
FROM fts_main_documents.terms AS terms
WHERE CASE WHEN fields IS NULL THEN 1 ELSE fieldid IN (SELECT * FROM fieldids) END
AND termid IN (SELECT qtermids.termid FROM qtermids)
),
term_tf AS (
SELECT termid,
docid,
COUNT(*) AS tf
FROM qterms
GROUP BY docid,
termid
),
cdocs AS (
SELECT docid
FROM qterms
GROUP BY docid
HAVING CASE WHEN conjunctive THEN COUNT(DISTINCT termid) = (SELECT COUNT(*) FROM tokens) ELSE 1 END
),
subscores AS (
SELECT docs.docid,
len,
term_tf.termid,
tf,
df,
(log(((SELECT num_docs FROM fts_main_documents.stats) - df + 0.5) / (df + 0.5) + 1)* ((tf * (k + 1)/(tf + k * (1 - b + b * (len / (SELECT avgdl FROM fts_main_documents.stats))))))) AS subscore
FROM term_tf,
cdocs,
fts_main_documents.docs AS docs,
fts_main_documents.dict AS dict
WHERE term_tf.docid = cdocs.docid
AND term_tf.docid = docs.docid
AND term_tf.termid = dict.termid
),
scores AS (
SELECT docid,
sum(subscore) AS score
FROM subscores
GROUP BY docid
)
SELECT score
FROM scores,
fts_main_documents.docs AS docs
WHERE scores.docid = docs.docid
AND docs.name = docname
);
SELECT
id,
fts_main_documents.match_bm25(id, 'DuckDB') score_DuckDB,
fts_main_documents.match_bm25(id, 'DuckDB database') score_DuckDB_Database,
fts_main_documents.match_bm25(id, 'DuckDB database lorem') score_DuckDB_Database_lorem,
fts_main_documents.match_bm25(id, 'lorem') score_lorem,
fts_main_documents.match_bm25(id, 'DuckDB database ipsum') score_DuckDB_Database_ipsum,
FROM
documents;
Gives following result table:
┌─────────┬─────────────────────┬───────────────────────┬─────────────────────────────┬─────────────────────┬─────────────────────────────┐
│ id │ score_DuckDB │ score_DuckDB_Database │ score_DuckDB_Database_lorem │ score_lorem │ score_DuckDB_Database_ipsum │
│ varchar │ double │ double │ double │ double │ double │
├─────────┼─────────────────────┼───────────────────────┼─────────────────────────────┼─────────────────────┼─────────────────────────────┤
│ doc1 │ 0.05799194697768673 │ 0.11598389395537347 │ 0.5419526262276546 │ 0.42596873227228116 │ 0.11598389395537347 │
│ doc2 │ 0.05799194697768673 │ 0.11598389395537347 │ 0.11598389395537347 │ │ 0.3201038766112983 │
│ doc3 │ 0.05799194697768673 │ 0.11598389395537347 │ 0.11598389395537347 │ │ 0.3201038766112983 │
└─────────┴─────────────────────┴───────────────────────┴─────────────────────────────┴─────────────────────┴─────────────────────────────┘
Analysing result table after patch:
- (Edit, Improved example): Query 'DuckDB database ipsum' (score_DuckDB_Database_ipsum) matches perfectly with documents doc2 and doc3 and scores are better for these documents than doc1
- (older, more unclear example) Now search query 'DuckDB database lorem' is equal to doc1 content, and gets higher score (score_DuckDB_Database_lorem:0.5419526262276546) than only partially matching query 'lorem' (score_lorem: 0.42596873227228116).
- Also all scores are positive.
Please let me know if further details are needed or if I can assist in any other way.
OS:
Ubuntu
DuckDB Version:
0.10.3
DuckDB Client:
CLI
Full Name:
Jaakko Routamaa
Affiliation:
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have