GermanDPR Dataset Causes Cross-Encoder Failure Due to Unexpected dict #1609

@sam-hey

When the GermanDPR dataset is used with a CrossEncoder, the dataset returns each document as a dict instead of a str. This raises an error because the CrossEncoder expects its text inputs to be strings.

The following error is raised when processing the dataset:

/sentence_transformers/cross_encoder/CrossEncoder.py", line 170, in smart_batching_collate_text_only
    texts[idx].append(text.strip())
AttributeError: 'dict' object has no attribute 'strip' 
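
The failure can be reproduced without loading a model. The following is a minimal sketch, assuming the collate step simply calls `.strip()` on every element of a query/document pair; `collate_text_only` is a hypothetical stand-in written for illustration, not the library's actual function, and the sample query and document are made up.

```python
def collate_text_only(batch):
    """Hypothetical stand-in for the collate step: strips every text in each pair."""
    texts = [[] for _ in range(len(batch[0]))]
    for example in batch:
        for idx, text in enumerate(example):
            # Fails with AttributeError if `text` is a dict rather than a str.
            texts[idx].append(text.strip())
    return texts


# Illustrative inputs: the document has the dict shape described in this issue.
query = "Was ist die Hauptstadt von Deutschland?"
doc = {"title": "Berlin", "text": "Berlin ist die Hauptstadt von Deutschland."}

try:
    collate_text_only([[query, doc]])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

With a plain string in place of `doc`, the same call succeeds, which points at the document type rather than the query.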

I see two potential ways to address this issue:

  1. Modify the GermanDPR Dataset:
    Combine title and formatted_content fields from the dict into a single str before passing it to the CrossEncoder.
    Example: f"{title} {formatted_content}"
  2. Update CrossEncoder Logic:
    Add a check in CrossEncoder to handle cases where a dict is passed instead of a str.
    If a dict is detected, convert it to a str within the CrossEncoder.

For reference, the dataset currently builds each corpus entry as a dict:

result[id_value] = {"title": title, "text": formatted_content}

Related evaluator code:

https://github.com/embeddings-benchmark/mteb/blob/b81b584ceb1bd8a42a676482edcc19c90de75cb1/mteb/evaluation/evaluators/RetrievalEvaluator.py#L311:L320
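
Both options above come down to the same conversion, applied either when the dataset is built or just before the pair reaches the CrossEncoder. A minimal sketch of that conversion follows, assuming documents are either plain strings or dicts with `title` and `text` keys; the helper name `document_to_text` and the sample corpus are hypothetical and not part of mteb or sentence-transformers.

```python
from typing import Mapping, Union

# A document is either already a string or a dict-like with "title"/"text" keys.
Document = Union[str, Mapping[str, str]]


def document_to_text(doc: Document) -> str:
    """Return a single string for a document (hypothetical helper).

    Strings pass through (stripped); dicts are joined as "<title> <text>".
    Missing keys are treated as empty strings.
    """
    if isinstance(doc, str):
        return doc.strip()
    title = doc.get("title", "")
    text = doc.get("text", "")
    return f"{title} {text}".strip()


# Illustrative corpus mixing both shapes.
corpus = {
    "doc1": {"title": "Berlin", "text": "Berlin ist die Hauptstadt von Deutschland."},
    "doc2": "  Ein Dokument ohne Titel.  ",
}
corpus_texts = {doc_id: document_to_text(doc) for doc_id, doc in corpus.items()}
```

Applying the helper in the dataset (option 1) keeps the CrossEncoder path unchanged, while applying it at the evaluator or CrossEncoder boundary (option 2) would also cover other datasets that return dict documents.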

Labels: bug (Something isn't working)
