GermanDPR Dataset Causes Cross-Encoder Failure Due to Unexpected dict #1609

When the GermanDPR dataset is used with a CrossEncoder, the dataset returns a dict instead of a str. This raises an error because the CrossEncoder expects each text to be a string and calls `strip()` on it.

The following error is raised when processing the dataset:
```
/sentence_transformers/cross_encoder/CrossEncoder.py", line 170, in smart_batching_collate_text_only
    texts[idx].append(text.strip())
AttributeError: 'dict' object has no attribute 'strip' 
```
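The failure can be reproduced outside the library with a minimal sketch. The pair below is hypothetical sample data, and the key names `title` and `text` are assumptions, not taken from the dataset. The list comprehension only imitates the `text.strip()` call shown in the traceback.

```python
# Hypothetical query/document pair; the document is a dict rather than a str.
pair = [
    "Wer schrieb Faust?",
    {"title": "Faust", "text": "Faust ist ein Drama von Goethe."},
]

message = None
try:
    # Imitates the strip() call from the traceback above.
    cleaned = [text.strip() for text in pair]
except AttributeError as exc:
    message = str(exc)

print(message)
```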

I see two potential ways to address this issue:

1. Modify the GermanDPR dataset:
   Combine the title and formatted_content fields of the dict into a single str before passing it to the CrossEncoder.
   Example: `f"{title} {formatted_content}"`
2. Update the CrossEncoder logic:
   Add a check that handles the case where a dict is passed instead of a str.
   If a dict is detected, convert it to a str before it reaches the CrossEncoder.
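Either option comes down to the same conversion, which could be sketched roughly as follows. The helper name `doc_to_text` and the default key names `title` and `text` are assumptions for illustration; the real keys should be checked against the dataset code linked below.

```python
from typing import Mapping, Union

Doc = Union[str, Mapping[str, str]]


def doc_to_text(doc: Doc, title_key: str = "title", text_key: str = "text") -> str:
    """Return a plain string for an entry that may be a str or a dict.

    Hypothetical helper: the key names are assumptions, not confirmed
    against the GermanDPR loader.
    """
    if isinstance(doc, str):
        return doc.strip()
    # Missing keys are treated as empty strings.
    title = doc.get(title_key, "")
    text = doc.get(text_key, "")
    return f"{title} {text}".strip()


# Build the [query, document] pairs with plain strings only.
query = "Wer schrieb Faust?"
docs = [
    {"title": "Faust", "text": "Faust ist ein Drama von Goethe."},
    "Ein Dokument, das bereits ein String ist.",
]
pairs = [[query, doc_to_text(doc)] for doc in docs]
```

Applying the helper where the pairs are built corresponds to option 2. Applying it once when the corpus is loaded corresponds to option 1.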




Relevant code:

- GermanDPR dataset: https://github.com/embeddings-benchmark/mteb/blob/b81b584ceb1bd8a42a676482edcc19c90de75cb1/mteb/tasks/Retrieval/deu/GermanDPRRetrieval.py#L57
- Retrieval evaluator: https://github.com/embeddings-benchmark/mteb/blob/b81b584ceb1bd8a42a676482edcc19c90de75cb1/mteb/evaluation/evaluators/RetrievalEvaluator.py#L311:L320

