Skip to content

Conversation

tsirif
Copy link

@tsirif tsirif commented Oct 24, 2024

What: Change the way sentence deduplication is made in PairClassificationEvaluator
How: Use the same staticmethod as in RerankingEvaluator: _encode_unique_texts
Why: Compared to list(set(sentences1 + sentences2)), the function _encode_unique_texts is definitely stable in the order that the deduplicated sentences are generated. This is crucial if the implementation of model.encode is parallelizing the work via DDP, in which case the splitting of sentences to chunks across DDP workers expects that the same exact list of sentences (order too!) is given to each call of model.encode...

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@tsirif tsirif force-pushed the fix/pairwise_stable_dedup branch from 18e0fdd to d4c8e72 Compare October 24, 2024 17:10
Copy link
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR - looked it over. I Agree with @Samoed suggestions here. Otherwise, not much to add.

@KennethEnevoldsen
Copy link
Contributor

KennethEnevoldsen commented Nov 11, 2024

@tsirif would love to get this merged in do you have the time to work on it (otherwise it might be worth closing it or allow others to finish it up)

@KennethEnevoldsen KennethEnevoldsen changed the base branch from main to v2.0.0 November 11, 2024 09:29
@tsirif
Copy link
Author

tsirif commented Nov 11, 2024 via email

@isaac-chung isaac-chung requested a review from Samoed December 18, 2024 20:31
@isaac-chung isaac-chung merged commit d6130ad into embeddings-benchmark:v2.0.0 Dec 19, 2024
10 checks passed
@isaac-chung
Copy link
Collaborator

Thanks @tsirif for the foundation of this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants