Universal Speculative Decoding CandidateGenerator #35029
Conversation
Hi @keyboardAnt - I checked out your changes and ran the following evaluations to measure the speed-up. [The summary, the dataset used, and the Case 1/2/3 results did not survive extraction.] A couple of things remain to be observed:
- Running more evaluation with […]

This only occurs when the assistant model and the target model have different tokenizers - at least, that is what I have consistently observed so far. Any idea what could be causing this? I am running these evaluations on a 4xT4 system with about 64 GB of memory. The models I am using are […]
@gauravjain14 Seems to be related to #22546 (comment).
We have run USD with […]
@jmamou - I’m replying to your question here. Our evaluation so far has focused solely on single-threaded inference, but I don’t see a strong reason to restrict the code to single-threading. The thread-safe lock implementation helps prevent race conditions during multi-threaded execution and is considered standard. @gante, do you happen to know if there are any multithreading use cases today? I know there was no multiprocessing support as of August 2024 (#32864). If there aren’t any actual use cases, do you think removing this thread-safe locking functionality would make sense?
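For readers following along, here is a minimal illustration of the kind of thread-safe locking pattern being discussed - the class and method names are hypothetical, not the PR's actual implementation:

```python
import threading


class SharedTranslatorCache:
    """Hypothetical shared cache guarded by a lock (illustrative only)."""

    _lock = threading.Lock()
    _cache: dict = {}

    @classmethod
    def get_or_create(cls, key, factory):
        # The lock serializes access so two threads cannot race to build
        # and overwrite the same cached object at the same time.
        with cls._lock:
            if key not in cls._cache:
                cls._cache[key] = factory()
            return cls._cache[key]
```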
* remove threading
* fix logits_processor
* fix test device
* move AssistantToTargetTranslator
* fixup
@keyboardAnt In general, we avoid threading in the core library whenever possible -- not to say that these issues are not fixable, but since we lack the capacity to handle existing issues, we're trying to prevent code changes that are likely to cause issues in the future 🤗
@gante
Looks good to me :)
Two minor nits and I'm happy to approve and merge:
- See comment below
- Missing an integration test in `tests/generation/test_candidate_generator`: with temperature set to nearly 0, USD should match vanilla sampling. Make sure to use a small model - it can even be a pair of dummy models like `hf-internal-testing/tiny-random-gpt2` and `hf-internal-testing/tiny-random-MistralForCausalLM`, since the actual test doesn't matter. A sketch of such a test follows below.
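A rough sketch of the suggested test - the test name, prompt, and exact assertion are illustrative, not the final test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def test_usd_matches_vanilla_sampling_at_near_zero_temperature():
    target_ckpt = "hf-internal-testing/tiny-random-MistralForCausalLM"
    assistant_ckpt = "hf-internal-testing/tiny-random-gpt2"
    target_tok = AutoTokenizer.from_pretrained(target_ckpt)
    assistant_tok = AutoTokenizer.from_pretrained(assistant_ckpt)
    target = AutoModelForCausalLM.from_pretrained(target_ckpt)
    assistant = AutoModelForCausalLM.from_pretrained(assistant_ckpt)

    inputs = target_tok("test", return_tensors="pt")
    common = dict(do_sample=True, temperature=1e-4, max_new_tokens=5)

    torch.manual_seed(0)
    vanilla = target.generate(**inputs, **common)

    torch.manual_seed(0)
    usd = target.generate(
        **inputs,
        **common,
        assistant_model=assistant,
        tokenizer=target_tok,
        assistant_tokenizer=assistant_tok,
    )

    # With temperature ~0, sampling is effectively greedy, so both
    # decoding paths should produce identical tokens.
    assert torch.equal(vanilla, usd)
```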
LGTM, let's make CI green and merge 🤗
@gante
📄 ICML oral (top 1%): Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies, https://arxiv.org/abs/2502.05202 - Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Gaurav Jain, Moshe Wasserblat, David Harel
This PR is a collaborative effort with @jmamou and @gauravjain14. This PR supersedes #34760 and builds upon #35009.
This PR is open for initial review, although some areas are still under development.
What does this PR do?
This PR introduces the `UniversalSpeculativeDecodingGenerator` class, enabling speculative decoding for assistants with slightly different tokenizers. The key addition is two logits processors (`LogitsProcessor`) that ensure the assistant generates tokens exclusively from the target vocabulary, maintaining alignment and preserving the target distribution without altering the verification method. Theoretically, it is agnostic to the `do_sample` choice. This avoids issues like #32867 and #33534 and sets the stage for advanced universal speculative decoding techniques that we are currently researching and have not yet published. A sketch of the vocabulary-masking idea follows below.
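A minimal sketch of that masking idea - it assumes a precomputed boolean mask (hypothetical name `assistant_token_in_target_vocab`) marking which assistant-vocabulary tokens also exist in the target vocabulary, and is not the PR's exact code:

```python
import torch
from transformers import LogitsProcessor


class SuppressTokensOutsideTargetVocab(LogitsProcessor):
    """Illustrative processor: forbid assistant tokens that are absent
    from the target vocabulary by zeroing out their probability."""

    def __init__(self, assistant_token_in_target_vocab: torch.BoolTensor):
        # Shape: (assistant_vocab_size,); True where the token is shared.
        self.keep = assistant_token_in_target_vocab

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # -inf scores become zero probability after softmax, so the
        # assistant can only propose tokens the target model can verify.
        return scores.masked_fill(~self.keep, float("-inf"))
```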
Motivation and Context
This update resolves prior inconsistencies in speculative decoding caused by misaligned vocabularies. Key benefits include: […]
This PR is a step toward advancements in Universal Assisted Generation, in collaboration with @danielkorat, @orenpereg, @mosheber, @jmamou, @gante, @lewtun, and @MosheWasserb.
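For reference, universal assisted generation with mismatched tokenizers is invoked roughly as follows - the dummy checkpoints stand in for a real target/assistant pair:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_ckpt = "hf-internal-testing/tiny-random-MistralForCausalLM"  # placeholder target
assistant_ckpt = "hf-internal-testing/tiny-random-gpt2"  # placeholder assistant
target_tok = AutoTokenizer.from_pretrained(target_ckpt)
assistant_tok = AutoTokenizer.from_pretrained(assistant_ckpt)
target = AutoModelForCausalLM.from_pretrained(target_ckpt)
assistant = AutoModelForCausalLM.from_pretrained(assistant_ckpt)

inputs = target_tok("The capital of France is", return_tensors="pt")
# Passing both tokenizers lets `generate` translate candidate tokens
# between the two vocabularies via the universal candidate generator.
outputs = target.generate(
    **inputs,
    assistant_model=assistant,
    tokenizer=target_tok,
    assistant_tokenizer=assistant_tok,
)
print(target_tok.batch_decode(outputs, skip_special_tokens=True))
```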
Related
Issues: […]
PRs:
- `test_generated_length_assisted_generation` #34935 - please review and merge

Dependencies
- Merge `AssistedCandidateGenerator` for Improved Modularity and Reusability (#35009) first.

Before Submitting Checklist
Who can review?
@gante / @ArthurZucker / @zucchini-nlp