
Conversation

keyboardAnt
Contributor

@keyboardAnt keyboardAnt commented Nov 30, 2024

📄 ICML oral (top 1%): Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies, https://arxiv.org/abs/2502.05202 - Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Gaurav Jain, Moshe Wasserblat, David Harel


This PR is a collaborative effort with @jmamou and @gauravjain14. It supersedes #34760 and builds upon #35009.


This PR is open for initial review, although some areas are still under development.

What does this PR do?

This PR introduces the UniversalSpeculativeDecodingGenerator class, enabling speculative decoding with assistant models whose tokenizers differ slightly from the target's. The key addition is two logits processors (LogitsProcessor) that ensure the assistant generates tokens exclusively from the target vocabulary, maintaining alignment and preserving the target distribution without altering the verification method. In theory, the approach is agnostic to the do_sample choice. This avoids issues like #32867 and #33534 and sets the stage for advanced universal speculative decoding techniques that we are currently researching and have not yet published.
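For intuition, here is a minimal, simplified sketch of the kind of logits processing involved: a processor that masks out assistant tokens whose string form does not exist in the target vocabulary. The class name and the mapping construction are illustrative only, not the PR's actual implementation:

```python
import torch
from transformers import LogitsProcessor


class SuppressTokensNotInTargetVocab(LogitsProcessor):
    """Illustrative sketch: force the assistant to only propose tokens
    whose string form also exists in the target tokenizer's vocabulary."""

    def __init__(self, assistant_tokenizer, target_tokenizer):
        target_vocab = set(target_tokenizer.get_vocab().keys())
        # Assistant token ids whose string is NOT in the target vocab get masked out.
        self.suppress_ids = torch.tensor(
            [
                tok_id
                for tok, tok_id in assistant_tokenizer.get_vocab().items()
                if tok not in target_vocab
            ],
            dtype=torch.long,
        )

    def __call__(self, input_ids, scores):
        scores = scores.clone()
        scores[:, self.suppress_ids.to(scores.device)] = float("-inf")
        return scores
```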


Motivation and Context

This update resolves prior inconsistencies in speculative decoding caused by misaligned vocabularies. Key benefits include:

  • Ensuring the assistant generates only tokens present in the target vocabulary.
  • Lossless preservation of the target distribution.
  • Compatibility with future speculative decoding advancements.

This PR is a step toward advancements in Universal Assisted Generation, in collaboration with @danielkorat, @orenpereg, @mosheber, @jmamou, @gante, @lewtun, and @MosheWasserb.


Related

Issues:

PRs:


Dependencies


Before Submitting Checklist


Who can review?

@gante / @ArthurZucker / @zucchini-nlp

@gauravjain14
Contributor

Hi @keyboardAnt - I checked out your changes and ran the following evaluations to measure the speed-up from universal assisted generation.

Here is a summary:

Dataset used: tau/scrolls (qasper)
Number of samples evaluated per run: 20
Avg. speedup observed = mean([baseline_time[0]/assisted_time[0], baseline_time[1]/assisted_time[1], ..., baseline_time[N-1]/assisted_time[N-1]])
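For reference, a minimal sketch of how this metric could be computed; generate_baseline and generate_assisted stand in for the benchmark script's own generation functions and are not part of this PR:

```python
import time
from statistics import mean


def timed(generate_fn, prompt):
    """Return (generated_text, elapsed_seconds) for a single generation call."""
    start = time.perf_counter()
    text = generate_fn(prompt)
    return text, time.perf_counter() - start


def average_speedup(prompts, generate_baseline, generate_assisted):
    """mean(baseline_time[i] / assisted_time[i]) over the evaluated samples."""
    ratios = []
    for prompt in prompts:
        _, baseline_time = timed(generate_baseline, prompt)
        _, assisted_time = timed(generate_assisted, prompt)
        ratios.append(baseline_time / assisted_time)
    return mean(ratios)
```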

Case 1:
max_new_tokens=100
Avg Speed up observed = 1.13x

Case 2:
max_new_tokens=256
Avg Speed up observed = 1.46x

Case 3:
max_new_tokens=512
Avg Speed up observed = 1.72x

A couple of things remain to be checked:

  1. When does the speedup start saturating? I don't expect we will see speedup grow endlessly as more tokens are generated by the model(s).

  2. Does assisted generation affect accuracy relative to the baseline (no-assistant) generation mode?

@gauravjain14
Copy link
Contributor

Running more evaluations with USD, I am seeing them fail with the following errors:

../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [14,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                    
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [14,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                    
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [14,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                    
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [14,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                    
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [14,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                    
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [14,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

File "/disk/universal_assisted_generation/perf_comparison_llama_qwen.py", line 70, in <module>
    assisted_text, assisted_time = generate_assisted(
                                   ^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/perf_comparison_llama_qwen.py", line 51, in generate_assisted
    outputs = target_model.generate(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/anaconda3/envs/uag/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/transformers/src/transformers/generation/utils.py", line 2213, in generate
    result = self._assisted_decoding(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/transformers/src/transformers/generation/utils.py", line 4318, in _assisted_decoding
    outputs = self(**model_inputs)
              ^^^^^^^^^^^^^^^^^^^^
  File "/disk/anaconda3/envs/uag/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/anaconda3/envs/uag/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/anaconda3/envs/uag/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/transformers/src/transformers/models/llama/modeling_llama.py", line 1163, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/disk/envs/uag/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/anaconda3/envs/uag/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/transformers/src/transformers/models/llama/modeling_llama.py", line 883, in forward
    causal_mask = self._update_causal_mask(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/transformers/src/transformers/models/llama/modeling_llama.py", line 973, in _update_causal_mask
    if AttentionMaskConverter._ignore_causal_mask_sdpa(
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/universal_assisted_generation/transformers/src/transformers/modeling_attn_mask_utils.py", line 284, in _ignore_causal_mask_sdpa
    elif not is_tracing and torch.all(attention_mask == 1):
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This only occurs when the assistant model and the target model have different tokenizers, at least as far as I have consistently observed so far. Any idea what could be causing this?

I am running these evaluations on a 4xT4 system with about 64 GB of memory. The models I am using are:

target_checkpoint = "meta-llama/Llama-3.1-8B-Instruct"
assistant_checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"
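For context, a minimal sketch of roughly how this setup is invoked; the actual perf_comparison_llama_qwen.py script in the traceback may differ, and the prompt and generation parameters here are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_checkpoint = "meta-llama/Llama-3.1-8B-Instruct"
assistant_checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"

target_tokenizer = AutoTokenizer.from_pretrained(target_checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)

target_model = AutoModelForCausalLM.from_pretrained(
    target_checkpoint, torch_dtype=torch.float16, device_map="auto"
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_checkpoint, torch_dtype=torch.float16, device_map="auto"
)

inputs = target_tokenizer("Summarize the article below.", return_tensors="pt").to(target_model.device)

# Universal assisted generation: both tokenizers are passed so that draft
# tokens can be translated between the two vocabularies.
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=target_tokenizer,
    assistant_tokenizer=assistant_tokenizer,
    max_new_tokens=256,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```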

@jmamou
Contributor

jmamou commented Dec 2, 2024


@gauravjain14
Thanks for sharing your experiments! I am currently running benchmarks and will hopefully share them soon. Regarding your two questions:

  1. Note that we can reach the EOS token before reaching max_new_tokens, in which case generation stops at EOS. This often happens in summarization tasks.
  2. No, accuracy is not affected; USD is lossless.

@jmamou
Contributor

jmamou commented Dec 2, 2024


@gauravjain14
Which target/draft models did you use for these benchmark runs?

@jmamou
Contributor

jmamou commented Dec 2, 2024


@gauravjain14 the CUDA device-side assert you reported above seems to be related to #22546 (comment)

@jmamou
Contributor

jmamou commented Dec 2, 2024

We have run USD with target='meta-llama/Llama-3.1-70B' and draft='Qwen/Qwen2-0.5B-Instruct' on SCROLLS on 2 A100 GPUs and got a speedup of 2.65x.
As reported in #34760 (comment), the overlap of the draft vocabulary with respect to the target vocabulary is 85%.
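For illustration, one straightforward way to compute such an overlap (the exact normalization used in #34760 may differ):

```python
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")
draft_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

target_vocab = set(target_tok.get_vocab())
draft_vocab = set(draft_tok.get_vocab())

# Fraction of draft tokens whose string form also appears in the target vocabulary.
overlap = len(draft_vocab & target_vocab) / len(draft_vocab)
print(f"Draft-to-target vocab overlap: {overlap:.1%}")
```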

@keyboardAnt
Contributor Author

@jmamou - I’m replying to your question here.

Our evaluation so far has focused solely on single-threaded inference, but I don’t see a strong reason to restrict the code to single-threading. The thread-safe lock implementation helps prevent race conditions during multi-threaded execution and is considered standard.

@gante, do you happen to know if there are any multithreading use cases today? I know there was no multiprocessing support as of August 2024 (#32864). If there aren’t any actual use cases, do you think removing this thread-safe locking functionality would make sense?

@gante
Member

gante commented Feb 26, 2025

@keyboardAnt In general, we avoid threading in the core library whenever possible -- transformers is used in many places, and not all of them are thread-safe. Here's an example of a threading issue caused by transformers: gradio-app/gradio#4016

Not to say that these issues are not fixable, but since we lack the capacity to handle existing issues, we're trying to prevent code changes that are likely to cause issues in the future 🤗

@jmamou
Contributor

jmamou commented Feb 26, 2025


@gante
As you suggested, we avoided the need for threading by building the map in generate and passing it to the translator candidate generator (a sketch of the idea is below).
We have addressed all the comments.
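For illustration only, a sketch of the kind of assistant-to-target token-id map this refers to; the helper name and exact construction are hypothetical, not the PR's actual internals:

```python
def build_assistant_to_target_map(assistant_tokenizer, target_tokenizer):
    """Map each assistant token id to the id of the identical token string
    in the target vocabulary (only for tokens present in both vocabularies)."""
    target_vocab = target_tokenizer.get_vocab()
    return {
        assistant_id: target_vocab[token]
        for token, assistant_id in assistant_tokenizer.get_vocab().items()
        if token in target_vocab
    }
```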

Member

@gante gante left a comment


Looks good to me :)

Two minor nits and I'm happy to approve and merge:

  1. See comment below
  2. Missing an integration test in tests/generation/test_candidate_generator: with the temperature set to nearly 0, USD should match vanilla sampling (make sure to use small models; it can even be a pair of dummy models like hf-internal-testing/tiny-random-gpt2 and hf-internal-testing/tiny-random-MistralForCausalLM, since the actual generations don't matter). See the sketch below.
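A rough sketch of what such a test could look like, assuming universal assisted generation is invoked by passing both tokenizers to generate; this is not the test that was eventually merged:

```python
import unittest

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class UsdMatchesVanillaSamplingTest(unittest.TestCase):
    def test_usd_matches_vanilla_sampling_at_near_zero_temperature(self):
        target_ckpt = "hf-internal-testing/tiny-random-gpt2"
        assistant_ckpt = "hf-internal-testing/tiny-random-MistralForCausalLM"

        target_tokenizer = AutoTokenizer.from_pretrained(target_ckpt)
        assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_ckpt)
        target_model = AutoModelForCausalLM.from_pretrained(target_ckpt)
        assistant_model = AutoModelForCausalLM.from_pretrained(assistant_ckpt)

        inputs = target_tokenizer("Hello world", return_tensors="pt")
        # Near-zero temperature makes sampling effectively greedy, so the
        # lossless USD output should match the unassisted output.
        gen_kwargs = dict(do_sample=True, temperature=1e-4, max_new_tokens=5)

        torch.manual_seed(0)
        baseline = target_model.generate(**inputs, **gen_kwargs)

        torch.manual_seed(0)
        assisted = target_model.generate(
            **inputs,
            assistant_model=assistant_model,
            tokenizer=target_tokenizer,
            assistant_tokenizer=assistant_tokenizer,
            **gen_kwargs,
        )

        self.assertTrue(torch.equal(baseline, assisted))


if __name__ == "__main__":
    unittest.main()
```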

Member

@gante gante left a comment


LGTM, let's make CI green and merge 🤗

@jmamou
Contributor

jmamou commented Feb 26, 2025


@gante
CircleCI tests are green 😄

@gante gante merged commit d18d9c3 into huggingface:main Feb 26, 2025
21 checks passed
@jmamou jmamou mentioned this pull request Mar 25, 2025