
Conversation

zucchini-nlp
Member

@zucchini-nlp commented Dec 6, 2024

What does this PR do?

Fixes #35099. We need to manually move inputs to the correct device because the assistant and target models can be initialized on different devices. We already move the inputs to the assistant's device in get_candidates, but we don't move them back before the target model's forward pass.
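
For context, a minimal sketch of the setup that triggers the bug, assuming two GPUs; the model names are illustrative and not taken from the PR or the issue:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-large").to("cuda:0")
assistant = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda:1")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")

# Before this fix, candidate ids came back on the assistant's device (cuda:1)
# and reached the target model's forward there, raising a device-mismatch error.
out = target.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```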

Btw, it seems we are now hitting more issues with certain generation techniques or caches not working in multi-GPU settings. We don't have many tests for multiple GPUs, so I am not sure whether that's intended or whether we should add more integration tests, and whether those tests would run as part of the pull-request CI. cc @gante for this question

cc @jmamou

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```diff
@@ -4290,6 +4290,7 @@ def _assisted_decoding(
         dim=0,
     )

+    candidate_input_ids = candidate_input_ids.to(self.device)
```
Contributor

In a previous PR (https://github.com/keyboardAnt/transformers/pull/4/files) I proposed doing it just after get_candidates, at the same place where we move candidate_logits to self.device.
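
A rough sketch of that placement (names follow the surrounding `_assisted_decoding` code; this is a sketch, not the exact diff):

```python
candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)

# Move the candidates to the target device right away, next to candidate_logits,
# instead of further down in the loop.
candidate_input_ids = candidate_input_ids.to(self.device)
if candidate_logits is not None:
    candidate_logits = candidate_logits.to(self.device)
```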

Member Author

Okay, might be better to move it there for easier readability :)

Collaborator

@ArthurZucker left a comment

Thanks! Regarding multi-GPU, we don't have fast tests, but we could add them with GitHub Actions.

@zucchini-nlp merged commit 0938b57 into huggingface:main Dec 10, 2024
24 checks passed
jmamou
Contributor

@jmamou commented Dec 10, 2024

@zucchini-nlp @ArthurZucker
For assisted generation (AG) with multi-GPU, it would be beneficial to document that, for optimal speedup, the assistant model should be placed on the default/first device of the target model (target.device). This placement avoids the overhead of transferring the candidate token-ID tensor from assistant.device to target.device after each speculative iteration, and then back to assistant.device with the validated token IDs after target validation.
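
A minimal sketch of that recommended placement (model names are illustrative):

```python
from transformers import AutoModelForCausalLM

# The target may be sharded across GPUs; its inputs live on target.device.
target = AutoModelForCausalLM.from_pretrained("big-target-model", device_map="auto")

# Keep the assistant on the target's default/first device so candidate ids
# don't hop between devices on every speculative iteration.
assistant = AutoModelForCausalLM.from_pretrained("small-assistant-model").to(target.device)
```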

zucchini-nlp
Member Author

@jmamou yes, feel free to add it to the docs in a follow-up PR if relevant.

Successfully merging this pull request may close these issues.

Running AG and SD when assistant and target models are on different devices