
Conversation

zucchini-nlp
Member

@zucchini-nlp commented Dec 6, 2024

What does this PR do?

Fixes #35099. We need to manually move inputs to the correct device because the assistant and target models can be initialized on different devices. We already move the inputs to the assistant's device in get_candidates, but we don't move them back before the target model's forward pass.
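
For context, a minimal sketch of the setup that triggers the bug, assuming two GPUs; the model names are illustrative and not taken from the PR or the issue:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-large").to("cuda:0")
assistant = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda:1")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")

# Before this fix, candidate ids came back on the assistant's device (cuda:1)
# and reached the target model's forward there, raising a device-mismatch error.
out = target.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```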

Btw, it seems we are now hitting more issues with certain generation techniques or caches not working in multi-GPU settings. We don't have many tests for multiple GPUs, so I am not sure whether that's intended or whether we should add more integration tests, and whether those tests would run as part of the pull-request CI. cc @gante for this question

cc @jmamou

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```diff
@@ -4290,6 +4290,7 @@ def _assisted_decoding(
         dim=0,
     )

+    candidate_input_ids = candidate_input_ids.to(self.device)
```
Contributor

In a previous PR (https://github.com/keyboardAnt/transformers/pull/4/files) I proposed doing it just after get_candidates, at the same place where we move candidate_logits to self.device.
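
A rough sketch of that placement (names follow the surrounding `_assisted_decoding` code; this is a sketch, not the exact diff):

```python
candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)

# Move the candidates to the target device right away, next to candidate_logits,
# instead of further down in the loop.
candidate_input_ids = candidate_input_ids.to(self.device)
if candidate_logits is not None:
    candidate_logits = candidate_logits.to(self.device)
```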

Member Author

Okay, might be better to move it there for easier readability :)

Collaborator

@ArthurZucker left a comment

Thanks! Regarding multi-GPU, we don't have fast tests, but we could add them with GitHub Actions.

@zucchini-nlp merged commit 0938b57 into huggingface:main Dec 10, 2024
24 checks passed
jmamou
Contributor

@jmamou commented Dec 10, 2024

@zucchini-nlp @ArthurZucker
For assisted generation (AG) with multi-GPU, it would be beneficial to document that, for optimal speedup, the assistant model should be placed on the default/first device of the target model (target.device). This placement avoids the overhead of transferring the candidate token-ID tensor from assistant.device to target.device after each speculative iteration, and then back to assistant.device with the validated token IDs after target validation.
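
A minimal sketch of that recommended placement (model names are illustrative):

```python
from transformers import AutoModelForCausalLM

# The target may be sharded across GPUs; its inputs live on target.device.
target = AutoModelForCausalLM.from_pretrained("big-target-model", device_map="auto")

# Keep the assistant on the target's default/first device so candidate ids
# don't hop between devices on every speculative iteration.
assistant = AutoModelForCausalLM.from_pretrained("small-assistant-model").to(target.device)
```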

zucchini-nlp
Member Author

@jmamou yes, feel free to add it to the docs in a follow-up PR if relevant.

Successfully merging this pull request may close these issues.

Running AG and SD when assistant and target models are on different devices