Assisted decoding multi-gpu #35116
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
src/transformers/generation/utils.py (outdated)

@@ -4290,6 +4290,7 @@ def _assisted_decoding(
                dim=0,
            )

+           candidate_input_ids = candidate_input_ids.to(self.device)
In a previous PR (https://github.com/keyboardAnt/transformers/pull/4/files) I proposed doing this just after get_candidates, at the same place where we move candidate_logits to self.device.
Okay, it might be better to move it there for easier readability :)
Thanks! Regarding multi-GPU, we don't have fast tests, but we could add them with GitHub Actions.
@zucchini-nlp @ArthurZucker
@jmamou yes, feel free to add it in the docs in subsequent open PRs if relevant
What does this PR do?
Fixes #35099. We need to manually move inputs to the correct device, as the assistant and target models can be initialized on different devices. We already move the input to the assistant's device in get_candidate_inputs, but we don't do it for the target model's forward pass.

Btw, it seems we are now having more trouble with certain generation techniques or caches not working in a multi-GPU setting. We don't have many tests for multiple GPUs, so I am not sure whether that's intended or whether we can add more integration tests, and whether those tests would run as part of the pull request CI. cc @gante for this question
cc @jmamou