-
Notifications
You must be signed in to change notification settings - Fork 6.2k
Avoid DtoH sync from access of nonzero() item in scheduler #11696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a cool discovery! We discussed in the past about the sync that occurs for first step and it's due to the timestep lookup to figure out the starting index. For text-to-X pipelines, the self.scheduler.set_begin_index(0)
solution looks good and probably something we can propagate to all pipelines. For other pipelines that support things like denoising strength (for example, img-to-img/vid-to-vid task), or more specifically any model that starts denoising at a custom timestep index instead of 0, we will require doing the lookup though (atleast with current design).
custom timesteps uses diffusers/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py Line 659 in f3e0911
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
yes agree we should apply this change to all text2image pipelines, test-wise, as long as the current test pass we are fine, I think this fix is sufficient for all pipelines though, if not, we can look at them case by case |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@yiyixuxu I think there are many instances where we don't do it the same way as done in SDXL. These will incur the extra cost from
|
@a-r-r-o-w yeah, i know, we can fix them though. I think unless you need to look up on every step, it should work by setting a index in the beginning |
@jbschlosser @a-r-r-o-w |
@yiyixuxu thank you very much for the treatment of this PR. Appreciate that. |
What does this PR do?
(discussed with @sayakpaul in Slack) I have been optimizing inference performance for the Flux model and saw a DtoH sync point in the performance trace coming from the use of nonzero() followed by item() in the scheduler's
index_for_timestep()
logic:diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py
Lines 351 to 363 in b0f7036
This sync causes a minor but non-negligible gap in GPU utilization during the first timestep, especially when torch.compile is utilized (due to cpu-side Dynamo cache lookup overhead after the sync point), as shown here:
AFAICT this can be avoided by manually calling
scheduler.set_begin_index(0)
, so this PR does that forFluxPipeline
. Locally, I was able to see the sync point go away after this change.Insights welcome regarding:
FluxPipeline
?