Avoid DtoH sync from access of nonzero() item in scheduler #11696


Merged · 1 commit · Jun 11, 2025

Conversation

@jbschlosser (Contributor) commented Jun 11, 2025

What does this PR do?

(discussed with @sayakpaul in Slack) I have been optimizing inference performance for the Flux model and saw a device-to-host (DtoH) sync point in the performance trace, coming from the use of nonzero() followed by item() in the scheduler's index_for_timestep() logic:

def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps
    indices = (schedule_timesteps == timestep).nonzero()
    # The sigma index that is taken for the **very** first `step`
    # is always the second index (or the last index if there is only 1)
    # This way we can ensure we don't accidentally skip a sigma in
    # case we start in the middle of the denoising schedule (e.g. for image-to-image)
    pos = 1 if len(indices) > 1 else 0
    return indices[pos].item()

This sync causes a minor but non-negligible gap in GPU utilization during the first timestep, especially when torch.compile is used (due to CPU-side Dynamo cache lookup overhead after the sync point), as shown here:

[Image: trace_with_nonzero2 — GPU trace showing the sync-induced utilization gap]

AFAICT this can be avoided by manually calling scheduler.set_begin_index(0), so this PR does that for FluxPipeline. Locally, I was able to see the sync point go away after this change.
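To make the mechanism concrete, here is a simplified, pure-Python sketch of the begin-index short-circuit. The `MiniScheduler` class and its `lookup_count` counter are hypothetical instrumentation for illustration; the real diffusers schedulers are more involved, but follow the same pattern of skipping `index_for_timestep()` when a begin index has been set:

```python
class MiniScheduler:
    """Hypothetical toy scheduler illustrating the begin_index short-circuit."""

    def __init__(self, timesteps):
        self.timesteps = timesteps      # descending schedule, e.g. [1000, 750, 500, 250]
        self._begin_index = None        # unset: the first step() must do a lookup
        self._step_index = None
        self.lookup_count = 0           # counts sync-prone lookups (for illustration)

    def set_begin_index(self, begin_index=0):
        # With a known starting index, the timestep lookup (and the
        # nonzero()/item() DtoH sync it would cause on GPU tensors) is skipped.
        self._begin_index = begin_index

    def index_for_timestep(self, timestep):
        # List-based stand-in for the tensor version; on GPU, nonzero()
        # followed by item() here forces a device-to-host sync.
        self.lookup_count += 1
        indices = [i for i, t in enumerate(self.timesteps) if t == timestep]
        pos = 1 if len(indices) > 1 else 0
        return indices[pos]

    def step(self, timestep):
        # Resolve the step index lazily on the first call, then just increment.
        if self._step_index is None:
            if self._begin_index is None:
                self._step_index = self.index_for_timestep(timestep)
            else:
                self._step_index = self._begin_index
        current = self._step_index
        self._step_index += 1
        return current

sched = MiniScheduler([1000, 750, 500, 250])
sched.set_begin_index(0)        # the fix from this PR: no lookup on the first step
first_index = sched.step(1000)  # → 0, with lookup_count still 0
```

Without the `set_begin_index(0)` call, the first `step()` falls back to `index_for_timestep()`, which is where the sync shows up in the trace.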

Insights welcome regarding:

  • A more robust alternative fix
  • The best way to test this change (test_pipeline_flux.py looks like a good place?)
  • Whether this fix should be generalized beyond FluxPipeline

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu June 11, 2025 18:28
@a-r-r-o-w (Member) left a comment

Thanks, this is a cool discovery! We discussed in the past the sync that occurs on the first step; it's due to the timestep lookup used to figure out the starting index. For text-to-X pipelines, the self.scheduler.set_begin_index(0) solution looks good and is probably something we can propagate to all pipelines. For other pipelines that support things like denoising strength (for example, img-to-img/vid-to-vid tasks), or more specifically any model that starts denoising at a custom timestep index instead of 0, we will still require the lookup (at least with the current design).

@yiyixuxu (Collaborator) commented Jun 11, 2025

@a-r-r-o-w

custom timesteps use set_begin_index too:

self.scheduler.set_begin_index(t_start * self.scheduler.order)
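For context, the image-to-image path computes the starting step from the denoising strength before calling set_begin_index, so the begin index is known without any tensor lookup even when it is nonzero. A hedged sketch of that arithmetic (the helper name is illustrative; it mirrors the get_timesteps() logic used in the img2img pipelines):

```python
def begin_index_for_strength(num_inference_steps, strength, order=1):
    """Illustrative helper: compute the scheduler begin index for img2img.

    With strength < 1.0, denoising starts partway into the schedule, so
    the begin index is nonzero but still computable on the host, with no
    nonzero()/item() lookup against the timestep tensor.
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start * order

# Full strength behaves like text-to-image: start at index 0.
# 50 steps at strength 0.3 skips the first 35 schedule entries.
```

This is why set_begin_index covers the custom-timestep case as well: t_start is pure host-side arithmetic, not a device lookup.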

@yiyixuxu (Collaborator) left a comment

thanks!

@yiyixuxu (Collaborator)

Yes, I agree we should apply this change to all text2image pipelines. Test-wise, as long as the current tests pass we are fine, unless we want to add an additional torch.compile-related test.

I think this fix is sufficient for all pipelines, though; if not, we can look at them case by case.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w (Member) commented Jun 11, 2025

@yiyixuxu I think there are many instances where we don't do it the same way as done in SDXL. These will incur the extra cost from nonzero().

@yiyixuxu (Collaborator)

@a-r-r-o-w yeah, I know; we can fix them, though. Unless you need the lookup on every step, it should work to set an index at the beginning.

@yiyixuxu (Collaborator)

@jbschlosser @a-r-r-o-w
I'm merging this now, as I understand it's needed pretty urgently; we can do a follow-up PR for the rest of the pipelines :)

@yiyixuxu yiyixuxu merged commit b272807 into huggingface:main Jun 11, 2025
12 checks passed
@sayakpaul (Member)

@yiyixuxu thank you very much for the treatment of this PR. Appreciate that.
