What happens if Stage 1 pretraining is skipped and training starts directly from Stage 2? #9

@wyhlovecpp

Description

Thank you for open-sourcing UniWorld — it's an impressive and elegant design!

I have a question regarding your two-stage training procedure. In Section 3.2 of the paper, you mention that Stage 1 is used to align the VLM features with the FLUX text branch via a frozen setup, and that skipping proper alignment (e.g., introducing T5 features too early) tends to cause the model to converge to poor local minima.

I’m curious:
Have you tried skipping Stage 1 and training the entire architecture directly from Stage 2 (i.e., unfreezing all parameters from the start)? Would such a strategy lead to divergence or collapse (e.g., NaNs, trivial solutions, or failure to use SigLIP features effectively)? Or would the model still converge, albeit to a suboptimal solution?
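For concreteness, here is a minimal PyTorch-style sketch of the two configurations I am contrasting. The module names (`vlm`, `connector`, `flux`) are hypothetical placeholders for illustration only, not the actual UniWorld implementation:

```python
# Hypothetical sketch of the two training configurations being compared.
# Module names (vlm, connector, flux) are placeholders, not UniWorld's real code.
import torch


def configure_stage1(model: torch.nn.Module) -> None:
    """Stage 1 as I understand it: freeze everything except the connector
    that aligns VLM features with the FLUX text branch."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.connector.parameters():  # hypothetical alignment module
        p.requires_grad = True


def configure_single_stage(model: torch.nn.Module) -> None:
    """The proposed ablation: skip Stage 1 and unfreeze all parameters
    (VLM, connector, and FLUX) from the very first training step."""
    for p in model.parameters():
        p.requires_grad = True


# Both setups would then train with the same optimizer over trainable params, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```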

Understanding this would be very helpful for ablation or simplification purposes.

Thanks again for your great work!
