Thank you for open-sourcing UniWorld — it's an impressive and elegant design!
I have a question regarding your two-stage training procedure. In Section 3.2 of the paper, you mention that Stage 1 is used to align the VLM features with the FLUX text branch via a frozen setup, and that skipping proper alignment (e.g., introducing T5 features too early) tends to cause the model to converge to poor local minima.
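To make sure I'm reading Section 3.2 correctly, here is a minimal PyTorch sketch of what I understand the Stage 1 setup to be; the module names (`vlm`, `projector`, `flux`) and the optimizer settings are placeholders I made up, not the repo's actual identifiers:

```python
import torch
from torch import nn

def configure_stage1(vlm: nn.Module, projector: nn.Module, flux: nn.Module):
    """Stage 1 as I understand it: the VLM and FLUX stay frozen, and only the
    projector that maps VLM features into the FLUX text branch is trained."""
    for p in vlm.parameters():
        p.requires_grad = False
    for p in flux.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    # Placeholder optimizer and learning rate, not the values used in the paper.
    return torch.optim.AdamW(projector.parameters(), lr=1e-4)
```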
I’m curious:
- Have you tried skipping Stage 1 and training the entire architecture directly in the Stage 2 setup (i.e., unfreezing all parameters from the start)?
- Would such a strategy lead to divergence or collapse (e.g., NaNs, trivial solutions, or failure to use the SigLIP features effectively)?
- Or would the model still converge, albeit to a suboptimal solution?
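Concretely, the ablation I have in mind would look roughly like the sketch below (same placeholder module names as above; the learning rate is arbitrary):

```python
import torch
from torch import nn

def configure_single_stage(vlm: nn.Module, projector: nn.Module, flux: nn.Module):
    """Proposed ablation: no Stage 1 alignment phase; every parameter is
    trainable from the very first optimization step."""
    params = []
    for module in (vlm, projector, flux):
        for p in module.parameters():
            p.requires_grad = True
        params += list(module.parameters())
    # Placeholder optimizer and learning rate, not a recommendation.
    return torch.optim.AdamW(params, lr=1e-5)
```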
Understanding this would be very helpful for ablation studies and for potentially simplifying the training pipeline.