Thank you for open-sourcing UniWorld — it's an impressive and elegant design!
I have a question regarding your two-stage training procedure. In Section 3.2 of the paper, you mention that Stage 1 is used to align the VLM features with the FLUX text branch via a frozen setup, and that skipping proper alignment (e.g., introducing T5 features too early) tends to cause the model to converge to poor local minima.
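To make sure I'm reading Section 3.2 correctly, here is a minimal PyTorch sketch of what I understand the Stage 1 setup to be; the module names (`vlm`, `projector`, `flux`) and the optimizer settings are placeholders I made up, not the repo's actual identifiers:

```python
import torch
from torch import nn

def configure_stage1(vlm: nn.Module, projector: nn.Module, flux: nn.Module):
    """Stage 1 as I understand it: the VLM and FLUX stay frozen, and only the
    projector that maps VLM features into the FLUX text branch is trained."""
    for p in vlm.parameters():
        p.requires_grad = False
    for p in flux.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    # Placeholder optimizer and learning rate, not the values used in the paper.
    return torch.optim.AdamW(projector.parameters(), lr=1e-4)
```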
I’m curious:
- Have you tried skipping Stage 1 and training the entire architecture directly in the Stage 2 setup (i.e., unfreezing all parameters from the start)?
- Would such a strategy lead to divergence or collapse (e.g., NaNs, trivial solutions, or failure to use the SigLIP features effectively)?
- Or would the model still converge, albeit to a suboptimal solution?
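Concretely, the ablation I have in mind would look roughly like the sketch below (same placeholder module names as above; the learning rate is arbitrary):

```python
import torch
from torch import nn

def configure_single_stage(vlm: nn.Module, projector: nn.Module, flux: nn.Module):
    """Proposed ablation: no Stage 1 alignment phase; every parameter is
    trainable from the very first optimization step."""
    params = []
    for module in (vlm, projector, flux):
        for p in module.parameters():
            p.requires_grad = True
        params += list(module.parameters())
    # Placeholder optimizer and learning rate, not a recommendation.
    return torch.optim.AdamW(params, lr=1e-5)
```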
Understanding this would be very helpful for ablation studies and for potentially simplifying the training pipeline.