TAEM1 is a Tiny AutoEncoder for the Mochi 1 (Preview) video generation model. TAEM1 should be able to encode/decode Mochi's latents more cheaply than the full-size Mochi VAE (at the cost of slightly lower quality). This means TAEM1 should be useful for previewing outputs from Mochi 1.
| Sample Video | Reconstruction with TAEM1 |
| --- | --- |
TAEM1 consists of an MSE-distilled encoder and an MSE+adversarial-distilled decoder, both trained to mimic the behavior of the full-size Mochi 1 VAE.
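As a rough illustration of those objectives (a sketch, not the actual TAEM1 training code), the encoder is assumed to be distilled against the teacher VAE's latents and the decoder against its reconstructions; the discriminator `D`, the teacher interface, and the loss weight are all hypothetical:

```python
# Hypothetical distillation sketch -- NOT the actual TAEM1 training code.
# Assumes a teacher VAE exposing .encode()/.decode() and a discriminator D.
import torch
import torch.nn.functional as F

def encoder_distill_loss(student_enc, teacher_vae, video):
    # MSE between student latents and teacher latents (assumed target).
    with torch.no_grad():
        target = teacher_vae.encode(video)
    return F.mse_loss(student_enc(video), target)

def decoder_distill_loss(student_dec, teacher_vae, latents, D, adv_weight=0.1):
    # MSE between student and teacher reconstructions, plus a hinge-style
    # generator term for sharper detail (adv_weight is a guess).
    fake = student_dec(latents)
    with torch.no_grad():
        real = teacher_vae.decode(latents)
    return F.mse_loss(fake, real) - adv_weight * D(fake).mean()
```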
TAEM1 has the `vae_latents_to_dit_latents` and `dit_latents_to_vae_latents` transforms baked in, and consumes/produces [0, 1]-scaled images, so TAEM1 shouldn't require much additional code to use.
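Usage should look roughly like the sketch below (the `TAEM1` class name, the `encoder`/`decoder` attributes, the constructor, and the tensor shapes are assumptions based on this README, not a confirmed API):

```python
# Minimal roundtrip sketch (hypothetical API; see taem1.py for the real one).
import torch
from taem1 import TAEM1  # assumed module/class name

taem1 = TAEM1().to("cuda").eval()  # weight loading details omitted/assumed

# A [0, 1]-scaled video: (batch, channels, frames, height, width); the
# shape here is illustrative, not Mochi's native resolution.
video = torch.rand(1, 3, 25, 256, 256, device="cuda")

with torch.no_grad():
    # vae_latents_to_dit_latents is baked in, so these latents should be
    # directly usable as Mochi DiT latents.
    latents = taem1.encoder(video)
    # dit_latents_to_vae_latents is likewise baked in; output is [0, 1].
    recon = taem1.decoder(latents).clamp(0, 1)
```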
TAEM1 is causal (like the Mochi 1 VAE), so you can run it either timestep-parallel (faster, higher memory usage) or timestep-sequential (slower, lower memory usage).
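An illustrative comparison of the two modes, assuming a decoder callable `dec` over `(batch, channels, frames, height, width)` latents; note the sequential path only matches the parallel path if the decoder caches its causal state between calls, which is assumed here (the real `taem1.py` entry points may differ):

```python
import torch

@torch.no_grad()
def decode_parallel(dec, latents):
    # Timestep-parallel: one call over all latent frames.
    # Fastest, but peak memory grows with the number of frames.
    return dec(latents)

@torch.no_grad()
def decode_sequential(dec, latents):
    # Timestep-sequential: one latent frame per call, lower peak memory.
    # Matches decode_parallel only if dec carries its causal state across
    # calls (assumed here, as a streaming implementation would).
    frames = [dec(latents[:, :, t:t+1]) for t in range(latents.shape[2])]
    return torch.cat(frames, dim=2)
```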
You can try running `python3 taem1.py test_video.mp4` to test reconstruction. Mochi T2V demo notebook TBD.