MeiGen's MultiTalk: Let them talk - Audio-driven multi-person conversational video generation

This repository contains a Cog implementation of MultiTalk, MeiGen's audio-driven multi-person conversational video generation system. This isn't just another talking head generator—MultiTalk creates realistic multi-person conversations, complete with synchronized lip movements, natural interactions, and even supports singing and cartoon characters. It's like having a virtual film studio that can bring any conversation to life.

MultiTalk takes multi-stream audio inputs, a reference image, and a text prompt, then generates videos where people actually interact with each other following the conversation flow, with precise lip synchronization that puts traditional dubbing to shame.

What makes MultiTalk special:

🎭 Multi-person conversations: Generate realistic conversations between multiple people, not just single talking heads
🎤 Perfect lip sync: Audio-driven generation with accurate lip synchronization
👥 Interactive control: Direct virtual humans through natural language prompts
🎨 Versatile characters: Works with real people, cartoon characters, and even singing performances
📺 High quality output: 480p and 720p generation at arbitrary aspect ratios
⏱️ Long-form content: Generate videos up to 15 seconds with consistent quality

Model links and information:

Original Project: MeiGen-AI/MultiTalk
Research Paper: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Project Website: meigen-ai.github.io/multi-talk
Model Weights: MeiGen-AI/MeiGen-MultiTalk
Base Model: Wan-AI/Wan2.1-I2V-14B-480P
This Cog packaging by: zsxkib on GitHub / @zsakib_ on Twitter

Prerequisites

Docker: You'll need Docker to build and run the Cog container. Install Docker.
Cog: Cog is required to build and run this model locally. Install Cog.
NVIDIA GPU: You'll need a NVIDIA GPU with at least 24GB of memory (A100, H100, or RTX 4090+ recommended) for the best performance.

Run locally: Conversations come to life

Running MultiTalk with Cog is straightforward. The system automatically handles model downloads, audio processing, and video generation—just provide your audio files and reference image, and watch as realistic conversations unfold before your eyes.

Clone this repository:

git clone https://github.com/zsxkib/cog-MultiTalk.git
cd cog-MultiTalk

Run the model: The first time you run any command, Cog will download the model weights (~30GB total), but after that initial setup, generation is fast.

Single-person talking video:

# Generate a single person speaking
cog predict \
  -i image=@person.jpg \
  -i first_audio=@speech.wav \
  -i prompt="A professional speaker giving a presentation"

# Create a singing performance
cog predict \
  -i image=@singer.jpg \
  -i first_audio=@song.wav \
  -i prompt="A talented singer performing an emotional ballad" \
  -i num_frames=161

# Cartoon character speaking
cog predict \
  -i image=@cartoon.jpg \
  -i first_audio=@dialogue.wav \
  -i prompt="An animated character telling an exciting story"

Multi-person conversations (This is where MultiTalk truly shines):

# Two people having a conversation
cog predict \
  -i image=@two_people.jpg \
  -i first_audio=@person1_speech.wav \
  -i second_audio=@person2_speech.wav \
  -i prompt="Two friends having an animated discussion about their favorite movies"

# Podcast-style conversation
cog predict \
  -i image=@podcast_setup.jpg \
  -i first_audio=@host_audio.wav \
  -i second_audio=@guest_audio.wav \
  -i prompt="A smiling man and woman wearing headphones sit in front of microphones, appearing to host a podcast" \
  -i num_frames=181

# Interview scenario
cog predict \
  -i image=@interview.jpg \
  -i first_audio=@interviewer.wav \
  -i second_audio=@interviewee.wav \
  -i prompt="A professional interview taking place in a modern office setting"

Advanced generation control:

# High-quality long-form generation
cog predict \
  -i image=@speakers.jpg \
  -i first_audio=@long_speech.wav \
  -i prompt="A confident speaker delivering an important presentation" \
  -i num_frames=201 \
  -i sampling_steps=50 \
  -i turbo=false \
  -i seed=42

# Fast generation with turbo mode
cog predict \
  -i image=@quick_demo.jpg \
  -i first_audio=@short_audio.wav \
  -i prompt="A person giving a quick demo" \
  -i sampling_steps=20 \
  -i turbo=true

# Reproducible results with fixed seed
cog predict \
  -i image=@test_subject.jpg \
  -i first_audio=@test_audio.wav \
  -i prompt="A test subject for video generation experiments" \
  -i seed=123456 \
  -i num_frames=81

How it works

This Cog implementation faithfully reproduces the original MultiTalk research pipeline with several optimizations for production use. Here's what happens under the hood:

setup() method: When the container starts up:
1. Downloads the complete MultiTalk model stack from Replicate's CDN (~30GB total):
  - Wan2.1-I2V-14B-480P: The 14 billion parameter base video generation model
  - chinese-wav2vec2-base: Audio encoder for speech feature extraction
  - MeiGen-MultiTalk: Custom audio conditioning weights trained for conversational video
2. Sets up GPU optimizations based on available memory (A100/H100 get the best performance settings)
3. Sets up the audio processing pipeline
4. Sets up the video generation pipeline
predict() method: Here's what happens:
1. Audio processing: Extracts audio from video files if needed, normalizes loudness, and handles both single and multi-person scenarios
2. Feature extraction: Uses the audio encoder to convert speech into data that captures timing and emotional content
3. Multi-person coordination: For conversations, combines multiple audio streams while keeping them aligned
4. Video generation: The 14 billion parameter model generates frames based on both the reference image and audio data
5. Sampling: Uses acceleration techniques for quality/speed balance
6. Post-processing: Combines generated video with original audio for synchronization

MultiTalk's key innovation is its ability to understand conversational dynamics—it doesn't just make mouths move, it generates natural interactions between people that follow the flow and emotional content of the conversation.

Why MultiTalk is different

Traditional talking head generators can only animate single speakers with basic lip movements. MultiTalk changes this by:

Understanding conversations: It grasps the back-and-forth nature of human dialogue and generates appropriate visual responses
Multi-person awareness: Handles complex scenarios where multiple people interact naturally
Audio-visual coherence: Creates synchronization not just of lip movements, but of facial expressions and body language that match the audio's emotional content
Versatility: Works across different types of content—serious conversations, casual chats, singing, even cartoon characters

The research shows that MultiTalk can generate up to 15-second videos with consistent character appearance and natural interaction patterns that would previously require expensive motion capture and professional video production.

Performance optimizations

This Cog implementation includes several performance optimizations:

Automatic memory detection: Optimizes settings based on your GPU's memory capacity
Turbo mode: Faster generation with optimized sampling parameters
Acceleration: Speeds up inference by 2-3x with minimal quality loss
Smart frame adjustment: Automatically corrects frame counts to valid values (4n+1 format)
GPU memory management: Efficient cleanup between runs for consistent performance

Deploy to Replicate

Want to share MultiTalk with the world? Push it to Replicate:

cog login
cog push r8.im/your-username/multitalk

License

This implementation follows the original MultiTalk project's Apache 2.0 license. The MultiTalk model and research are from MeiGen-AI.

⭐ Star this on GitHub!

👋 Follow zsakib_ on Twitter/X

Enjoying MultiTalk? Check out the original project and give the MeiGen team some love: github.com/MeiGen-AI/MultiTalk

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
wan		wan
.dockerignore		.dockerignore
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
cog.yaml		cog.yaml
generate_multitalk.py		generate_multitalk.py
predict.py		predict.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MeiGen's MultiTalk: Let them talk - Audio-driven multi-person conversational video generation

Prerequisites

Run locally: Conversations come to life

How it works

Why MultiTalk is different

Performance optimizations

Deploy to Replicate

License

About

Uh oh!

Releases

Packages

Languages

License

zsxkib/cog-MultiTalk

Folders and files

Latest commit

History

Repository files navigation

MeiGen's MultiTalk: Let them talk - Audio-driven multi-person conversational video generation

Prerequisites

Run locally: Conversations come to life

How it works

Why MultiTalk is different

Performance optimizations

Deploy to Replicate

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages