ActionParty: First Multi-Agent Video World Model Controls Seven Players Simultaneously
Researchers have introduced ActionParty, the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments, tackling a fundamental limitation of existing video diffusion models.
The Problem
Current "world model" video systems are largely restricted to single-agent settings. They fail to control multiple agents simultaneously because of an action binding problem — the model struggles to associate specific actions with their corresponding subjects.
The Solution
ActionParty introduces subject state tokens — persistent latent variables that capture the state of each subject in the scene. Combined with a spatial biasing mechanism, it disentangles:
- Global video frame rendering (the overall scene)
- Individual action-controlled subject updates (each player's behavior)
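The idea of a persistent per-subject latent can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the class name, the blending update rule, and the `alpha` parameter are all assumptions chosen for clarity.

```python
import numpy as np

class SubjectStateBank:
    """Illustrative sketch: one persistent latent vector ("state token")
    per subject. The update rule below is a hypothetical stand-in for
    whatever learned update the model actually uses."""

    def __init__(self, num_subjects: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One persistent state token per subject, carried across frames.
        self.states = rng.standard_normal((num_subjects, dim))

    def update(self, subject_id: int, action_embedding: np.ndarray,
               alpha: float = 0.5) -> np.ndarray:
        # Blend the subject's state with its action embedding, so the
        # token tracks that subject's behavior over time while the
        # other subjects' tokens are left untouched.
        self.states[subject_id] = (
            (1 - alpha) * self.states[subject_id] + alpha * action_embedding
        )
        return self.states[subject_id]

bank = SubjectStateBank(num_subjects=7, dim=4)
snapshot = bank.states.copy()
bank.update(subject_id=3, action_embedding=np.ones(4))
# Only subject 3's token changes; the other six persist unchanged.
```

The key property this models is disentanglement: an action addressed to one player perturbs only that player's latent, so the global frame renderer can consume all seven tokens without cross-contamination.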
Key Results
| Metric | Achievement |
|---|---|
| Max simultaneous players | 7 |
| Test environments | 46 (Melting Pot benchmark) |
| Action-following accuracy | Significantly improved |
| Identity consistency | Maintained through interactions |
| Autoregressive tracking | Handles complex multi-agent interactions |
How It Works
- Each subject gets persistent state tokens in latent space
- Spatial biasing mechanism routes actions to correct subjects
- Video diffusion generates frames respecting all subject states simultaneously
- Autoregressive tracking maintains identity through interactions
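The routing step above can be illustrated with a toy attention computation. This is a hedged sketch of the general idea of spatial biasing, not the paper's implementation: the function name, the additive bias value, and the boolean occupancy mask are all assumptions made for the example.

```python
import numpy as np

def spatially_biased_attention(action_q: np.ndarray, patch_k: np.ndarray,
                               subject_mask: np.ndarray,
                               bias: float = 8.0) -> np.ndarray:
    """Toy spatial biasing: boost attention logits from each subject's
    action query toward the frame patches that subject occupies.
    `subject_mask[i, p]` is True where patch p belongs to subject i.
    All names here are illustrative, not the paper's API."""
    logits = action_q @ patch_k.T                 # (subjects, patches)
    logits = logits + bias * subject_mask         # route action -> own region
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))       # action queries for 2 subjects
k = rng.standard_normal((16, 8))      # 16 frame patch keys
mask = np.zeros((2, 16), dtype=bool)
mask[0, :8] = True                    # subject 0 occupies patches 0..7
mask[1, 8:] = True                    # subject 1 occupies patches 8..15
w = spatially_biased_attention(q, k, mask)
# Each action's attention mass concentrates on its own subject's region,
# which is what prevents one player's action from steering another player.
```

In the real model the bias would be learned and the occupancy would come from tracking the subjects autoregressively, but the mechanism is the same: the bias term resolves the action binding problem by making each action attend to its own subject.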
Implications
- Gaming: AI-controlled NPCs with individual agency in shared worlds
- Simulation: Multi-agent environments for training and research
- Content creation: Complex multi-character video generation
- Robotics: Multiple robot coordination in shared spaces
Authors
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin.
Paper: arXiv:2604.02330 | Project: action-party.github.io