Meta’s V-JEPA 2 is a 1.2B-parameter, open-source AI world model trained on over 1 million hours of video. It enables robots and AI agents to understand, predict, and plan physical interactions in unfamiliar real-world environments. With reported speed gains over rivals such as Nvidia’s Cosmos and a set of new standardized benchmarks for physical reasoning, V-JEPA 2 advances Meta’s pursuit of Advanced Machine Intelligence (AMI): AI that sees, reasons, and acts more like humans.
Meta Unveils V-JEPA 2 – Key Points
Launch of V-JEPA 2: A Next-Gen World Model
Released on June 11, 2025, Meta’s V-JEPA 2 (Video Joint Embedding Predictive Architecture 2) is described as its most advanced world model yet. Trained on over 1 million hours of video and 1 million images, it learns how objects and people interact, allowing AI to anticipate physical events in the same intuitive way that humans and animals do—such as predicting where a bouncing ball will land.
Self-Supervised, Two-Phase Training for Generalization
V-JEPA 2 is trained in two phases:
Actionless pre-training: Learning from raw video to understand motion, object interaction, and causality.
Action-conditioned fine-tuning: Using just 62 hours of robot control data, V-JEPA 2 learns to model outcomes based on specific actions.
This enables zero-shot robot planning in unseen environments using only visual inputs—no retraining required per robot or task.
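A minimal sketch of this two-phase recipe helps make it concrete. Everything below (module names, dimensions, the stop-gradient latent-prediction loss) is an illustrative assumption based on the description above and the JEPA family's general approach, not Meta's released training code.

```python
# Illustrative two-phase training sketch (all names/dims assumed; this is
# not Meta's code). Phase 1 learns to predict future *embeddings* from
# video alone; phase 2 conditions the predictor on robot actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, ACTION_DIM = 256, 7          # latent and action sizes (assumed)

class Encoder(nn.Module):
    """Stand-in video encoder: maps a 64x64 RGB frame to a latent state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, DIM))
    def forward(self, frames):                     # (B, 3, 64, 64) -> (B, DIM)
        return self.net(frames)

class Predictor(nn.Module):
    """Forecasts the next latent state, optionally given an action."""
    def __init__(self, action_dim=0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + action_dim, DIM),
                                 nn.GELU(), nn.Linear(DIM, DIM))
    def forward(self, z, action=None):
        x = z if action is None else torch.cat([z, action], dim=-1)
        return self.net(x)

encoder, predictor = Encoder(), Predictor()
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def phase1_step(frame_t, frame_next):
    """Actionless pre-training: match the embedding of a future frame."""
    z_t = encoder(frame_t)
    with torch.no_grad():                          # stop-gradient target
        z_next = encoder(frame_next)
    loss = F.mse_loss(predictor(z_t), z_next)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Phase 2: keep the encoder frozen and train an action-conditioned
# predictor on a small robot dataset (the article cites ~62 hours).
predictor_ac = Predictor(action_dim=ACTION_DIM)
opt_ac = torch.optim.AdamW(predictor_ac.parameters(), lr=1e-4)

def phase2_step(frame_t, action, frame_next):
    """Action-conditioned fine-tuning: same loss, now given the action."""
    with torch.no_grad():
        z_t, z_next = encoder(frame_t), encoder(frame_next)
    loss = F.mse_loss(predictor_ac(z_t, action), z_next)
    opt_ac.zero_grad(); loss.backward(); opt_ac.step()
    return loss.item()
```

The key design point the article highlights is data efficiency: the expensive phase needs no action labels at all, so only a small, cheap robot dataset is required for the action-conditioned phase.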
Core Architecture: Encoder + Predictor for Forward Simulation
The encoder transforms video into semantic embeddings of the world state, while the predictor forecasts future states based on current conditions and candidate actions. The system supports both short-horizon (e.g., placing objects) and long-horizon tasks (e.g., sequencing actions via subgoals), achieving 65–80% success rates in lab tests with unfamiliar objects.
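This split makes forward simulation cheap: candidate action sequences can be rolled out entirely in latent space and scored against a goal embedding, without ever generating pixels. A minimal sketch under the same assumed interfaces as above (illustrative, not Meta's API):

```python
# Latent-space forward simulation (assumed interfaces, not Meta's API):
# the predictor is applied recursively, so no frames are ever decoded.
import torch

def rollout(predictor, z0, actions):
    """z_{t+1} = predictor(z_t, a_t); returns the full latent trajectory."""
    z, trajectory = z0, [z0]
    for a in actions:                  # actions: iterable of (B, action_dim)
        z = predictor(z, a)
        trajectory.append(z)
    return trajectory

def plan_score(trajectory, z_goal):
    """Lower is better: distance of the final simulated state to the goal."""
    return torch.norm(trajectory[-1] - z_goal, dim=-1)
```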
Advanced Performance & Speed vs. Competition
Meta reports that V-JEPA 2 is 30x faster than Nvidia’s Cosmos in physical reasoning tasks. It achieves state-of-the-art results on:
Something-Something v2 (action recognition)
Epic-Kitchens-100 (action anticipation)
Perception Test & TempCompass (video Q&A)
These results support V-JEPA 2’s broader applicability in robotics and embodied AI.
New Benchmarks for Evaluating Physical Reasoning
Meta released three standardized benchmarks for the community:
IntPhys 2: Measures whether AI can detect physics violations (e.g., implausible object behavior).
MVPBench: Uses paired video Q&A to test consistency and avoid shortcut learning.
CausalVQA: Assesses causal inference, counterfactual reasoning, and future event prediction.
These tests show humans still outperform AI, highlighting the remaining gap in intuitive physics understanding.
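To make the pairing idea concrete, here is a toy scoring function in the spirit of MVPBench; the all-or-nothing pair rule is an assumption inferred from the description above, not the benchmark's published code.

```python
# Toy paired-accuracy metric in the spirit of MVPBench (scoring rule
# assumed from the description): a pair counts only if BOTH of its
# minimally different questions are answered correctly, so shortcuts
# that ignore the visual change earn no credit.
def paired_accuracy(predictions, answers):
    """predictions, answers: lists of (answer_a, answer_b) tuples."""
    if not answers:
        return 0.0
    correct = sum(1 for pred, gold in zip(predictions, answers) if pred == gold)
    return correct / len(answers)

preds = [("yes", "no"), ("yes", "yes")]   # second pair is only half right
truth = [("yes", "no"), ("yes", "no")]
print(paired_accuracy(preds, truth))      # 0.5: the half-right pair scores 0
```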
Real-World Robotic Testing at Meta
Meta has successfully tested V-JEPA 2 on lab robots performing pick-and-place tasks with previously unseen objects in unfamiliar settings. The robots are given a visual goal image and re-plan their actions step by step using model-predictive control. These tests indicate that V-JEPA 2 can give deployed robots human-like foresight without environment-specific training.
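A hedged sketch of that re-planning loop as a random-shooting model-predictive controller follows; the planner type, shapes, and interfaces are illustrative assumptions based on the description above, not Meta's released planner.

```python
# Goal-image MPC sketch (random-shooting; all interfaces assumed, not
# Meta's planner). Each control step: simulate many candidate action
# sequences in latent space, execute only the first action of the best
# one, then re-observe and re-plan.
import torch

def plan_action(encoder, predictor, obs, goal_image,
                horizon=5, n_candidates=256, action_dim=7):
    """obs/goal_image: (1, 3, H, W) tensors; returns one action to execute."""
    with torch.no_grad():
        z = encoder(obs)                              # current latent state
        z_goal = encoder(goal_image)                  # desired latent state
        candidates = torch.randn(n_candidates, horizon, action_dim)
        z_sim = z.expand(n_candidates, -1)            # simulate all in parallel
        for t in range(horizon):
            z_sim = predictor(z_sim, candidates[:, t])
        scores = torch.norm(z_sim - z_goal, dim=-1)   # distance to the goal
    return candidates[scores.argmin(), 0]             # first action of best plan
```

Executing only the first action and then re-planning from a fresh observation is what produces the step-by-step, closed-loop behavior described above.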
Strategic Context: Meta’s AMI Vision and Open Access
V-JEPA 2 plays a central role in Meta’s goal to build Advanced Machine Intelligence (AMI)—AI that doesn’t just analyze data but learns dynamically from the world. The model, code, and training tools are publicly available via GitHub, Hugging Face, and Meta AI, reinforcing its support for open-source research collaboration.
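For readers who want to inspect the public weights, a minimal loading sketch with the Hugging Face `transformers` AutoModel API follows; the repository id is an assumption for illustration, so check Meta's GitHub and Hugging Face pages for the published checkpoint names.

```python
# Minimal sketch of loading a public checkpoint via Hugging Face
# transformers. The repo id below is assumed for illustration; see
# Meta's GitHub / Hugging Face pages for the exact published names.
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/vjepa2-vitl-fp16")  # assumed id
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```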
Future Research: Multimodal and Hierarchical Models
Meta plans to develop hierarchical world models capable of planning across multiple time scales (e.g., complex tasks like cooking). Future versions aim to integrate multimodal inputs—including vision, touch, and sound—to further close the gap between AI and human-like reasoning.
Broader Industry Landscape
Meta joins a growing race alongside Google DeepMind (Genie), Fei-Fei Li’s World Labs, and Nvidia Cosmos to build general-purpose world models. V-JEPA 2 differentiates itself through speed, zero-shot performance, and a commitment to shared benchmarks—driving the field closer to real-world autonomous intelligence.
Why This Matters:
V-JEPA 2 advances the frontier of AI by replicating a critical human trait: understanding how the world responds to actions. Its ability to plan and interact in real-world environments—without extensive data or retraining—lays the foundation for scalable robotics, personal assistants, and intelligent systems across sectors. Combined with Meta’s open-source strategy and benchmark standardization, V-JEPA 2 is helping transform physical AI from experimental labs into everyday tools that can reason, react, and adapt like humans.