Key Takeaway
Meta introduces SAM 3, a next-generation system that links language and vision through open-vocabulary prompts, and SAM 3D, which turns single images into 3D reconstructions. Together they push toward flexible multimodal perception and practical 3D understanding for both researchers and creators, accessible through open resources and the Segment Anything Playground.
Meta Releases SAM 3 and SAM 3D – Key Points
SAM 3 expands segmentation into open vocabulary (November 2025)
SAM 3 interprets free-form text (“striped red umbrella”) or example images to find matching concepts in photos and videos, moving beyond fixed label sets. It accepts noun phrases, reference images, and visual prompts (masks, boxes, points), and works as a unified system for detection, segmentation, and tracking. Meta releases checkpoints, training resources, and benchmarks, plus the Segment Anything Playground, where anyone can upload media, test SAM 3, and apply creative effects like spotlights, motion trails, and pixelation.
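Meta's exact developer interface is not reproduced here, but conceptually a prompt-driven call could look like the following sketch; the `sam3` package, `Sam3Model` class, checkpoint name, and method signatures are all hypothetical placeholders, not Meta's published API.

```python
# Hypothetical sketch of open-vocabulary prompting; the package, class,
# checkpoint id, and method names are illustrative, not Meta's released API.
from PIL import Image

from sam3 import Sam3Model  # hypothetical package

model = Sam3Model.from_pretrained("sam3-base")   # assumed checkpoint name
image = Image.open("street_scene.jpg")

# Noun-phrase prompt: return one mask per matching instance.
masks = model.segment(image, text="striped red umbrella")

# Visual prompts can be combined with text: a box or point narrows the
# search, while an exemplar image generalizes it to visually similar objects.
masks = model.segment(
    image,
    text="umbrella",
    boxes=[(120, 40, 380, 300)],  # (x0, y0, x1, y1) region of interest
)

for i, mask in enumerate(masks):
    print(f"instance {i}: {mask.sum()} pixels")
```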
Performance surpasses prior vision models on Meta’s new SA-Co benchmark
On Meta’s internal “Segment Anything with Concepts” (SA-Co) benchmark, SAM 3 roughly doubles the performance of earlier open-vocabulary segmenters and beats models such as OWLv2, GLEE, and Gemini 2.5 Pro. Charts show SAM 3 close to, but still below, human performance on concept segmentation and counting, underlining strong gains yet leaving space for deeper reasoning and finer category distinctions.
Hybrid human–AI training pipeline accelerates dataset creation
Meta’s “data engine” combines SAM 3, Llama-based captioners, and human reviewers. The AI proposes candidate masks, which annotators verify or correct, making annotation roughly 5× faster for negative prompts (concepts absent from an image) and about 36% faster for positive ones. The result is a training set covering more than 4 million concepts, improving generalization. The loop resembles large language model training: better models generate better candidates, which feed back into the engine, scaling quality and volume without proportionally increasing expert labor.
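In rough pseudocode, the loop Meta describes might be organized like this; every class and function name below is a placeholder for Meta's internal tooling, shown only to make the feedback cycle concrete.

```python
# Rough sketch of a model-in-the-loop data engine. Every name here is a
# placeholder for Meta's internal tooling; the point is the feedback cycle.

def data_engine_round(model, captioner, reviewer, unlabeled_images, dataset):
    for image in unlabeled_images:
        # 1. A Llama-based captioner proposes candidate concepts (noun phrases).
        for concept in captioner.propose_concepts(image):
            # 2. The current model proposes masks for each concept.
            candidates = model.segment(image, text=concept)
            # 3. Humans (or an AI verifier for easy cases) accept, fix, or
            #    reject the proposals instead of annotating from scratch.
            verified = reviewer.review(image, concept, candidates)
            # 4. Both positives and hard negatives ("concept absent") are kept.
            dataset.add(image, concept, verified)
    # 5. Retraining on the enlarged dataset makes the next round's proposals
    #    better and cheaper to verify.
    model.train_on(dataset)
    return model
```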
Early commercial integrations across Meta products
SAM 3 already powers “View in Room” on Facebook Marketplace and will enable new effects in Meta’s creative apps, including Instagram’s Edits and Vibes. Running on Nvidia H200 GPUs, it can segment 100+ objects in a single image in about 30 ms and track around five objects in near real-time video. Marketplace now combines SAM 3 and SAM 3D so buyers can preview both style and 3D fit of home decor in their own rooms. For developers, the Playground doubles as a test environment for annotation and stress tests, while downloadable datasets and code support deeper experiments and fine-tuning.
Known limitations and the SAM 3 Agent concept
SAM 3 struggles with narrow technical vocabularies (for example in medical imaging) and complex logical instructions (“second to last book on the top shelf”). Meta’s answer is a “SAM 3 Agent” that pairs SAM 3 with multimodal language models like Llama or Gemini: the language model parses and plans, while SAM 3 performs precise, pixel-level segmentation, aiming to turn natural-language tasks into reliable visual actions.
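Meta has not published the agent's interface, but the division of labor could be sketched roughly as follows, with a hypothetical LLM client and SAM wrapper standing in for the real components.

```python
# Illustrative sketch of the "SAM 3 Agent" idea: a multimodal LLM plans,
# SAM 3 segments, and the LLM checks the result. All interfaces are hypothetical.

def segment_with_agent(llm, sam, image, instruction):
    # The language model decomposes a complex instruction
    # ("second to last book on the top shelf") into simple noun-phrase
    # queries that SAM 3 can handle directly.
    queries = llm.plan_queries(image, instruction)      # e.g. ["book", "shelf"]

    candidates = {q: sam.segment(image, text=q) for q in queries}

    # The LLM then reasons over the candidate masks (counting, ordering,
    # spatial relations) and returns the ones satisfying the instruction.
    return llm.select_masks(instruction, candidates)
```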
SAM 3D extends SAM into 3D reconstruction from single images
SAM 3D introduces SAM 3D Objects and SAM 3D Body to generate 3D meshes from 2D photos and add “common sense” 3D understanding to everyday images. SAM 3D Objects reconstructs objects and scenes from a single masked view, estimating geometry, texture, and layout, and exports standard formats such as .ply or .obj for use in robotics, games, AR/VR, and interactive media. Because high-quality 3D data is scarce, Meta reuses its data engine: annotators rank multiple candidate meshes and route difficult cases to expert 3D artists. This pipeline has annotated nearly 1 million real-world images and produced about 3.14 million model-in-the-loop meshes. Meta is also preparing the SAM 3D Artist Objects dataset (SA-3DAO), a benchmark of paired photos and meshes tailored to real-world 3D reconstruction, moving beyond synthetic or staged data.
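As a purely illustrative sketch, reconstruction and export might be wired up like this; the `sam3d` package, class, and checkpoint name are assumptions, while the .ply/.obj export targets come from the announcement.

```python
# Hypothetical sketch of single-image 3D reconstruction and export; the
# sam3d package, class, and checkpoint id are illustrative placeholders.
from PIL import Image

from sam3d import Sam3DObjects  # hypothetical package

model = Sam3DObjects.from_pretrained("sam3d-objects")  # assumed checkpoint name
image = Image.open("armchair.jpg")
mask = ...  # 2D mask of the target object, e.g. produced by SAM 3

# Estimate geometry and texture for the masked object.
mesh = model.reconstruct(image, mask=mask)

# Standard formats keep the result usable in DCC tools, engines, and viewers.
mesh.export("armchair.ply")
mesh.export("armchair.obj")
```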
SAM 3D Body and the Meta Momentum Human Rig (MHR)
SAM 3D Body is trained on about 8 million images chosen from billions, including diverse photos, multi-camera captures, and synthetic content. It uses the open-source Meta Momentum Human Rig (MHR), which separates the skeleton from soft-tissue shape for clearer, more reusable human meshes. The architecture combines a transformer encoder–decoder with a multi-input encoder for fine details and a prompt-aware mesh decoder that can use segmentation masks and 2D keypoints. The system stays robust under occlusions, rare poses, and varied clothing, and outperforms earlier full-body methods on several benchmarks. MHR, already supporting Meta’s Codec Avatars, is licensed for commercial use, positioning SAM 3D Body as a core tool for sports medicine, virtual fashion, telepresence, and any application needing realistic body mechanics.
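The details of MHR's parameterization are beyond this article, but the core idea of keeping skeleton pose and soft-tissue shape as separate inputs can be illustrated with a generic parametric body model along these lines; the structure below is a simplified stand-in, not MHR itself.

```python
# Minimal sketch of a skeleton/shape-decoupled body representation, in the
# spirit of (but not identical to) MHR: pose and soft-tissue shape remain
# separate, reusable inputs to the mesh decoder.
from dataclasses import dataclass

import numpy as np

@dataclass
class BodyParams:
    pose: np.ndarray   # per-joint rotations driving the skeleton, shape (J, 3)
    shape: np.ndarray  # identity / soft-tissue coefficients, shape (K,)

def decode_vertices(params: BodyParams,
                    template: np.ndarray,     # rest-pose mesh, shape (V, 3)
                    shape_basis: np.ndarray,  # shape blend shapes, (K, V, 3)
                    ) -> np.ndarray:
    # 1. Shape coefficients deform the template (identity, body type).
    vertices = template + np.tensordot(params.shape, shape_basis, axes=1)
    # 2. Skeletal pose is applied afterwards via skinning (omitted here),
    #    so one shape can be re-posed and one pose re-shaped independently.
    return vertices
```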
Current constraints of 3D models
SAM 3D still outputs moderate-resolution meshes, which blurs fine details in complex shapes and full-body reconstructions. SAM 3D Objects predicts objects individually and does not yet model multi-object physics such as contact, stacking, or collisions. SAM 3D Body treats each person separately, limiting its grasp of multi-person and human–object interactions in crowded scenes. Hand pose remains weaker than in specialized hand-only models. Meta plans to raise output resolution, add objectives and architectures that jointly reason over sets of objects or people, and refine hand and fine-structure modeling. At the same time, cloud-based 3D digital asset platforms that convert models into glTF or USDZ and stream them into engines like Unity, Unreal, or WebXR are becoming key companions to SAM 3D, managing storage, optimization, and delivery of the growing volume of 3D content.
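On the delivery side, a minimal local step is converting the raw mesh into a runtime-friendly format; the snippet below uses the open-source trimesh library to re-save an .obj as binary glTF (GLB), with file names as placeholders (USDZ generally requires separate tooling such as Apple's converters).

```python
# One possible local step before handing meshes to an asset pipeline:
# re-save a SAM 3D-style .obj/.ply export as binary glTF (GLB) for Unity,
# Unreal, or WebXR. Uses the open-source trimesh library; file names are
# placeholders.
import trimesh

mesh = trimesh.load("armchair.obj")   # a .ply file loads the same way
mesh.export("armchair.glb")           # target format inferred from extension
```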
SAM 2 context: the evolution that led to SAM 3
SAM 2, released earlier as open source, was trained on the SA-V dataset: 50,900 videos with 642,600 masklet (object-track) annotations totaling 35.5 million individual masks over about 200 hours of footage. It added a memory module for long-sequence tracking, improving robustness across cuts and occlusions and boosting speed and accuracy on video benchmarks. Together with its “SAM-in-the-loop” annotation system, SAM 2 laid the foundations for SAM 3’s open-vocabulary behavior and SAM 3D’s physically grounded reconstructions, marking a clear path from 2D segmentation to richer, promptable 3D perception.
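As a simplified illustration of that memory idea (not Meta's implementation), a tracking loop with a bounded memory bank could look like the following, with all model methods hypothetical.

```python
# Simplified illustration of memory-based video tracking in the spirit of
# SAM 2; this is not Meta's implementation, and all model methods are
# hypothetical. The memory bank is what keeps object identity stable
# across occlusions and shot cuts.

def track_object(model, frames, first_frame_prompt, memory_size=8):
    mask = model.segment(frames[0], prompt=first_frame_prompt)
    memory = [model.encode_memory(frames[0], mask)]
    tracked = [mask]

    for frame in frames[1:]:
        # Attend over recent (frame, mask) memories instead of re-detecting
        # the object from scratch on every frame.
        mask = model.segment_with_memory(frame, memory)
        memory.append(model.encode_memory(frame, mask))
        memory = memory[-memory_size:]   # bounded memory bank
        tracked.append(mask)
    return tracked
```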
Why This Matters
SAM 3’s open-vocabulary segmentation and SAM 3D’s single-image 3D reconstruction push AI toward a unified view of the visual world, across both flat images and spatial scenes. The hybrid data engine shows how to build much larger, higher-quality datasets without prohibitive cost, while benchmarks like SA-3DAO anchor progress in real-world conditions. For creators and developers, the combination of the Segment Anything Playground, open checkpoints, and modern 3D asset pipelines means a single photo can become an optimized 3D model ready for games, AR/VR, or the web in seconds. These advances strengthen Meta’s role in multimodal AI and unlock new possibilities in AR/VR, robotics, gaming, film, sports medicine, interactive media, e-commerce, and creator tools, where accurate 3D perception and controllable segmentation are fast becoming core infrastructure.
This article was drafted with the assistance of generative AI. All facts and details were reviewed and confirmed by an editor prior to publication.