Ai2’s Tulu 3 405B model outperforms DeepSeek V3 and OpenAI’s GPT-4o in key AI benchmarks, marking a significant achievement for U.S.-developed, open-source AI models.
Article – Key Points
- Ai2’s breakthrough model: Ai2, a nonprofit AI research institute based in Seattle, launched its Tulu 3 405B model, which outperforms DeepSeek’s V3 and OpenAI’s GPT-4o in certain benchmarks. Tulu 3 405B builds on Meta’s Llama 3.1 405B with significant improvements via post-training using Ai2’s Reinforcement Learning with Verifiable Rewards (RLVR) technique.
- Tulu 3 405B’s performance: The model contains 405 billion parameters and required 256 GPUs running in parallel to train. Parameter count strongly influences a model’s problem-solving ability, with larger models generally performing better. Despite its size, Tulu 3 405B’s post-training pipeline is heavily optimized, which Ai2 credits for its efficiency relative to competitors.
- Open-source advantage: Tulu 3 405B is open source and permissively licensed, meaning its components (weights, data, and training code) can be freely accessed and replicated. That distinguishes it from GPT-4o, which is closed, and from DeepSeek V3, whose weights are available but whose training data and full recipe are not. This openness supports broader AI development and encourages transparency.
- Benchmark achievements: Tulu 3 405B achieved top scores on PopQA (14,000 knowledge-based questions sourced from Wikipedia) and GSM8K (a set of grade-school math problems), outperforming DeepSeek V3, GPT-4o, and Meta’s Llama 3.1 405B. It also bested DeepSeek V3 and GPT-4o in the HumanEval benchmark, showcasing its competitive edge.
- Reinforcement learning with verifiable rewards: Ai2 used RLVR to improve Tulu 3 405B’s accuracy on tasks with objectively checkable outcomes, such as math problem-solving; instead of relying on a learned reward model, the reward comes from verifying the answer itself (a minimal sketch of the idea appears after this list). The model’s results on GSM8K highlight the effectiveness of RLVR for tasks where correctness can be checked.
- Access to the model: Tulu 3 405B can be tried through Ai2’s chatbot web app, and the training code and model weights are published on GitHub and Hugging Face (a hedged loading example follows the sketch below). This accessibility lets AI developers experiment with and fine-tune the model for diverse applications.
- Model’s comparative edge: Tulu 3 405B is post-trained from Meta’s Llama 3.1 405B, and although training required substantial compute (256 GPUs in parallel), its open-source status and robust post-training process make it a strong competitor in the AI space. With 405 billion parameters versus DeepSeek V3’s 671 billion, it is roughly 40% smaller yet surpasses DeepSeek V3 on several benchmarks, underscoring the potential for efficiency in AI development.
- AI development trends: The launch of Tulu 3 405B highlights a broader trend toward open-source AI models in the U.S., aiming to challenge dominant players like OpenAI and Chinese AI firms such as DeepSeek. This shift could democratize AI development, making cutting-edge models more accessible to researchers and developers globally.
- New insights from the AI model landscape:
- Mistral Small 3: Mistral AI released Mistral Small 3, a 24B-parameter model designed for low-latency, efficient deployment. It has drawn attention for reportedly running about 3x faster than comparable models, for being small enough to run locally, and for its permissive Apache 2.0 license.
- Sakana AI’s TinySwallow-1.5B: A small Japanese language model achieving state-of-the-art performance, demonstrating the potential for small models in specific domains.
- Alibaba’s Qwen 2.5 Max: A large model that Alibaba reports outperforms DeepSeek V3 and GPT-4o on several benchmarks, underscoring advances in massive-scale AI models from non-Western firms.
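The RLVR point above is easiest to see in code: the reward signal comes from programmatically checking the model’s answer rather than from a learned reward model. The sketch below is a minimal illustration in Python, not Ai2’s implementation; the function name and the answer-extraction heuristic are assumptions for a GSM8K-style setting.

```python
import re

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final numeric answer matches the
    reference answer, 0.0 otherwise. Unlike a learned reward model, this
    check is deterministic and cannot be gamed by plausible-sounding text."""
    # Treat the last number in the completion as the final answer
    # (GSM8K-style outputs end with a numeric result). Illustrative heuristic.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth.strip() else 0.0

# During RL fine-tuning (e.g., PPO), this function would stand in for a
# reward model, so only completions with verifiably correct answers are reinforced.
print(verifiable_math_reward("...so the total is 42", "42"))  # 1.0
print(verifiable_math_reward("...the answer is 41", "42"))    # 0.0
```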
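For the access point above, the following snippet sketches how the released weights could be loaded with Hugging Face transformers. The repository id is an assumption (check Ai2’s allenai organization on Hugging Face for the exact name), and a 405B-parameter model realistically needs a multi-GPU node or quantization rather than a single consumer GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-405B"  # assumed repo id; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",            # shard layers across available GPUs
    torch_dtype=torch.bfloat16,   # reduce memory footprint
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```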
Why This Matters:
- U.S. leadership in AI: The success of Tulu 3 405B underscores the potential for the U.S. to lead in competitive, open-source AI, challenging traditional tech giants and companies like DeepSeek from China. It supports the growing movement towards open-source models that are accessible and customizable.
- Open-source shift: With open-source models like Tulu 3 405B becoming more accessible, AI development could see a shift toward greater transparency, allowing broader participation from developers and researchers worldwide. This could lead to more innovation and faster progress in AI research.
- Benchmarking for future AI: Tulu 3 405B sets a high bar for future models, providing a benchmark for competitive performance in specialized knowledge and math-related tasks. Its open-source release encourages experimentation and the adaptation of its underlying technology for new applications across industries.
The AI landscape is increasingly defined by the contrasting approaches of open source and closed source models. This article examines the nuances of each approach, exploring their benefits, challenges, and implications for businesses, developers, and the future of AI.