Ai2’s Tulu 3 405B model outperforms DeepSeek V3 and OpenAI’s GPT-4o in key AI benchmarks, marking a significant achievement for U.S.-developed, open-source AI models.
Article – Key Points
- Ai2’s breakthrough model: Ai2, a nonprofit AI research institute based in Seattle, launched its Tulu 3 405B model, which outperforms DeepSeek’s V3 and OpenAI’s GPT-4o in certain benchmarks. Tulu 3 405B builds on Meta’s Llama 3.1 405B with significant improvements via post-training using Ai2’s Reinforcement Learning with Verifiable Rewards (RLVR) technique.
- Tulu 3 405B’s performance: The model contains 405 billion parameters and required 256 GPUs running in parallel to train. Parameter count strongly influences a model’s problem-solving ability, with larger models generally performing better. Despite its size, Tulu 3 405B’s post-training pipeline is heavily optimized, which Ai2 credits for its efficiency relative to competitors.
- Open-source advantage: Tulu 3 405B is open source and permissively licensed, meaning its components (weights, data, and training code) can be freely accessed and replicated. That distinguishes it from GPT-4o, which is closed, and from DeepSeek V3, whose weights are available but whose training data and full recipe are not. This openness supports broader AI development and encourages transparency.
- Benchmark achievements: Tulu 3 405B achieved top scores on PopQA (14,000 knowledge-based questions sourced from Wikipedia) and GSM8K (a set of grade-school math problems), outperforming DeepSeek V3, GPT-4o, and Meta’s Llama 3.1 405B. It also bested DeepSeek V3 and GPT-4o in the HumanEval benchmark, showcasing its competitive edge.
- Reinforcement learning with verifiable rewards: Ai2 used RLVR to improve Tulu 3 405B’s accuracy on tasks with objectively checkable outcomes, such as math problem-solving; instead of relying on a learned reward model, the reward comes from verifying the answer itself (a minimal sketch of the idea appears after this list). The model’s results on GSM8K highlight the effectiveness of RLVR for tasks where correctness can be checked.
- Access to the model: Tulu 3 405B can be tried through Ai2’s chatbot web app, and the training code and model weights are published on GitHub and Hugging Face (a hedged loading example follows the sketch below). This accessibility lets AI developers experiment with and fine-tune the model for diverse applications.
- Model’s comparative edge: Tulu 3 405B is post-trained from Meta’s Llama 3.1 405B, and although training required substantial compute (256 GPUs in parallel), its open-source status and robust post-training process make it a strong competitor in the AI space. With 405 billion parameters versus DeepSeek V3’s 671 billion, it is roughly 40% smaller yet surpasses DeepSeek V3 on several benchmarks, underscoring the potential for efficiency in AI development.
- AI development trends: The launch of Tulu 3 405B highlights a broader trend toward open-source AI models in the U.S., aiming to challenge dominant players like OpenAI and Chinese AI firms such as DeepSeek. This shift could democratize AI development, making cutting-edge models more accessible to researchers and developers globally.
- New insights from the AI model landscape:
- Mistral Small 3: Mistral AI released Mistral Small 3, a 24B-parameter model designed for low-latency, efficient deployment. It has drawn attention for reportedly running about 3x faster than comparable models, for being small enough to run locally, and for its permissive Apache 2.0 license.
- Sakana AI’s TinySwallow-1.5B: A small Japanese language model achieving state-of-the-art performance, demonstrating the potential for small models in specific domains.
- Alibaba’s Qwen 2.5 Max: A large model that Alibaba reports outperforms DeepSeek V3 and GPT-4o on several benchmarks, underscoring advances in massive-scale AI models from non-Western firms.
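The RLVR point above is easiest to see in code: the reward signal comes from programmatically checking the model’s answer rather than from a learned reward model. The sketch below is a minimal illustration in Python, not Ai2’s implementation; the function name and the answer-extraction heuristic are assumptions for a GSM8K-style setting.

```python
import re

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final numeric answer matches the
    reference answer, 0.0 otherwise. Unlike a learned reward model, this
    check is deterministic and cannot be gamed by plausible-sounding text."""
    # Treat the last number in the completion as the final answer
    # (GSM8K-style outputs end with a numeric result). Illustrative heuristic.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth.strip() else 0.0

# During RL fine-tuning (e.g., PPO), this function would stand in for a
# reward model, so only completions with verifiably correct answers are reinforced.
print(verifiable_math_reward("...so the total is 42", "42"))  # 1.0
print(verifiable_math_reward("...the answer is 41", "42"))    # 0.0
```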
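For the access point above, the following snippet sketches how the released weights could be loaded with Hugging Face transformers. The repository id is an assumption (check Ai2’s allenai organization on Hugging Face for the exact name), and a 405B-parameter model realistically needs a multi-GPU node or quantization rather than a single consumer GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-405B"  # assumed repo id; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",            # shard layers across available GPUs
    torch_dtype=torch.bfloat16,   # reduce memory footprint
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```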
Why This Matters:
- U.S. leadership in AI: The success of Tulu 3 405B underscores the potential for the U.S. to lead in competitive, open-source AI, challenging traditional tech giants and companies like DeepSeek from China. It supports the growing movement towards open-source models that are accessible and customizable.
- Open-source shift: With open-source models like Tulu 3 405B becoming more accessible, AI development could see a shift toward greater transparency, allowing broader participation from developers and researchers worldwide. This could lead to more innovation and faster progress in AI research.
- Benchmarking for future AI: Tulu 3 405B sets a high bar for future models, providing a benchmark for competitive performance in specialized knowledge and math-related tasks. Its open-source release encourages experimentation and the adaptation of its underlying technology for new applications across industries.
The AI landscape is increasingly defined by the contrasting approaches of open source and closed source models. This article examines the nuances of each approach, exploring their benefits, challenges, and implications for businesses, developers, and the future of AI.