OpenAI and Paradigm Launch EVMbench Smart Contract AI Benchmark

Key Takeaway:

OpenAI and crypto investment firm Paradigm have jointly released EVMbench, a benchmark showing that AI agents can now drain funds from vulnerable blockchain smart contracts in 72.2% of exploit tests. The same benchmark also serves as a critical defensive tool for the industry.

EVMbench – Key Points

EVMbench tests AI agents across three smart contract security tasks. The three modes are Detect (finding vulnerabilities), Patch (fixing flaws without breaking functionality), and Exploit (executing controlled fund-draining attacks in a sandbox). Its 120 vulnerabilities come from 40 real security audits. Because smart contracts are immutable after deployment, a single undetected flaw can cause irreversible, large-scale financial losses.

The benchmark draws from the Tempo blockchain, co-developed by Paradigm and Stripe. Tempo is a Layer 1 blockchain built for high-speed stablecoin payments, and several scenarios were taken directly from its security review. This grounds EVMbench in one of the fastest-growing areas of on-chain finance.

GPT-5.3-Codex hits a 72.2% exploit success rate, up from just 31.9% six months ago. Capability more than doubled in half a year, measured by comparing GPT-5.3-Codex with its predecessor GPT-5, and this is perhaps the most striking finding in the entire benchmark. AI is getting dramatically better at hacking blockchain code, fast.
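
As a quick check on the size of that jump, the two reported rates can be compared directly. The short Python sketch below uses only the figures quoted above; the variable names are illustrative and are not part of EVMbench.

```python
# Exploit success rates as reported in the article, in percent.
previous_rate = 31.9  # GPT-5
current_rate = 72.2   # GPT-5.3-Codex

# Absolute change, in percentage points.
point_gain = current_rate - previous_rate

# Relative change, as a multiple of the earlier rate.
ratio = current_rate / previous_rate

print(f"Gain: {point_gain:.1f} percentage points")  # Gain: 40.3 percentage points
print(f"Ratio: {ratio:.2f}x")                       # Ratio: 2.26x
```

On these figures the success rate rose by about 40 percentage points, or roughly 2.26 times the earlier rate.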

AI agents are better at attacking than defending. Exploit tasks have a clear finish line — drain the funds — so agents iterate until they succeed. Detection and patching are harder: agents often stop after finding one issue instead of auditing the full codebase, and fixing vulnerabilities without breaking contract logic remains a significant technical challenge.

Two real-world DeFi hacks give the benchmark urgent context. DeFi lending protocol Moonwell was exploited through vulnerable code written with AI assistance. Cross-chain protocol CrossCurve lost approximately $3 million to a smart contract vulnerability. Both incidents occurred around the time of the EVMbench launch and illustrate exactly what is at stake.

Anthropic also concluded AI can independently find smart contract vulnerabilities. In a report published late last year, Anthropic found that AI agents had already reached the capability threshold needed to identify blockchain security flaws on their own — reinforcing the competitive and defensive urgency behind OpenAI’s benchmark.

OpenAI is committing $10M in API credits to cyber defense. Expanding its 2023 Cybersecurity Grant Program, OpenAI is directing resources toward open-source software and critical infrastructure protection. It is also expanding Aardvark, its dedicated security research AI agent, and offering free codebase scanning to open-source maintainers.

EVMbench has real limitations worth noting. It cannot reliably distinguish genuine vulnerabilities from false positives in detect mode, excludes timing-dependent exploits, and runs only on single-chain sandbox environments rather than live mainnet conditions. Heavily audited, widely deployed contracts are likely harder to crack than those tested here.

Why This Matters:

Over $100 billion in crypto assets run on smart contracts powering decentralized exchanges, lending platforms, and on-chain financial applications. AI can now exploit nearly three-quarters of tested vulnerabilities, and that capability is accelerating. The Moonwell and CrossCurve hacks show the damage is already real. EVMbench gives developers and auditors a measurable standard to track these risks and a framework to deploy AI defensively before attackers turn the same capability against them.

This article was drafted with the assistance of generative AI. All facts and details were reviewed and confirmed by an editor prior to publication.