AI Auditing Goes Practical: OpenAI Releases EVMbench to Benchmark Smart Contract Security

OpenAI has teamed up with Paradigm to launch EVMbench, a benchmark that tests AI agents' attack and defense capabilities on EVM smart contracts and shows where they are strong and where they are weak.

Testing in Economically Meaningful Environments: OpenAI and Paradigm Target On-Chain Security Evaluation

Leading AI company OpenAI announced a partnership with well-known cryptocurrency venture capital firm Paradigm and security firm OtterSec to launch EVMbench, a benchmark tool designed to evaluate the security performance of AI agents in Ethereum Virtual Machine (EVM) smart contracts.

As AI and blockchain technologies converge, smart contracts have become core infrastructure securing more than $100 billion in open-source crypto assets. The release of EVMbench signals that the industry is starting to measure AI's practical capabilities in economically meaningful environments.

The OpenAI team notes that, as AI agents rapidly improve at coding and planning, these models will play a transformative role in blockchain attack and defense. A standardized evaluation framework is therefore crucial for tracking AI progress.

Three Deep Testing Modes with 120 Real Audit Vulnerabilities as the Benchmark

EVMbench’s core design centers around 120 high-risk vulnerabilities extracted from 40 professional audit reports. Data sources include well-known public audit competitions like Code4rena, ensuring testing scenarios closely resemble real-world complexity. The benchmark evaluates AI agents in three different operational modes:

Image source: OpenAI. EVMbench's core design evaluates AI agents in three different modes.

  • The first is “Detection Mode,” where AI audits contract codebases and identifies known vulnerabilities, assigning scores based on the severity of issues found;
  • The second is “Patch Mode,” challenging AI to remove exploitable vulnerabilities and repair code without altering existing functionality;
  • The final, highly controversial mode is “Exploit Mode,” where AI must execute end-to-end fund theft attacks within sandboxed blockchain environments.
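The three modes can be read as three different pass criteria. The sketch below is only an illustration of that idea, not EVMbench's actual grader: the severity weights, the `Vulnerability` type, and all function names are hypothetical, and the article does not say how scores are computed beyond detection scores depending on severity.

```python
from dataclasses import dataclass

# Hypothetical severity weights (assumption): the article only states that
# detection scores depend on the severity of the issues found.
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


@dataclass(frozen=True)
class Vulnerability:
    vuln_id: str
    severity: str


def detection_score(known, reported_ids):
    """Severity-weighted share of the known vulnerabilities the agent reported."""
    total = sum(SEVERITY_WEIGHT[v.severity] for v in known)
    found = sum(SEVERITY_WEIGHT[v.severity] for v in known if v.vuln_id in reported_ids)
    return found / total if total else 0.0


def patch_passes(functional_tests_pass, exploit_still_works):
    """A patch counts only if existing behaviour is kept and the exploit is blocked."""
    return functional_tests_pass and not exploit_still_works


def exploit_succeeds(attacker_before, attacker_after, min_gain=1):
    """An exploit counts only if the attacker's sandbox balance actually grew."""
    return attacker_after - attacker_before >= min_gain


known = [
    Vulnerability("V1", "high"),
    Vulnerability("V2", "high"),
    Vulnerability("V3", "medium"),
]

# An agent that stops after its first finding covers only part of the weight.
single_finding = detection_score(known, {"V1"})
all_findings = detection_score(known, {"V1", "V2", "V3"})
```

Under these assumed weights, an agent that reports one high-severity issue and stops earns well under half of the available detection score, while the patch and exploit checks are simple pass or fail conditions.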

To ensure rigorous and repeatable testing, the team developed a Rust-based testing framework that uses deterministic transaction replay techniques to verify whether AI’s attacks or patches succeed.
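To give a feel for why deterministic replay makes grading repeatable, here is a minimal sketch on a toy balance ledger. It is written in Python rather than Rust, it does not model the EVM, and every name in it (`apply_tx`, `replay`, `state_digest`, the `vault` and `attacker` accounts) is a hypothetical stand-in rather than part of the actual framework.

```python
import hashlib
import json


def apply_tx(state, tx):
    """Apply one transfer to a copy of the balances; reject overdrafts."""
    sender, receiver, amount = tx["from"], tx["to"], tx["amount"]
    if amount < 0 or state.get(sender, 0) < amount:
        raise ValueError(f"invalid transfer: {tx}")
    new_state = dict(state)
    new_state[sender] = new_state.get(sender, 0) - amount
    new_state[receiver] = new_state.get(receiver, 0) + amount
    return new_state


def replay(initial_state, txs):
    """Replay transactions in order from a fixed starting state."""
    state = dict(initial_state)
    for tx in txs:
        state = apply_tx(state, tx)
    return state


def state_digest(state):
    """Stable fingerprint of the final state (keys sorted before hashing)."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


initial = {"vault": 1000, "attacker": 0}
attack_txs = [
    {"from": "vault", "to": "attacker", "amount": 400},
    {"from": "vault", "to": "attacker", "amount": 600},
]

# Replaying the same transactions from the same state gives the same result,
# so a grader can check the outcome without trusting the agent's own report.
first_run = replay(initial, attack_txs)
second_run = replay(initial, attack_txs)
vault_drained = first_run["vault"] == 0
```

Because the replay has no hidden inputs such as wall-clock time or randomness, two runs produce identical state digests, which is the property that lets an attack or a patch be verified repeatedly.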

Strong on Attack, Weak on Defense: GPT-5.3-Codex Shows Remarkable Growth in Exploits

Initial test results reveal a clear performance gap across different tasks. The latest GPT-5.3-Codex performs exceptionally well in Exploit Mode, scoring as high as 72.2%, a dramatic improvement compared to GPT-5, released just six months earlier, which scored 31.9%.
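As a quick check on the size of that jump, the two reported Exploit Mode scores can be compared directly. The variable names are illustrative; the only inputs are the 72.2% and 31.9% figures quoted above.

```python
# Exploit Mode scores as reported in the article (percent).
gpt_5_3_codex_score = 72.2
gpt_5_score = 31.9

# Absolute gain in percentage points, and the relative multiple.
gain_points = round(gpt_5_3_codex_score - gpt_5_score, 1)  # 40.3
ratio = round(gpt_5_3_codex_score / gpt_5_score, 2)        # 2.26

print(f"Gain: {gain_points} percentage points, about {ratio}x the earlier score")
```

In other words, the newer model's Exploit Mode score is roughly 2.3 times the earlier one, a gain of about 40 percentage points.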

Image: Overview of scores for various AI models across the three modes

This indicates that when the goal is explicitly “draining funds,” AI demonstrates strong iterative planning and execution capabilities. On the defense side, performance is weaker. In Detection Mode, AI often stops searching after finding a single flaw, and in Patch Mode it struggles to fully fix complex logic without affecting normal contract operation. Security experts are concerned that AI could significantly shorten the time from vulnerability discovery to a working attack, raising the bar for DeFi project defenses.

Talent Acquisition and Defense Funding, OpenAI’s Strategy for AI Agent Ecosystem Security

Beyond tool development, OpenAI is actively investing in talent and ecosystem defense. It recently hired Peter Steinberger, creator of the open-source AI agent project OpenClaw, to lead the development of next-generation personalized agents, with the project moving to an OpenAI-supported foundation.

To address potential cybersecurity risks posed by AI, OpenAI commits to a $10 million API budget through its cybersecurity grant program to support open-source defense tools and critical infrastructure research. This move is particularly timely following the recent Moonwell protocol incident, where a coding error in AI-generated code caused approximately $1.78 million in losses.

Looking ahead, as more AI-assisted stablecoin payment agents and automated wallets join the ecosystem, tools like EVMbench will be needed to tell apart models that merely describe vulnerabilities from those that can reliably deliver defenses. That distinction is likely to become a critical turning point in blockchain security.
