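GPT-5 coding benchmark reversal! It fails on the surface, but 63.1% of its tasks were never submitted; counting everything, its score is double Claude's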
Quick Answer
Scale AI's SWE-Bench Pro benchmark shows a sharp performance drop for leading AI models: none of GPT-5, Claude Opus 4.1, and Gemini 2.5 exceeds a 25% resolve rate.
Quick Take
On the surface, all three leading models fall short on SWE-Bench Pro. The headline number hides a detail, though: GPT-5 left 63.1% of tasks unsubmitted. According to the source headline, once this is taken into account GPT-5's score is roughly double Claude's.
Key Points
- None of GPT-5, Claude Opus 4.1, and Gemini 2.5 exceeded a 25% resolve rate.
- GPT-5 left 63.1% of tasks unsubmitted, which drags down its headline score.
- Taking unsubmitted tasks into account, the source puts GPT-5's score at about double Claude's.
- The benchmark highlights how hard realistic software engineering tasks remain for leading AI models.
- SWE-Bench Pro suggests that submission behavior, not only correctness, shapes a model's reported score.
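As a rough sanity check on these figures, the arithmetic can be sketched in Python. The sketch assumes that unsubmitted tasks count as failures in the overall resolve rate, which the excerpt does not state explicitly, and the function name `submitted_only_rate` is hypothetical. The only inputs are the two numbers quoted in the source: the 63.1% unsubmitted share from the headline and the 25% ceiling on the overall resolve rate.

```python
def submitted_only_rate(overall_rate: float, unsubmitted_share: float) -> float:
    """Resolve rate among submitted tasks only.

    Assumption (not confirmed by the source): every unsubmitted task
    is scored as a failure in the overall resolve rate.
    """
    submitted_share = 1.0 - unsubmitted_share
    if submitted_share <= 0:
        raise ValueError("no tasks were submitted")
    return overall_rate / submitted_share


# 0.631 is the unsubmitted share from the headline; 0.25 is the ceiling
# on the overall resolve rate, so the result is an upper bound.
upper_bound = submitted_only_rate(overall_rate=0.25, unsubmitted_share=0.631)
print(f"Submitted-only resolve rate is at most {upper_bound:.1%}")
```

Under that assumption, a model that skips most tasks can post a low overall score while still being fairly accurate on the tasks it does attempt, which is the gap the headline points at.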
Article Excerpt
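From source RSS / original summary: Scale AI's new software engineering benchmark, PRO, has delivered a twist! On the surface, the "big three" all flopped, with none posting a resolve rate above 25%: GPT-5, Claude Opus 4.1, and Gemini 2.5 respectively…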