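GPT-5 coding benchmark twist! A failing grade on the surface, but 63.1% of its tasks were never submitted; count everything and its score is double Claude's | AI Deep Signal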

9/24/2025

Quick Answer

The SWE-BENCH PRO benchmark reveals a surprising performance drop for leading AI models, with GPT-5, Claude Opus 4.1, and Gemini 2.5 all failing to exceed a 25% task completion rate.

Quick Take

On SWE-BENCH PRO, GPT-5, Claude Opus 4.1, and Gemini 2.5 all fail to exceed a 25% resolve rate. The source headline adds a twist: 63.1% of GPT-5's tasks were never submitted, and once that is taken into account its score comes out at roughly twice Claude's.

Key Points

GPT-5, Claude Opus 4.1, and Gemini 2.5 all scored below 25% in task completion.
According to the source headline, 63.1% of GPT-5's tasks were never submitted.
Once unsubmitted tasks are accounted for, GPT-5's score is roughly double Claude's.
The benchmark highlights a major challenge for leading AI models in software engineering tasks.
SWE-BENCH PRO reveals critical insights into AI model performance in programming.
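
The gap between an overall score and one that accounts for unsubmitted tasks comes down to which denominator is used. A minimal sketch in Python, assuming hypothetical counts (the function name `resolve_rates` and the numbers 100, 20, and 60 are illustrative only and are not SWE-BENCH PRO results):

```python
def resolve_rates(total: int, resolved: int, unsubmitted: int) -> tuple[float, float]:
    """Return (overall_rate, submitted_only_rate) for one model.

    overall_rate        = resolved / all tasks
    submitted_only_rate = resolved / tasks that were actually submitted
    """
    if total <= 0:
        raise ValueError("total must be positive")
    submitted = total - unsubmitted
    if submitted <= 0:
        raise ValueError("at least one task must be submitted")
    if resolved > submitted:
        raise ValueError("resolved tasks cannot exceed submitted tasks")
    return resolved / total, resolved / submitted


# Hypothetical counts, chosen only to illustrate the arithmetic.
overall, submitted_only = resolve_rates(total=100, resolved=20, unsubmitted=60)
print(f"overall: {overall:.0%}, submitted only: {submitted_only:.0%}")
```

With these made-up counts, a model that looks weak against all tasks looks much stronger against only the tasks it submitted, which is the kind of reversal the headline describes.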

Article Excerpt

From source RSS / original summary

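Scale AI's new software engineering benchmark, SWE-BENCH PRO, comes with a twist! On the surface, the "big three" all stumbled, with none of them posting a resolve rate above 25%: GPT-5, Claude Opus 4.1, and Gemini 2.5 respectively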

Read on qbitai.com
