Claude Sonnet 4.5 leads SWE-Bench Verified at 64.2% · DeepSignal
Claude Sonnet 4.5 lifts its SWE-Bench Verified score to 64.2% and adds a 200K-token context option.
Key Points
- +10.5 pt on SWE-Bench Verified vs Sonnet 4.
- 200K-token context option in the API.
- Same price tier as Sonnet 4.
Score: 100 (≥75 high · 50–74 medium · <50 low)
Why Featured
SWE-Bench Verified is the clearest agent-coding signal; a 10.5-point jump is a major reset for tooling builders.