GPT and Claude failed Bridgewater's finance tests because the right answers were never public

The Decoder·Maximilian Schreiner

2h ago

7/3/2026 · ~1 min read

Quick Answer

Bridgewater and Thinking Machines Lab report that their fine-tuned open-weight model outperformed GPT and Claude at evaluating financial documents, and did so at a significantly lower cost.

Quick Take

Bridgewater and Thinking Machines Lab report that their fine-tuned open-weight model surpassed GPT and Claude in financial document evaluations at a significantly lower cost. The result highlights the limits of leading AI models on proprietary benchmarks whose answers are not publicly available.

Key Points

Bridgewater's analysis shows their model outperforms GPT and Claude.
The open-weight model is significantly cheaper than top AI models.
Performance metrics were based on proprietary financial benchmarks.
Current AI models struggle with evaluations lacking public answer keys.
This raises questions about the reliability of existing AI in finance.

Article Excerpt

From source RSS / original summary

The hedge fund Bridgewater and Thinking Machines Lab report that a fine-tuned open-weight model outperforms the most powerful AI models at evaluating financial documents, at a fraction of the cost. The figures come from their own analysis. The article "GPT and Claude failed Bridgewater's finance tests because the right answers were never public" appeared first on The Decoder.

Read on the-decoder.com

More from The Decoder

An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

The Decoder·Matthias Bastian

6d ago

AI Summary

Epoch AI's MirrorCode benchmark reveals Claude Opus 4.7 as the leader with a 56% solve rate, reconstructing a 16,000-line toolkit in 14 hours. Despite this, all models tested struggle with the most complex tasks, highlighting limitations in current AI capabilities. The single task consumed $2,600 over 19 days, raising questions about cost-effectiveness in AI development.

#LLM #AI Coding #Inference #AI Startup