Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

arXiv cs.AI · Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan (Emily) Xue

5/22/2026

Quick Answer

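The paper presents the Insights Generator (IG), a multi-agent system that diagnoses LLM agent failures by proposing and testing hypotheses across a corpus of execution traces. Human experts who acted on IG reports improved scaffold performance by 30.4 percentage points over the unmodified baseline scaffold.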

Quick Take

IG replaces manual, trace-by-trace inspection with automated hypothesis testing over the whole trace corpus, and it reports each finding together with its supporting evidence. Beyond the expert-driven scaffold gains, coding agents that used IG-derived insights showed consistent and stable improvements. IG's scout-investigator architecture reaches detection coverage comparable to competing approaches, and domain experts rated its reports as leading in depth and evidence quality.

Key Points

IG automates the diagnosis of LLM agent failures across large execution trace corpora.
Human experts using IG reports improved scaffold performance by 30.4 percentage points over the unmodified baseline scaffold.
IG's scout-investigator architecture achieves detection coverage comparable to competing diagnostic approaches.
Domain experts rated IG reports as leading in depth and evidence quality.
The system produces grounded natural-language insights linked to supporting evidence.
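
A gain stated in percentage points is an absolute difference between two rates, not a relative improvement. The sketch below illustrates the distinction. The summary does not state the baseline score, so the 40.0% baseline and the function names are hypothetical and chosen only to make the arithmetic concrete.

```python
def percentage_point_gain(baseline_pct: float, improved_pct: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(improved_pct - baseline_pct, 1)


def relative_gain_pct(baseline_pct: float, improved_pct: float) -> float:
    """Improvement expressed as a percentage of the baseline rate."""
    return round((improved_pct - baseline_pct) / baseline_pct * 100, 1)


# Hypothetical baseline: the source reports only the 30.4pp gain,
# not the underlying scores.
baseline = 40.0
improved = 70.4

pp_gain = percentage_point_gain(baseline, improved)
rel_gain = relative_gain_pct(baseline, improved)
print(f"{pp_gain} percentage points, {rel_gain}% relative to baseline")
```

The same percentage-point gain corresponds to a different relative improvement depending on where the baseline sits, which is why the baseline matters when comparing results.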

[Submitted on 20 May 2026 (v1), last revised 21 May 2026 (this version, v2)]

Abstract: Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading in depth and evidence quality.
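
The idea of testing a hypothesis across trace groups and linking the result to supporting evidence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Trace` and `Insight` types, the `evaluate_hypothesis` function, the pass/fail grouping, and the toy corpus are all hypothetical, and a real system would presumably use LLM agents rather than a hand-written predicate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    trace_id: str
    group: str          # hypothetical grouping, e.g. "pass" or "fail"
    steps: list[str]    # simplified stand-in for a full execution trace


@dataclass
class Insight:
    claim: str
    support_by_group: dict[str, float]   # fraction of traces matching, per group
    evidence: dict[str, list[str]]       # example trace IDs backing the claim


def evaluate_hypothesis(
    claim: str,
    predicate: Callable[[Trace], bool],
    corpus: list[Trace],
    max_evidence: int = 3,
) -> Insight:
    """Check a hypothesis on every trace and summarize support per group."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    evidence: dict[str, list[str]] = {}
    for trace in corpus:
        totals[trace.group] = totals.get(trace.group, 0) + 1
        if predicate(trace):
            hits[trace.group] = hits.get(trace.group, 0) + 1
            ids = evidence.setdefault(trace.group, [])
            if len(ids) < max_evidence:
                ids.append(trace.trace_id)
    support = {g: hits.get(g, 0) / n for g, n in totals.items()}
    return Insight(claim, support, evidence)


def repeats_same_step(trace: Trace) -> bool:
    """Hypothetical pattern: the agent repeats an action back to back."""
    return any(a == b for a, b in zip(trace.steps, trace.steps[1:]))


# Toy corpus invented for illustration only.
corpus = [
    Trace("t1", "fail", ["search", "search", "answer"]),
    Trace("t2", "fail", ["open", "open", "open"]),
    Trace("t3", "pass", ["search", "open", "answer"]),
    Trace("t4", "pass", ["search", "search", "answer"]),
]

insight = evaluate_hypothesis(
    "Failing traces repeat the same step back to back",
    repeats_same_step,
    corpus,
)
print(insight.support_by_group, insight.evidence)
```

Comparing support rates between groups is what separates a corpus-level pattern from an anecdote found in one trace, and keeping the trace IDs alongside each rate is one simple way to keep an insight grounded in evidence.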

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2605.21347 [cs.AI]
	(or arXiv:2605.21347v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.21347 arXiv-issued DOI via DataCite

Submission history

From: Veronica Chatrath
[v1] Wed, 20 May 2026 16:13:53 UTC (1,429 KB)
[v2] Thu, 21 May 2026 16:51:51 UTC (1,429 KB)

— Originally published at arxiv.org
