What Drives Interactive Improvement from Feedback?
Quick Answer
The study finds that multi-turn language agents gain little from self-generated feedback beyond unguided self-refinement, whereas strong external teachers produce substantially larger gains. The student's ability to act on feedback, not merely its availability, is the central bottleneck.
Quick Take
A controlled evaluation across benchmarks such as Omni-MATH and Codeforces indicates that useful feedback must provide specific guidance beyond a generic retry.
Key Points
- Multi-turn improvement is often not evidence that feedback was used; it can come from resampling, format correction, or extra test-time computation.
- Self-generated feedback adds little beyond unguided self-refinement.
- Strong external teachers yield significantly larger feedback-specific gains.
- Interactive gains depend more on the student's ability to use feedback than on the teacher's identity, although teacher choice still matters for a fixed student.
- The controlled evaluation framework is available at https://j-lojek.github.io/feedback-generation-is-a-bottleneck/
Article Content
From the source RSS / original summary (arXiv:2606.30774v1). Abstract: We study when natural-language feedback produces improvement beyond the gains obtainable from repeated attempts alone. In multi-turn language-agent settings, higher final accuracy can reflect useful feedback, but it can also arise from resampling, format correction, or additional test-time computation.
To separate these effects, we introduce a controlled student-teacher protocol across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1, evaluating thirteen open-weight models in both student and teacher roles. We compare external feedback, self-feedback, and unguided self-refinement, while varying interaction history, task difficulty, and teacher access to privileged task information.
Across settings, we find that multi-turn improvement is often not evidence of feedback use: self-generated feedback adds little beyond unguided self-refinement, whereas the strongest external teachers produce substantially larger feedback-specific gains, suggesting that useful feedback must provide guidance beyond generic retry.
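The comparison against a repeated-attempt baseline can be illustrated with a small sketch. It assumes that a "feedback-specific gain" is the final accuracy under a feedback condition minus the final accuracy under unguided self-refinement; the abstract does not give the exact formula, so this definition, the `ConditionResult` container, and all counts below are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConditionResult:
    """Final-turn outcome for one interaction condition (hypothetical container)."""
    name: str
    solved: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.solved / self.total


def feedback_specific_gain(with_feedback: ConditionResult,
                           retry_baseline: ConditionResult) -> float:
    """Accuracy beyond what repeated attempts alone achieve (assumed definition)."""
    return with_feedback.accuracy - retry_baseline.accuracy


# Made-up counts purely for illustration; NOT numbers from the paper.
single_turn = ConditionResult("single_turn", 40, 100)
self_refine = ConditionResult("unguided_self_refinement", 50, 100)
self_feedback = ConditionResult("self_feedback", 52, 100)
external = ConditionResult("external_feedback", 65, 100)

# Naive view: compare the multi-turn result to a single attempt.
raw_gain = external.accuracy - single_turn.accuracy

# Controlled view: subtract the gain that retries alone would have produced.
external_gain = feedback_specific_gain(external, self_refine)
self_gain = feedback_specific_gain(self_feedback, self_refine)

print(f"raw multi-turn gain:         {raw_gain:.2f}")
print(f"external feedback-specific:  {external_gain:.2f}")
print(f"self-feedback-specific:      {self_gain:.2f}")
```

In this toy setup, part of the raw multi-turn gain is already explained by the retry baseline, which is why the baseline has to be subtracted before crediting the feedback itself.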
Dense student-teacher interaction matrices further show that interactive gains are driven more by the student's ability to use feedback than by the teacher's identity, although teacher choice remains important for a fixed student. These results suggest that feedback-based agents should be evaluated against repeated-attempt baselines, and that ability to act on feedback, not merely feedback availability, is a central bottleneck for interactive improvement.
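One rough way to read a student-teacher interaction matrix is to compare how much the average gain varies across students with how much it varies across teachers. The sketch below is only an illustration of that idea under assumed inputs: the matrix values are invented, the helper names are hypothetical, and the paper may analyze its matrices differently.

```python
from statistics import mean, pvariance

# Rows = students, columns = teachers.
# Invented feedback-specific gains for illustration; NOT data from the paper.
gains = [
    [0.02, 0.04, 0.06],
    [0.10, 0.12, 0.14],
    [0.20, 0.22, 0.24],
]


def student_teacher_spread(matrix):
    """Return (variance of per-student means, variance of per-teacher means)."""
    student_means = [mean(row) for row in matrix]
    teacher_means = [mean(col) for col in zip(*matrix)]
    return pvariance(student_means), pvariance(teacher_means)


def best_teacher_for(matrix, student_index):
    """Index of the teacher giving the largest gain for one fixed student."""
    row = matrix[student_index]
    return max(range(len(row)), key=row.__getitem__)


student_var, teacher_var = student_teacher_spread(gains)
print(f"spread across students: {student_var:.5f}")
print(f"spread across teachers: {teacher_var:.5f}")
print(f"best teacher for student 0: {best_teacher_for(gains, 0)}")
```

In this invented matrix the spread across students is much larger than the spread across teachers, yet each row still has a best teacher, which mirrors the two claims made in the paragraph above.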
We release our controlled student-teacher evaluation framework at https://j-lojek.github.io/feedback-generation-is-a-bottleneck/.