GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

arXiv cs.CL·Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Chenglong Song, Yue Liu

5/29/2026

Quick Take

GrowLoop is a self-evolving conversation evaluation system designed to keep pace with advancing large language models (LLMs). Starting from a small set of human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Applied to human-likeness in open-ended conversation, the generated rubrics substantially outperform existing methods in alignment with human judgments and surface issues that annotators overlook.

Key Points

GrowLoop uses minimal human seed annotations to initiate evaluation.
It employs Heuristic Learning to iteratively refine evaluation rubrics.
The system adapts to evolving criteria of human-likeness in conversations.
It uncovers issues that human annotators overlook.
The benchmark effectively discriminates models across different capability tiers.

Article Content

From source RSS / original summary

arXiv:2605.28882v1. Abstract: With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others.

Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-like is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously.

Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. With minimal human seed annotations as the first mover, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge.

Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution, expanded through new seeds when the evaluation target moves. Applied to human-likeness evaluation in open-ended conversation, the generated rubrics not only substantially outperform existing methods in alignment with human judgments, but also uncover issues that annotators overlook.

The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.
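The abstract's acceptance criterion (agreement where annotators converge, plausibility where they diverge) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the function names `is_acceptable` and `acceptance_rate`, the string labels, the 0.8 consensus threshold, and the reading of "plausible" as "matches a label at least one annotator chose".

```python
from collections import Counter


def is_acceptable(rubric_verdict, annotator_labels, consensus_threshold=0.8):
    """Check one rubric verdict against the human labels for a single case.

    Hypothetical helper: the threshold and the plausibility rule are
    illustrative assumptions, not the paper's definitions.
    """
    if not annotator_labels:
        raise ValueError("each case needs at least one human label")
    counts = Counter(annotator_labels)
    majority_label, majority_count = counts.most_common(1)[0]
    if majority_count / len(annotator_labels) >= consensus_threshold:
        # Annotators converge: the verdict must match the consensus label.
        return rubric_verdict == majority_label
    # Annotators diverge: any label chosen by some annotator counts as plausible.
    return rubric_verdict in counts


def acceptance_rate(verdicts, labels_per_case, consensus_threshold=0.8):
    """Fraction of cases whose rubric verdict passes the check above."""
    results = [
        is_acceptable(verdict, labels, consensus_threshold)
        for verdict, labels in zip(verdicts, labels_per_case)
    ]
    return sum(results) / len(results)


# Toy data: the first case has full consensus, the second is split 3 to 2.
cases = [
    ["human", "human", "human", "human", "human"],
    ["human", "ai", "ai", "human", "ai"],
]
print(acceptance_rate(["human", "human"], cases))
```

Under these assumptions, a verdict of "ai" on the first case would be rejected because it contradicts a unanimous label, while either "human" or "ai" passes on the second case because annotators legitimately disagree there.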
