Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

arXiv cs.CL·Indu Panigrahi, Tal August

6/8/2026

Quick Answer

This paper proposes an evaluation framework that tests whether large language models (LLMs) can generate multiple responses to one query that differ in language complexity.

Quick Take

A new evaluation framework tests whether large language models (LLMs) can answer a single query with multiple responses that differ along one interpretable axis, language complexity. In tests of GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 on 98 scientific queries, even the best-performing model, Claude Sonnet 4.5, shifted reliable complexity measures in the correct direction only 46% of the time.

Key Points

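Evaluated models are GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1.
The framework is based on a formative study with 16 participants and targets one axis of language: complexity.
Each model generated 5 responses at different complexity levels for each of 98 scientific queries.
The best-performing model, Claude Sonnet 4.5, shifted reliable complexity measures in the correct direction only 46% of the time.
The authors argue that evaluations must incorporate interface-specific criteria as models move beyond a single, static chat interface.
Findings hold with an increased sample size and alternative complexity levels.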

Article Content

From source RSS / original summary

arXiv:2606.06788v1. Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria.

We propose a new evaluation framework based on a formative study with 16 participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries.

While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction 46% of the time. Our findings hold with increased sample size and alternative complexity levels.
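To make "shifting in the correct direction" concrete, here is a minimal Python sketch of one way such a rate could be computed. It is not the authors' method: the abstract does not name its complexity measures or define the 46% figure precisely, so the proxy (`mean_sentence_length`), the scoring rule (`directional_consistency`), and the sample responses are all hypothetical and chosen only for illustration.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Crude complexity proxy (an assumption): average words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def directional_consistency(scores: list[float]) -> float:
    """Fraction of adjacent level pairs whose score strictly increases.

    `scores` must be ordered from the simplest requested level to the
    most complex one.
    """
    pairs = list(zip(scores, scores[1:]))
    if not pairs:
        raise ValueError("need scores for at least two levels")
    return sum(b > a for a, b in pairs) / len(pairs)


# Hypothetical responses to one query, ordered from simplest to most complex.
responses = [
    "Plants eat light. They make food.",
    "Plants use light to make their own food.",
    "Plants turn light into sugar. It feeds them.",
    "Plants convert light energy into chemical energy stored in sugar molecules.",
    "Photosynthesis converts absorbed light energy into chemical energy "
    "by fixing atmospheric carbon dioxide into carbohydrates.",
]

scores = [mean_sentence_length(r) for r in responses]
rate = directional_consistency(scores)
print(scores)  # [3.0, 8.0, 4.0, 11.0, 15.0]
print(rate)    # 0.75
```

In this toy case the third response is shorter per sentence than the second, so one of the four adjacent steps moves the wrong way and the rate is 0.75. A real evaluation would presumably use validated complexity measures and aggregate over all queries rather than a single example.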

