Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
Quick Answer
This paper proposes an evaluation framework that tests whether large language models (LLMs) can generate multiple responses to one query that differ in language complexity.
Quick Take
A new evaluation framework tests whether large language models (LLMs) can answer a single query at several levels of language complexity. Testing GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 on 98 scientific queries, with 5 responses per query, showed that even the best model, Claude Sonnet 4.5, shifted reliable complexity measures in the intended direction only 46% of the time.
Key Points
- Evaluated models include GPT-5.1, GPT-5 mini, Claude Sonnet 4.5, and DeepSeek-V3.1.
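- The framework builds on a formative study with 16 participants and uses language complexity as the axis along which responses should differ.
- Claude Sonnet 4.5, the best-performing model, shifted reliable complexity measures in the correct direction only 46% of the time.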
- The evaluation framework emphasizes interface-specific criteria for LLM assessments.
- Findings remain consistent with larger sample sizes and alternative complexity levels.
Article Content
From source RSS / original summary: arXiv:2606.06788v1. Announce Type: new. Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria.
We propose a new evaluation framework based on a formative study with 16 participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries.
While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction 46% of the time. Our findings hold with increased sample size and alternative complexity levels.
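To make the idea of "shifting a complexity measure in the correct direction" concrete, here is a minimal Python sketch. It is not the paper's method: the abstract does not say which complexity measures were used or how the 46% figure was aggregated. The proxy (mean words per sentence), the function names, and the five sample responses are all hypothetical, chosen only for illustration.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Hypothetical complexity proxy: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def directional_consistency(scores: list[float]) -> float:
    """Fraction of adjacent level pairs whose score rises with the requested level.

    `scores` must be ordered from the simplest requested level to the most complex.
    """
    pairs = list(zip(scores, scores[1:]))
    if not pairs:
        return 0.0
    return sum(1 for lower, higher in pairs if higher > lower) / len(pairs)


# Made-up responses to one query, ordered from simplest to most complex request.
responses = [
    "Plants eat light. They make food.",
    "Plants use sunlight to make their own food.",
    "Plants make sugar. They need light and water.",
    "Plants convert light energy into chemical energy stored in sugar molecules.",
    "Photosynthesis converts light energy into chemical energy through "
    "reactions in the chloroplast that fix carbon dioxide into sugars.",
]

scores = [mean_sentence_length(r) for r in responses]
consistency = directional_consistency(scores)
print(scores)       # [3.0, 8.0, 4.0, 11.0, 18.0]
print(consistency)  # 0.75
```

In this toy case the third response is simpler than the second by the proxy, so only three of the four adjacent steps move in the requested direction. A real evaluation would presumably average such a score over many queries and over several complexity measures.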