LLMs Can Better Capture Human Judgments--With the Right Prompts

arXiv cs.CL·Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

6/12/2026

Quick Answer

This paper shows that large language models (LLMs) align better with human judgments when they are prompted well. Two things help: asking models to report standard deviations and response proportions, and making sure the scenarios are clear to human participants.

Quick Take

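With the right prompts, large language models (LLMs) capture human judgments more closely. Asking models to report standard deviations and response proportions recovers the full range of human responses better than common strategies, and scenarios that human participants find clear yield better model alignment. The authors show this across a U.S.-representative set of 144 moral scenarios and 38 moral beliefs surveyed in 32 countries, and conclude that asking better questions can yield better answers.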

Key Points

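Prompting LLMs to report standard deviations and response proportions recovers the full range of human responses better than common strategies.
Scenarios that are clear to human participants, as reflected in human confusion ratings, boost model alignment, and LLMs can track those confusion ratings.
LLMs predict human variability relatively well, but their estimates of their own error are poorly calibrated.
The two datasets are a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module, covering 32 countries.
Simple elicitation techniques help improve AI-human alignment.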
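
To make the first point concrete, here is a minimal Python sketch of how model-reported response proportions could be turned into a mean and standard deviation and compared with a human distribution. The 5-point scale, the numbers, and the use of total variation distance are assumptions chosen for illustration; the summary does not say which scale or alignment metric the paper uses.

```python
from math import sqrt


def mean_and_sd(proportions):
    """Return (mean, population SD) of a rating distribution.

    `proportions` maps each scale point (e.g. 1..5) to the share of
    respondents choosing it. Shares are expected to sum to 1.
    """
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"proportions sum to {total}, expected 1")
    mean = sum(k * p for k, p in proportions.items())
    variance = sum(p * (k - mean) ** 2 for k, p in proportions.items())
    return mean, sqrt(variance)


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up numbers, for illustration only (not data from the paper).
human = {1: 0.10, 2: 0.20, 3: 0.40, 4: 0.20, 5: 0.10}
model = {1: 0.05, 2: 0.25, 3: 0.40, 4: 0.25, 5: 0.05}
# A model that returns only a single "best" rating implies this distribution.
point_only = {3: 1.0}

print(mean_and_sd(human))
print(mean_and_sd(model))
print(total_variation(human, model))
print(total_variation(human, point_only))
```

In this toy case all three distributions share the same mean, so a mean-only comparison cannot tell them apart, while the distance to the human distribution is much larger for the single-rating answer than for the reported proportions.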


Article Excerpt

From source RSS / original summary

arXiv:2606.12754v1 (Announce Type: new)

Abstract: Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.
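
As an illustration of the elicitation idea in the abstract, the following Python sketch builds a prompt that asks for a mean, a standard deviation, and response proportions, then validates a JSON reply. The prompt wording, the 1-to-5 scale, the JSON keys, the example scenario, and the helper names `build_prompt` and `parse_reply` are all hypothetical; the excerpt does not give the paper's actual prompts. No model is called here, and the reply string is hand-written.

```python
import json


def build_prompt(scenario, scale_min=1, scale_max=5):
    """Build a distribution-eliciting prompt (hypothetical wording)."""
    points = ", ".join(str(k) for k in range(scale_min, scale_max + 1))
    return (
        f"Scenario: {scenario}\n"
        f"People rated this scenario on a scale from {scale_min} to {scale_max}.\n"
        "Estimate how a representative sample of people responded.\n"
        "Reply with JSON only, using these keys:\n"
        '  "mean": the average rating\n'
        '  "sd": the standard deviation of the ratings\n'
        f'  "proportions": an object mapping each of {points} '
        "to the share of people choosing it"
    )


def parse_reply(reply, scale_min=1, scale_max=5, tol=0.02):
    """Parse and validate a JSON reply; renormalize small rounding drift."""
    data = json.loads(reply)
    props = {int(k): float(v) for k, v in data["proportions"].items()}
    expected = set(range(scale_min, scale_max + 1))
    if set(props) != expected:
        raise ValueError(f"expected scale points {sorted(expected)}")
    if any(v < 0 for v in props.values()):
        raise ValueError("proportions must be non-negative")
    total = sum(props.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"proportions sum to {total}, expected about 1")
    props = {k: v / total for k, v in props.items()}
    return {"mean": float(data["mean"]), "sd": float(data["sd"]),
            "proportions": props}


# Made-up scenario and hand-written reply, for illustration only.
prompt = build_prompt("A person keeps extra change a cashier gave by mistake.")
reply = ('{"mean": 3.0, "sd": 1.1, "proportions": '
         '{"1": 0.1, "2": 0.2, "3": 0.4, "4": 0.2, "5": 0.1}}')
print(prompt)
print(parse_reply(reply))
```

Validating the reply before use is a design choice of this sketch: proportions that skip a scale point or do not sum to roughly 1 are rejected rather than silently repaired.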

