Re-Centering Humans in LLM Personalization

arXiv cs.CL·Lechen Zhang, Jiarui Liu, Tal August

6/8/2026·~2 min

Quick Answer

LLM personalization has mostly been evaluated on synthetic data. On data from real users, this study finds that models struggle to extract user attributes, to select the relevant ones, and to generate personalized responses that humans prefer over generic ones.

Quick Take

Evaluated on real human conversations rather than synthetic data, LLM personalization shows limitations at every stage. Two lightweight training-based interventions bring automated evaluation closer to human judgments for attribute extraction and attribute relevance, but learned reward models correlate only modestly with human ratings of personalized responses, which suggests that human-aligned quality judgments are hard to model directly.

Key Points

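Collected 550 human conversations, plus human judgments at three stages of personalization.
Models struggle to extract user attributes and disagree with humans about which attributes are relevant to a new prompt.
Humans rate personalized responses no better than generic ones, although LLM judges widely rate them as better.
Two lightweight training-based interventions move automated evaluation closer to human data in the first two stages.
Learned reward models correlate only modestly with human ratings, so human-aligned quality judgments are difficult to model directly.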

Article Content

From source RSS / original summary

arXiv:2606.06614v1

Abstract: Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data.

We collect 550 human conversations and human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage.
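The three stages described above can be pictured as a small pipeline. The following is only an illustrative sketch, not the authors' system: the class name `PersonalizationPipeline` and the `toy_*` functions are hypothetical stand-ins (simple string rules) for whatever models a real system would use at each stage.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PersonalizationPipeline:
    """Hypothetical three-stage pipeline: extract, select, respond."""
    extract: Callable[[List[str]], List[str]]      # stage 1: conversation -> attributes
    select: Callable[[List[str], str], List[str]]  # stage 2: attributes + prompt -> relevant
    respond: Callable[[str, List[str]], str]       # stage 3: prompt + relevant -> response

    def run(self, conversation: List[str], prompt: str) -> str:
        attributes = self.extract(conversation)
        relevant = self.select(attributes, prompt)
        return self.respond(prompt, relevant)


def _words(text: str) -> set:
    # Lowercase tokens with surrounding punctuation stripped.
    return {w.strip(".,!?").lower() for w in text.split()}


def toy_extract(conversation: List[str]) -> List[str]:
    # Toy rule (assumption): first-person statements count as user attributes.
    return [turn for turn in conversation if turn.startswith("I ")]


def toy_select(attributes: List[str], prompt: str) -> List[str]:
    # Toy rule (assumption): an attribute is relevant if it shares a word with the prompt.
    prompt_words = _words(prompt)
    return [a for a in attributes if _words(a) & prompt_words]


def toy_respond(prompt: str, relevant: List[str]) -> str:
    # Fall back to a generic response when no attribute was selected.
    if not relevant:
        return f"Generic answer to: {prompt}"
    return f"Answer to: {prompt} (using: {'; '.join(relevant)})"


pipeline = PersonalizationPipeline(toy_extract, toy_select, toy_respond)
conversation = ["I am vegetarian.", "What's the weather?", "I live in Denver."]
print(pipeline.run(conversation, "Suggest a vegetarian dinner recipe"))
```

Keeping the stages as separate callables mirrors the way the study collects human judgments per stage: each stage's output can be compared with human annotations on its own.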

Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages.

However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.
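To make "correlation with human ratings" concrete, here is a minimal sketch of one common way to compare a reward model's scores with human ratings. The abstract does not say which correlation measure was used, so Spearman rank correlation is an assumption here, and the ratings and scores below are made-up toy values, not data from the paper.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only (hypothetical, not from the paper).
human_ratings = [4, 2, 5, 3, 1, 4]
reward_scores = [0.7, 0.6, 0.8, 0.2, 0.3, 0.5]
print(f"Spearman correlation: {spearman(human_ratings, reward_scores):.3f}")
```

A rank-based measure is a natural fit when human ratings are ordinal (for example, Likert-style scores), since it only asks whether the reward model orders responses the way humans do.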

