Re-Centering Humans in LLM Personalization
Quick Answer
This study reveals limitations at every stage of LLM personalization when systems are evaluated on real human conversations rather than synthetic data: attribute extraction, attribute relevance matching, and response generation.
Quick Take
Evaluating on real human conversations instead of synthetic data exposes weaknesses at each of the three stages of personalization: extracting user attributes, pairing them with new prompts, and incorporating them into responses. Two lightweight training-based interventions bring automated evaluation closer to human judgments in the first two stages, but learned reward models correlate only modestly with human ratings of response quality, indicating a need for better methods of incorporating user information.
Key Points
- Collected 550 human conversations, plus human judgments at three stages of personalization.
- Found that models struggle to extract attributes and disagree with humans about which attributes are relevant.
- Personalized responses rated no better than generic ones by humans.
- Introduced two lightweight training-based interventions that bring automated evaluation closer to human data in the first two stages.
- Human-aligned quality judgments are difficult to model directly.
Article Content
From source RSS / original summary: arXiv:2606.06614v1. Abstract: Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data.
We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage.
Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages.
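To make "disagree with human judgments on relevant attributes" concrete, here is a minimal sketch of one way to measure agreement between a model and human annotators on binary relevance labels. The abstract does not say which agreement statistic the authors use; Cohen's kappa is an assumption chosen for illustration, and the function name and the toy labels below are hypothetical, not data from the paper.

```python
from collections import Counter


def cohens_kappa(human, model):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Fraction of items on which the two raters give the same label.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    model_counts = Counter(model)
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters always use the same single label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 1 = attribute is relevant to the prompt, 0 = not relevant.
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 0, 0, 0, 1, 1, 1, 0]
kappa = cohens_kappa(human_labels, model_labels)
print(f"kappa = {kappa:.2f}")
```

A kappa near 0 means the model agrees with humans no more often than chance would predict, while a value near 1 means near-perfect agreement.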
However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.
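As a rough illustration of what "correlation with human ratings" can mean in practice, the sketch below computes a rank correlation between reward-model scores and human ratings. The abstract does not name the correlation measure used; Spearman's rho is an assumption here, and the helper names and the toy numbers are hypothetical rather than results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: human ratings on a 1-5 scale and reward-model scores.
human_ratings = [1, 2, 3, 4, 5]
reward_scores = [0.2, 0.1, 0.6, 0.4, 0.9]
rho = spearman(human_ratings, reward_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure is used in this sketch because human ratings are typically ordinal, so only the ordering of responses is compared rather than the raw score values.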