Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers
Quick Take
This study finds that large language models (LLMs) struggle to infer user sociodemographics from a single conversation history, and that outcome disparities between sociodemographic groups, while present, are minimal in magnitude. Conversation topics are the strongest predictor of LLM-generated advice and act, to some extent, as proxies for sociodemographic groups, which raises concerns for high-stakes applications.
Key Points
- LLMs show minimal disparities in advice outcomes across sociodemographic groups.
- Conversation topics are the most predictive of LLM-generated advice.
- User sociodemographics inferred from conversation history are often inaccurate.
- Topics often affect advice in unpredictable ways, which highlights the need for further research on how conversational context shapes LLM outputs.
- High-stakes scenarios require careful consideration of conversational context.
Article Content
From source RSS / original summary: arXiv:2606.02776v1. Abstract: When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others.
In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability.
We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.
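To make "most predictive" concrete, here is a minimal sketch of one way to compare how much of the variation in an advice outcome is associated with conversation topic versus sociodemographic group. It is not the authors' method: the abstract does not say which statistic or model they used, so eta squared (the between-group share of variance) is swapped in as a simple stand-in. The records, the field names (`topic`, `group`, `advice_score`), and the helper `eta_squared` are all hypothetical and synthetic.

```python
from collections import defaultdict
from statistics import fmean


def eta_squared(categories, outcomes):
    """Share of outcome variance lying between categories (0.0 to 1.0).

    Illustrative stand-in statistic; the paper's own measure is not
    specified in the abstract.
    """
    grand_mean = fmean(outcomes)
    total_ss = sum((y - grand_mean) ** 2 for y in outcomes)
    if total_ss == 0:
        return 0.0  # constant outcome: nothing to explain
    by_category = defaultdict(list)
    for category, y in zip(categories, outcomes):
        by_category[category].append(y)
    between_ss = sum(
        len(ys) * (fmean(ys) - grand_mean) ** 2 for ys in by_category.values()
    )
    return between_ss / total_ss


# Synthetic toy records, NOT data from the paper. Each record stands for one
# conversation: its topic, the user's (hypothetical) group, and a numeric
# score summarizing the advice the LLM gave.
records = [
    {"topic": "debt", "group": "A", "advice_score": 2.0},
    {"topic": "debt", "group": "B", "advice_score": 2.2},
    {"topic": "debt", "group": "A", "advice_score": 1.8},
    {"topic": "debt", "group": "B", "advice_score": 2.0},
    {"topic": "investing", "group": "A", "advice_score": 4.0},
    {"topic": "investing", "group": "B", "advice_score": 4.2},
    {"topic": "investing", "group": "A", "advice_score": 3.8},
    {"topic": "investing", "group": "B", "advice_score": 4.0},
]

scores = [r["advice_score"] for r in records]
topic_eta = eta_squared([r["topic"] for r in records], scores)
group_eta = eta_squared([r["group"] for r in records], scores)

print(f"variance share associated with topic: {topic_eta:.3f}")
print(f"variance share associated with group: {group_eta:.3f}")
```

In this toy data the topic accounts for far more of the variation than the group does, which mirrors the direction of the reported finding but says nothing about its actual size. In real conversations, topic and group would also be correlated, and that correlation is what lets topics act as partial proxies for sociodemographics.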