Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles
Quick Answer
The study investigates the causal effect of topic sentiment on perceived political ideology using annotations from human experts and models like GPT-4o-mini and Llama-3.3-70B.
Quick Take
Fine-tuned GPT-4o-mini achieved the highest classification accuracy (F1=72.48) and was the only annotator to produce significant community-level treatment effects. The authors read this as a spurious sentiment-ideology coupling that is absent from human judgment.
Key Points
- Human annotations show no significant causal effects at the community level.
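- Fine-tuned GPT-4o-mini achieves the highest classification accuracy of the annotators compared (F1=72.48).
- Only fine-tuned GPT-4o-mini produced significant community-level treatment effects and significant natural direct effects (NDEs) in mediation analysis.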
- The findings suggest shortcut learning in LLMs, impacting their use as proxies for human judgment.
- Implications arise for using LLM annotations as silver labels in causal analyses.
Article Content
From source RSS / original summary: arXiv:2606.06715v1. Abstract: We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and finetuned), and Llama-3.3-70B.
We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation.
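As a rough illustration of what a DML estimate involves, the sketch below implements the standard partialling-out estimator with cross-fitting for a partially linear model. It is not the authors' pipeline: the nuisance learners are plain least squares, the data are simulated, and the names `ols_fit_predict`, `dml_partial_linear`, and `simulate` are hypothetical. Here the treatment `t` stands in for a topic-sentiment score and the outcome `y` for an ideology score.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept on one fold, predict on another."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

def dml_partial_linear(y, t, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of the effect of t on y given X.

    Returns (theta, se): the point estimate and a robust standard error.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    t_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Residualise outcome and treatment using models fit on the other folds.
        y_res[test_idx] = y[test_idx] - ols_fit_predict(X[train], y[train], X[test_idx])
        t_res[test_idx] = t[test_idx] - ols_fit_predict(X[train], t[train], X[test_idx])
    # Regress outcome residuals on treatment residuals.
    theta = (t_res @ y_res) / (t_res @ t_res)
    psi = t_res * (y_res - theta * t_res)
    se = np.sqrt(np.sum(psi ** 2)) / (t_res @ t_res)
    return theta, se

def simulate(n=4000, effect=0.0, seed=1):
    """Simulated data (an assumption for illustration, not the AllSides corpus)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))  # confounders, e.g. topic features
    t = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
    y = effect * t + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
    return y, t, X

if __name__ == "__main__":
    for true_effect in (0.0, 0.4):
        theta, se = dml_partial_linear(*simulate(effect=true_effect))
        print(f"true={true_effect:.1f} estimate={theta:.3f} se={se:.3f}")
```

In this framing, a "significant treatment effect" corresponds to an estimate whose confidence interval (roughly `theta ± 1.96 * se`) excludes zero. The same estimator would be run once per annotation paradigm, changing only the outcome labels.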
We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment-ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.
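A toy example may help show why such a coupling can escape F1. The sketch below uses eight hand-constructed items (not data from the paper) with binary ideology labels and binary sentiment, and two hypothetical annotators whose confusion counts are identical. The helper names `f1_score` and `sentiment_gap` are illustrative, and the binary F1 here is a simplification of whatever averaging the paper reports.

```python
def f1_score(gold, pred, positive=1):
    """Binary F1 for the `positive` class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn)

def sentiment_gap(pred, sentiment, positive=1):
    """Rate of `positive` labels on positive-sentiment items minus the
    rate on negative-sentiment items (0.0 means no association)."""
    pos = [p == positive for p, s in zip(pred, sentiment) if s == 1]
    neg = [p == positive for p, s in zip(pred, sentiment) if s == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

# Toy data: 1 = "right", 0 = "left"; sentiment 1 = positive, 0 = negative.
gold      = [1, 1, 1, 1, 0, 0, 0, 0]
sentiment = [1, 0, 1, 0, 1, 0, 1, 0]

# Annotator A: one false negative and one false positive, unrelated to sentiment.
pred_a = [0, 1, 1, 1, 1, 0, 0, 0]
# Annotator B: the same number of errors, but each one follows sentiment.
pred_b = [1, 0, 1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    for name, pred in (("A", pred_a), ("B", pred_b)):
        print(name, f1_score(gold, pred), sentiment_gap(pred, sentiment))
```

Both annotators make one false positive and one false negative, so F1 cannot tell them apart. Only annotator B's labels track sentiment, which is the kind of difference a causal analysis conditions on and an accuracy metric does not.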