Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs
Quick Answer
The study introduces the Moral Trolley Arena benchmark to assess how frontier LLMs compose moral evidence across multiple scenarios.
Quick Take
Results show that composite moral judgments track the strength of the individual acts, but the relationship is compressed rather than simply additive. This suggests that moral audits of AI models should measure how moral evidence is combined, not only how isolated acts are ranked.
Key Points
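- Moral Trolley Arena is a two-stage benchmark: individual acts are calibrated from a 229-scenario corpus, then combined into two-act items.
- Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is compressed rather than simply additive.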
- Models show non-additive intensity anchoring and bounded foundation-specific residuals.
- Results indicate a need for measuring composition rules in moral audits of AI.
Article Excerpt
arXiv:2606.11232v1. Abstract: Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence.
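For readers unfamiliar with Elo-style rating, the following is a minimal sketch of a standard Elo update applied to pairwise preferences. It is not the paper's implementation: the K-factor of 32, the 400-point scale, the starting rating of 1000, and the act labels are all assumptions chosen for illustration.

```python
# Minimal sketch of a standard Elo update over pairwise preferences.
# Assumptions (not from the paper): k=32, scale=400, initial rating 1000,
# and the hypothetical act labels used in the usage note below.

def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that item A is preferred over item B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_items(comparisons, initial=1000.0, k=32.0):
    """Process (winner, loser) pairs in order and return a rating per item."""
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = elo_update(r_w, r_l, 1.0, k)
    return ratings
```

As a usage illustration, `rate_items([("act_x", "act_y"), ("act_x", "act_z")])` would rate the hypothetical `act_x` above the other two. In a blind setup, the judging model would see only the scenario text, never the labels or current ratings.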
The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive.
Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
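One possible way to read "compressed rather than simply additive" is as a regression slope: if composite strength were the plain sum of its two component strengths, regressing the composite rating on that sum would give a slope near 1, while a slope below 1 would indicate compression. The sketch below assumes this reading. The function names and the numbers in the usage note are hypothetical and synthetic; they are not results from the paper.

```python
# Sketch: estimate a "compression slope" by ordinary least squares.
# Assumption (not from the paper): composite ~ slope * (a + b) + intercept,
# with ratings centred so that 0 is a neutral act.

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def compression_slope(items):
    """items: iterable of (strength_a, strength_b, composite_rating) tuples.

    Returns (slope, intercept) of composite rating against the additive
    prediction strength_a + strength_b.
    """
    items = list(items)
    additive = [a + b for a, b, _ in items]
    composite = [c for _, _, c in items]
    return fit_line(additive, composite)
```

For example, with the synthetic items `[(-100, -50, -90), (0, 50, 30), (100, 100, 120), (-100, 100, 0)]`, each composite is 0.6 times the component sum, so the fitted slope comes out at about 0.6, which under this reading would count as compressed. Residuals from such a fit, grouped by foundation, would be one way to look for the foundation-specific effects the abstract mentions.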