Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

6/4/2026


Quick Answer

This paper shows that discourse-role labels strongly influence language model behavior: with the misleading content held fixed, the Misleading Adoption Rate shifts by 56-84 percentage points depending on the label, across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct.

Quick Take

Labels like 'Instruction:' and 'Reference:' increase reliance on incorrect options, while 'Example:' suppresses it. This highlights the need for context-utilization benchmarks to control for presentation choices.
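To make the idea of controlling for presentation concrete, here is a minimal sketch of a paired fixed-content probe in Python. The five labels come from the paper's abstract; the prompt template, the function name `build_variants`, and the sample item are hypothetical, since the paper's exact wording is not given here.

```python
# Labels named in the paper's abstract.
LABELS = ["Reference:", "Evidence:", "Instruction:", "Note:", "Example:"]


def build_variants(question, options, assertion, labels=LABELS):
    """Return {label: prompt}, where only the label differs between prompts.

    The template below is an assumption for illustration, not the paper's.
    """
    option_lines = "\n".join(f"{key}. {text}" for key, text in options.items())
    return {
        label: f"{label} {assertion}\n\nQuestion: {question}\n{option_lines}\nAnswer:"
        for label in labels
    }


# Hypothetical item: the assertion points at a wrong option (B).
variants = build_variants(
    question="Which planet is closest to the Sun?",
    options={"A": "Mercury", "B": "Venus"},
    assertion="The planet closest to the Sun is Venus.",
)
print(variants["Example:"])
```

Because every variant shares the same question, options, and assertion, any difference in how often the model picks the wrong option can be attributed to the label alone.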

Key Points

Misleading Adoption Rate shifts by 56-84 percentage points across labels in the tested models.
Labels like 'Instruction:' and 'Reference:' lead to higher reliance on incorrect answers.
The 'Example:' label consistently suppresses misleading adoption.
Boundary probes reveal that context shapes the effect of labels on adoption.
A manual audit confirms the stability of short-answer contrasts under conservative adjudication.
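The adoption metric behind these points can be sketched as follows, assuming it is the share of items on which the model outputs the injected wrong option, and that the reported shift is the gap between the highest and lowest per-label rate. That reading, the function names, and the toy records are assumptions for illustration; the numbers are made up and are not results from the paper.

```python
from collections import defaultdict


def misleading_adoption_rate(records):
    """Per-label adoption rate in percent.

    records: iterable of (label, model_answer, injected_wrong_option).
    An item counts as adopted when the model's answer matches the injected option.
    """
    adopted = defaultdict(int)
    total = defaultdict(int)
    for label, answer, injected in records:
        total[label] += 1
        if answer.strip().upper() == injected.strip().upper():
            adopted[label] += 1
    return {label: 100.0 * adopted[label] / total[label] for label in total}


def label_shift_pp(mar_by_label):
    """Gap in percentage points between the highest and lowest per-label rate."""
    return max(mar_by_label.values()) - min(mar_by_label.values())


# Toy data only: four items per label, with the injected wrong option "B".
records = [
    ("Instruction:", "B", "B"), ("Instruction:", "B", "B"),
    ("Instruction:", "b", "B"), ("Instruction:", "A", "B"),
    ("Example:", "B", "B"), ("Example:", "A", "B"),
    ("Example:", "C", "B"), ("Example:", "A", "B"),
]
mar = misleading_adoption_rate(records)
shift = label_shift_pp(mar)
print(mar, shift)
```

On this toy input the 'Instruction:' rate is 75.0 and the 'Example:' rate is 25.0, a shift of 50.0 percentage points.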

Source Excerpt

From the original publisher, up to about 700 characters

arXiv:2606.04109v1. Abstract: Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored.

We introduce a paired fixed-content probe over 500 items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. …
