Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
Quick Answer
This paper shows that activation steering improves interruption handling in full-duplex spoken language models (FD-SLMs) without any fine-tuning: on PersonaPlex, correctness rises from 28% to 45% and initial-word occurrence rate (IWOR) from 40% to 72%.
Key Points
- FD-SLMs exhibit state inertia: during a user interruption they stay transiently biased toward the generative state and miss the beginning of the incoming speech.
- The Zero-Buffer Benchmark (ZBB) evaluates immediate comprehension during abrupt user speech.
- Activation steering with a perception vector shifts the model's predictive focus toward the perceptive state, with no training required.
- Improvements in interruption handling were observed across multiple state-of-the-art FD-SLMs.
- Activation steering adds little computational overhead.
Article Content
arXiv:2606.11386v1. Abstract: Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored.
We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream.
Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input.
We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead.
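As a rough illustration of what steering with a perception vector could look like, the sketch below builds a direction as the difference between mean hidden states collected while listening and while speaking, then adds a scaled copy of it to a hidden state. This difference-of-means construction is a common activation-steering recipe and is an assumption here, not necessarily the paper's exact procedure; the function names, the toy two-dimensional vectors, and the scale `alpha` are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(states: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def perception_vector(listening: List[Vector], speaking: List[Vector]) -> Vector:
    """Assumed construction: mean listening state minus mean speaking state.

    The result points from the generative state toward the perceptive state.
    """
    mu_listen = mean_vector(listening)
    mu_speak = mean_vector(speaking)
    return [a - b for a, b in zip(mu_listen, mu_speak)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state (no weights are changed)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy hidden states (hypothetical values, 2 dimensions for readability).
listening_states = [[1.0, 0.0], [3.0, 2.0]]  # mean is [2.0, 1.0]
speaking_states = [[0.0, 1.0], [2.0, 3.0]]   # mean is [1.0, 2.0]

v_perception = perception_vector(listening_states, speaking_states)
steered = steer([0.5, 0.5], v_perception, alpha=2.0)
print(v_perception, steered)
```

In a real FD-SLM the same addition would be applied to a chosen layer's activations at inference time, which is why an intervention of this kind needs no fine-tuning and only one extra vector addition per steered layer.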
Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
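To put the reported PersonaPlex numbers in perspective, the short sketch below computes the absolute gain in percentage points and the relative gain for each metric. Only the four percentages come from the abstract; the helper name `gains` and the dictionary layout are illustrative.

```python
from typing import Dict, Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Percentages reported for PersonaPlex: (without steering, with steering).
reported: Dict[str, Tuple[float, float]] = {
    "correctness": (28.0, 45.0),
    "IWOR": (40.0, 72.0),
}

summary = {name: gains(before, after) for name, (before, after) in reported.items()}

for name, (abs_points, rel_percent) in summary.items():
    print(f"{name}: +{abs_points:.0f} points ({rel_percent:.1f}% relative)")
```

Under these figures, correctness gains 17 points (about 61% relative) and IWOR gains 32 points (80% relative).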