Continuous Audio Thinking for Large Audio Language Models
Quick Answer
The Continuous Audio Thinking (CoAT) framework enhances large audio language models (LALMs) such as Qwen2-Audio and Audio Flamingo 3 by preserving acoustic information ahead of response generation, improving performance on audio reasoning and transcription tasks with no additional autoregressive decoding cost.
Quick Take
CoAT gives LALMs a continuous latent workspace, grounded by distillation from audio experts, in which to organize acoustic information before answering. The paper reports gains for Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 on a benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription, with the thinking block processed in a single prefill.
Key Points
- CoAT introduces a continuous latent workspace for organizing acoustic information.
- Performance improvements are reported for Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 across multiple benchmarks.
- No additional autoregressive decoding cost over the baseline, because the thinking block is processed in a single prefill.
- Expert distillation enhances the model's ability to leverage rich acoustic details.
- Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.
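The distillation idea in the points above can be illustrated with a toy objective. This is only a sketch under assumptions: the summary does not state the paper's actual loss, so the mean-squared-error form, the weight `lam`, and the names `distill_loss` and `total_loss` are all hypothetical. It shows an auxiliary term that is computed only at the thinking positions and added to the usual text loss.

```python
def distill_loss(thinking_states, expert_targets):
    """Mean squared error between hidden states at the thinking positions
    and audio-expert feature vectors (assumed form, not from the paper)."""
    assert len(thinking_states) == len(expert_targets)
    total, count = 0.0, 0
    for h, e in zip(thinking_states, expert_targets):
        assert len(h) == len(e)
        for h_i, e_i in zip(h, e):
            total += (h_i - e_i) ** 2
            count += 1
    return total / count


def total_loss(text_loss, thinking_states, expert_targets, lam=1.0):
    """Text loss on the response plus a weighted auxiliary term that
    only touches the thinking positions (lam is a hypothetical weight)."""
    return text_loss + lam * distill_loss(thinking_states, expert_targets)


# Two thinking positions with 3-dimensional states (toy numbers).
states = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
aux = distill_loss(states, targets)
loss = total_loss(2.0, states, targets, lam=0.5)
```

In a real model the states would come from the LALM's hidden layer at the thinking slots and the targets from one or more audio expert encoders; both are replaced by plain lists here to keep the example self-contained.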
Article Content
From source RSS / original summary (arXiv:2606.18273v1). Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information.
As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts.
Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.
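The decoding-cost claim in the abstract can be made concrete with a simple count of forward passes. This is a rough accounting sketch, not the paper's measurement: the function `forward_passes` and the token counts are hypothetical, and it assumes one prefill pass over the prompt followed by one decode step per generated token. Continuous thinking slots are treated as part of the prefill, whereas a textual chain of thought would have to be generated token by token.

```python
def forward_passes(num_response_tokens, num_text_thinking_tokens=0):
    """Count forward passes: one prefill over the prompt (audio,
    instruction, and any continuous thinking slots) plus one
    autoregressive decode step per generated token."""
    prefill = 1
    decode_steps = num_text_thinking_tokens + num_response_tokens
    return prefill + decode_steps


# Baseline: the model answers directly with 50 response tokens.
baseline = forward_passes(num_response_tokens=50)

# Continuous thinking: the slots ride along in the single prefill,
# so the number of decode steps is unchanged.
continuous = forward_passes(num_response_tokens=50)

# Textual thinking, for contrast: 32 reasoning tokens are decoded
# one at a time before the 50-token response.
text_cot = forward_passes(num_response_tokens=50, num_text_thinking_tokens=32)
```

Note that this count ignores the cost of the prefill itself, which grows with the number of thinking slots; the abstract's claim concerns autoregressive decoding steps only.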