Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
Quick Answer
Joint Speech-Text Interleaved Pretraining (JSTIP) is an ASR-oriented pretraining strategy that interleaves speech and text within aligned pairs. In experiments on 38k hours of ASR data, it consistently improves entity accuracy over ASR-only and joint speech-text training baselines.
Quick Take
Using only domain transcription text, JSTIP reaches entity recognition performance on par with training on synthetic speech-text pairs, which simplifies domain adaptation. It is also competitive with open-source ASR and Speech-LLM systems in medical entity recognition.
Key Points
- JSTIP constructs word-level and segment-level interleaved speech-text sequences within aligned pairs, for speech-LLM architectures that accept continuous inputs.
- It achieves consistent entity accuracy improvements over ASR-only and joint speech-text training baselines.
- It is competitive with open-source ASR and Speech-LLM systems in medical entity recognition.
- Zero-shot speech question answering behavior suggests that interleaving reduces the speech-text modality gap and preserves the LLM generative prior.
- Experiments use 38k hours of ASR data.
Article Content
From source RSS / original summary: arXiv:2607.01733v1 (Announce Type: new). Abstract: Speech-LLM integration has shown promising results by leveraging extensive textual pretraining, yet its specific benefits for automatic speech recognition (ASR) remain unclear. We observe that as supervised ASR training data increases, the contribution of LLM priors becomes less evident, and simple speech-text joint training under-utilizes textual knowledge.
We therefore propose Joint Speech-Text Interleaved Pretraining (JSTIP), an ASR-oriented pretraining strategy that constructs word-level and segment-level interleaved speech-text sequences within aligned pairs for speech-LLM architectures that accept continuous inputs. Experiments on 38k hours of ASR data show consistent entity accuracy improvement compared to ASR-only and joint speech-text training baselines.
JSTIP achieves on-par entity recognition performance using domain transcription text compared to synthetic speech-text pairs, simplifying domain adaptation. Benefiting from textual pretraining and domain text data, JSTIP is competitive with open-source ASR and Speech-LLM systems in medical entity recognition.
The zero-shot speech question answering behaviors further suggest that interleaving reduces the speech-text modality gap and preserves the LLM generative prior, which is likely the reason for the entity improvements on the ASR task.
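The abstract describes word-level and segment-level interleaving within aligned speech-text pairs but does not give the construction procedure. The following is a minimal sketch of one way such sequences could be built, not the authors' implementation. The names AlignedWord, interleave, text_prob and segment_size are hypothetical, and the random per-segment choice of modality is an assumption. Each group of segment_size consecutive words is emitted either as text or as a range of speech frames, and adjacent spans of the same modality are merged.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class AlignedWord:
    """One word of a transcript with its (hypothetical) forced-alignment frames."""
    text: str
    start: int  # first speech frame of the word (inclusive)
    end: int    # frame after the last speech frame (exclusive)


# A span is ("text", "some words") or ("speech", (start_frame, end_frame)).
Span = Tuple[str, Union[str, Tuple[int, int]]]


def interleave(words: List[AlignedWord], text_prob: float,
               segment_size: int = 1,
               rng: Optional[random.Random] = None) -> List[Span]:
    """Build an interleaved speech-text sequence from one aligned pair.

    segment_size=1 switches modality at word level; larger values switch
    at segment level. The sampling rule is an assumption for illustration.
    """
    rng = rng or random.Random()
    spans: List[Span] = []
    for i in range(0, len(words), segment_size):
        group = words[i:i + segment_size]
        if rng.random() < text_prob:
            payload = " ".join(w.text for w in group)
            if spans and spans[-1][0] == "text":
                # Merge with the previous text span.
                spans[-1] = ("text", spans[-1][1] + " " + payload)
            else:
                spans.append(("text", payload))
        else:
            start, end = group[0].start, group[-1].end
            if spans and spans[-1][0] == "speech":
                # Extend the previous speech span to cover this group.
                spans[-1] = ("speech", (spans[-1][1][0], end))
            else:
                spans.append(("speech", (start, end)))
    return spans


words = [AlignedWord("the", 0, 8), AlignedWord("patient", 8, 30),
         AlignedWord("takes", 30, 45), AlignedWord("metformin", 45, 80)]
mixed = interleave(words, text_prob=0.5, segment_size=1, rng=random.Random(0))
all_text = interleave(words, text_prob=1.0)
all_speech = interleave(words, text_prob=0.0, segment_size=2)
```

In a speech-LLM that accepts continuous inputs, each "speech" span would presumably be replaced by the encoder features for that frame range and each "text" span by token embeddings. That mapping is outside this sketch.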