Are you speaking my languages? On spoken language adherence in multimodal LLMs
Quick Answer
LLM-based ASR systems often transcribe speech in the wrong output language. This study proposes a soft prompting method that hints at likely spoken languages without strictly constraining the output.
Quick Take
Three strategies (zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning) are evaluated for their effectiveness in reducing language violations while maintaining ASR performance across multiple languages.
Key Points
- Proposes a soft prompting approach to enhance multilingual ASR performance.
- Introduces a novel metric to quantify language adherence violations.
- Evaluates zero-shot prompting, supervised fine-tuning, and CoT reasoning.
- Discusses trade-offs to guide strategy selection under different compute constraints.
- Compares how well each strategy reduces language violations across multiple languages while maintaining overall ASR performance.
Article Excerpt
From source RSS / original summary
arXiv:2606.17281v1 (announce type: new)
Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output.
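To make the idea of a soft prompt concrete, here is a minimal Python sketch. The function name `build_soft_prompt`, the prompt wording, and the example language names are illustrative assumptions; the abstract does not give the paper's actual prompt template.

```python
def build_soft_prompt(candidate_languages):
    """Build a hint-style instruction (hypothetical wording, not from the paper)."""
    # Deduplicate while preserving order so the hint stays stable.
    seen = []
    for lang in candidate_languages:
        if lang not in seen:
            seen.append(lang)
    if not seen:
        # No hint available: fall back to an unconstrained instruction.
        return "Transcribe the audio in the language that is spoken."
    hint = ", ".join(seen)
    # The wording suggests languages rather than forcing one, so mixed
    # (code-switched) speech is still allowed.
    return (
        f"The speaker is probably using one or more of: {hint}. "
        "Transcribe the audio in the language(s) actually spoken."
    )


prompt = build_soft_prompt(["German", "English", "German"])
print(prompt)
```

The point of the sketch is the contrast with a hard constraint such as "Transcribe in German": the hint narrows the model's expectations but leaves room for code-switching.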
We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding.
We present a comparative analysis of these methods across multiple languages, evaluating their effectiveness in reducing language violations while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.
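The abstract mentions a metric for language adherence violations but does not define it. The sketch below shows one plausible way such a rate could be computed. It is an assumption, not the paper's metric. It assumes each transcript has already been labeled with its detected languages by some external language identifier, and it counts an utterance as a violation when the output contains any language outside the hinted set.

```python
def language_violation_rate(hinted, detected):
    """Fraction of utterances whose output uses a language outside the hint set.

    hinted:   list of sets of language codes offered in the prompt
    detected: list of sets of language codes found in each transcript
              (assumed to come from an external language identifier)
    """
    if len(hinted) != len(detected):
        raise ValueError("hinted and detected must have the same length")
    if not hinted:
        return 0.0
    # An utterance adheres when every detected language was among the hints.
    violations = sum(
        1 for h, d in zip(hinted, detected) if not set(d) <= set(h)
    )
    return violations / len(hinted)


# Hypothetical data: four utterances, one transcribed in an unhinted language.
hinted = [{"en", "de"}, {"en", "de"}, {"fr"}, {"en", "es"}]
detected = [{"en"}, {"nl"}, {"fr"}, {"en", "es"}]
rate = language_violation_rate(hinted, detected)
print(rate)
```

Under this definition a code-switched transcript such as the last one still counts as adherent, because both of its languages were hinted. In practice a rate like this would be reported alongside a standard ASR error measure, since the abstract stresses reducing violations without degrading overall recognition.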