PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation
Quick Answer
The study introduces Phonetically-Informed Data Augmentation (PiDA), which makes Vietnamese speech translation more robust to ASR substitution errors and improves translation of erroneous ASR outputs by up to +2.04 BLEU over standard fine-tuning.
Quick Take
PiDA uses phonetic word embeddings to generate realistic, ASR-like corruptions by swapping words for phonetically similar alternatives. Fine-tuning the Neural Machine Translation model on data augmented this way makes it more robust to the phonetic substitution errors that ASR systems actually produce.
Key Points
- First systematic categorization of ASR errors in Vietnamese speech translation.
- Phonetic confusions, not random noise, primarily cause ASR substitution errors.
- PiDA improves translation quality on erroneous ASR outputs by up to +2.04 BLEU over standard fine-tuning.
- Fine-tuning on PiDA-augmented data also slightly improves clean-text performance.
- PiDA uses phonetic word embeddings to generate ASR-like corruptions.
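An error analysis of this kind starts from word-level substitution pairs between a reference transcript and the ASR output. The sketch below is only an illustration, not the paper's pipeline: it aligns the two word sequences with Python's standard `difflib.SequenceMatcher` and keeps one-to-one replacements. The function name `substitution_pairs` and the example sentence are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def substitution_pairs(reference: str, hypothesis: str) -> List[Tuple[str, str]]:
    """Return (reference_word, asr_word) pairs for aligned word substitutions."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Assumption: only equal-length "replace" spans count as word-for-word
        # substitutions; insertions, deletions, and uneven spans are skipped.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(ref_words[i1:i2], hyp_words[j1:j2]))
    return pairs


# Made-up example: "sách" (book) misrecognized as the similar-sounding "xách".
print(substitution_pairs("tôi muốn mua sách", "tôi muốn mua xách"))
```

Each extracted pair could then be labeled by phonetic cause (for example, confusable initial consonants or tones). That labeling scheme is the paper's contribution and is not reproduced here.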
Article Excerpt
From source RSS / original summary: arXiv:2606.12911v1. Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling.
We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.
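The substitution step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors in `PHONETIC_EMBEDDINGS` are hand-made toy values standing in for real phonetic word embeddings, and the `rate` and `min_similarity` parameters are hypothetical knobs that do not come from the paper.

```python
import math
import random

# Hypothetical toy "phonetic embeddings": hand-made vectors in which
# similar-sounding words sit close together. Not the paper's embeddings.
PHONETIC_EMBEDDINGS = {
    "sách": [0.90, 0.10, 0.00],
    "xách": [0.88, 0.12, 0.02],
    "mua": [0.10, 0.90, 0.10],
    "mưa": [0.12, 0.88, 0.15],
    "tôi": [0.00, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_phonetic_neighbor(word, embeddings, min_similarity=0.95):
    """Return the most similar other word, or None if nothing is close enough."""
    if word not in embeddings:
        return None
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    if not scored:
        return None
    best_score, best_word = max(scored)
    return best_word if best_score >= min_similarity else None


def corrupt(sentence, embeddings, rate, rng):
    """Swap each word for its phonetic neighbor with probability `rate`."""
    out = []
    for word in sentence.split():
        neighbor = nearest_phonetic_neighbor(word, embeddings)
        if neighbor is not None and rng.random() < rate:
            out.append(neighbor)
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(0)
# With rate=1.0 every word that has a close neighbor is replaced.
print(corrupt("tôi mua sách", PHONETIC_EMBEDDINGS, rate=1.0, rng=rng))  # tôi mưa xách
```

For augmentation, each corrupted source sentence would be paired with the original English target, so the NMT model learns to produce the correct translation from ASR-like input. How clean and corrupted examples are mixed during fine-tuning is not specified in this excerpt.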