What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR
Quick Answer
The study argues that ASR evaluations on atypical speech are often misleading because they conflate two valid transcription references, verbatim and intended, into a single ground truth.
Quick Take
Most ASR evaluations score atypical speech against a single ground truth, which rewards systems that delete disfluencies and ignores verbatim faithfulness. Benchmarking 11 models from the encoder-decoder, CTC, and transducer families on stuttered speech shows that performance and rankings differ between verbatim and intended references, so the reference has to be chosen to match the use case.
Key Points
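- Reports that ASR systems underperform on atypical speech are confounded by the existence of two valid transcription references.
- 11 ASR models from the encoder-decoder, CTC, and transducer families were benchmarked on stuttered speech against both references.
- Verbatim references record the speech as produced, including repetitions and prolongations; intended references give the canonical text with disfluencies removed.
- Model performance and rankings differ depending on which transcription style is used as the reference.
- Choosing the reference that matches the use case is necessary for valid model selection, particularly for atypical ASR.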
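To make the verbatim/intended distinction concrete, here is a minimal sketch that scores two system outputs against both reference styles using word error rate (WER). The utterance, the two hypotheses, and the `word_error_rate` helper are illustrative assumptions, not data or code from the paper, and the paper's exact metric and text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up utterance with a word repetition (not from the paper's data).
verbatim = "i i i want to go home"   # speech as actually produced
intended = "i want to go home"       # canonical form, disfluencies removed

# Two hypothetical system outputs.
hyp_faithful = "i i i want to go home"   # keeps the repetitions
hyp_cleaned = "i want to go home"        # deletes the repetitions

scores = {
    name: {
        "verbatim": word_error_rate(verbatim, hyp),
        "intended": word_error_rate(intended, hyp),
    }
    for name, hyp in [("faithful", hyp_faithful), ("cleaned", hyp_cleaned)]
}

for name, row in scores.items():
    print(f"{name:9s} verbatim={row['verbatim']:.3f} intended={row['intended']:.3f}")
```

In this toy case each output is perfect under one reference and penalized under the other, so an evaluation that uses only the intended reference would favor the system that deletes disfluencies.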
Article Excerpt
From the source RSS summary, arXiv:2606.31112v1. Abstract: ASR systems have often been reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including repetitions/prolongations) and intended (the canonical form of the text with disfluencies removed) in atypical speech recognition depending on context and use-case.
Most ASR evaluations conflate this duality into a single ground truth and reward systems that delete disfluencies, ignoring verbatim faithfulness. We benchmark 11 ASR models from encoder-decoder, CTC and transducer families using both verbatim and intended references on atypical stuttered speech as a case study. Our quantitative assessment underlines the disparity in model performance and rankings using the two transcript styles.
Through this analysis, we highlight the importance of selecting a suitable transcription reference for valid model selection depending on the use-case, particularly for atypical ASR.
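One way to check whether the choice of reference affects model selection is to rank the same models under each reference and list the pairs whose order flips. The sketch below assumes per-model WER has already been computed; the model names and numbers are placeholders for illustration only, not results from the paper, and `rank_models` and `discordant_pairs` are hypothetical helpers.

```python
from itertools import combinations


def rank_models(wer_by_model: dict[str, float]) -> list[str]:
    """Return model names ordered from lowest to highest WER."""
    return sorted(wer_by_model, key=wer_by_model.get)


def discordant_pairs(wer_a: dict[str, float],
                     wer_b: dict[str, float]) -> list[tuple[str, str]]:
    """Model pairs whose relative order differs between two WER tables."""
    flipped = []
    for m1, m2 in combinations(sorted(wer_a), 2):
        # A negative product means the sign of the WER gap changed.
        if (wer_a[m1] - wer_a[m2]) * (wer_b[m1] - wer_b[m2]) < 0:
            flipped.append((m1, m2))
    return flipped


# Placeholder numbers for illustration only; NOT results from the paper.
wer_verbatim = {"model_a": 0.18, "model_b": 0.25, "model_c": 0.31}
wer_intended = {"model_a": 0.22, "model_b": 0.12, "model_c": 0.27}

ranking_verbatim = rank_models(wer_verbatim)
ranking_intended = rank_models(wer_intended)
flips = discordant_pairs(wer_verbatim, wer_intended)

print("verbatim ranking:", ranking_verbatim)
print("intended ranking:", ranking_intended)
print("order flips:", flips)
```

If the list of flipped pairs is non-empty, the "best" model depends on the reference, which is the situation the abstract warns about for atypical ASR.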