Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions
Quick Take
The study evaluates nine ASR models from three families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. The fine-tuned Whisper-medium model performed best, with a WER of 5.54% on JASMIN and 70.37% on the noisier DART data. By comparing ASR output with the original read prompt, 42.0% of JASMIN utterances and 18.1% of DART utterances were identified as correctly pronounced with high confidence, reducing the need for manual verification.
Key Points
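- The fine-tuned Whisper-medium model outperformed the other eight models, with a WER of 5.54% on JASMIN.
- The noisy DART dataset was far more challenging: the same model reached a WER of 70.37%.
- 42.0% of JASMIN utterances and 18.1% of DART utterances were automatically identified as correctly pronounced.
- The utterance-level selection method achieved precisions of 98.3% and higher.
- Reliable child speech ASR remains difficult in low-resource languages because child-specific pre-trained models are scarce and noise conditions are highly diverse.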
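The WER figures in the key points follow the standard word error rate definition: the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcription, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the authors' evaluation code. The function name, the lowercasing step, and the example sentence are assumptions; the summary does not say how the text was normalised before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; the paper's actual
    # text normalisation is not described in the summary.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from the output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentence only (not taken from JASMIN or DART):
# one of six reference words is missing from the hypothesis.
example = word_error_rate("de kat zit op de mat", "de kat zit op mat")
```

Read this way, a WER of 5.54% corresponds to roughly five or six word errors per 100 reference words, while 70.37% means that most reference words are affected.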
Article Content
From the original summary (arXiv:2605.28833v1). Abstract: Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions.
This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART.
Research question 1 examines the performance of ASR-models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging.
Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% [for JASMIN] and 18.1% [for DART] of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates on an utterance level (precisions of 98.3% and higher) and reducing the need for manual verification.
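The selection idea in research question 2 can be sketched as follows: keep an utterance only when the ASR output agrees with the prompt the child was asked to read, then check how often the kept utterances were in fact read correctly. This is a hedged illustration, not the paper's implementation. The exact-match criterion after normalisation, the field names (`prompt`, `asr`, `manually_correct`), and the example sentences are all assumptions, since the abstract only says that ASR output is compared with the original read prompt.

```python
import re


def normalise(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalisation)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def select_matching(utterances: list[dict]) -> list[dict]:
    """Keep utterances whose ASR output equals the read prompt after normalisation."""
    return [u for u in utterances if normalise(u["prompt"]) == normalise(u["asr"])]


def precision(selected: list[dict]) -> float:
    """Share of selected utterances that a human annotator also judged correct."""
    if not selected:
        raise ValueError("no utterances were selected")
    return sum(1 for u in selected if u["manually_correct"]) / len(selected)


# Hypothetical records for illustration; not drawn from JASMIN or DART.
utterances = [
    {"prompt": "De kat zit op de mat.", "asr": "de kat zit op de mat", "manually_correct": True},
    {"prompt": "Ik loop naar school.", "asr": "ik loop naar de school", "manually_correct": False},
    {"prompt": "Het regent vandaag.", "asr": "Het regent vandaag", "manually_correct": True},
]

selected = select_matching(utterances)
selection_rate = len(selected) / len(utterances)  # share kept without manual checking
selection_precision = precision(selected)         # how trustworthy the kept share is
```

The two numbers play the roles reported in the abstract: the selection rate corresponds to the 42.0% and 18.1% figures, and the precision corresponds to the 98.3% and higher figure. A strict match criterion trades a smaller selected subset for fewer wrongly accepted utterances.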