Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English
Quick Take
This study probes wav2vec 2.0 and Whisper to analyze how consonant cluster reduction (CCR) in African American English (AAE) is represented. Both models distinguish reduced from canonical forms with high accuracy, and reduced segments retain cues to their underlying stops, which indicates that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. CCR is also a known source of automatic speech recognition (ASR) disparity.
Key Points
- Speaker-independent, layer-wise probing of wav2vec2-base and Whisper-small examines how CCR in AAE is represented.
- Two probing tasks are used: segmental reduction detection and segmental restoration of underlying cluster identity.
- Both models distinguish reduced from canonical forms with high accuracy.
- Reduced segments retain cues to their underlying stops, so CCR is encoded as structured gradient phonological variation rather than simple segmental deletion.
- CCR is a widespread phonological process in AAE and a source of ASR disparity, which makes its representation in modern speech models relevant to ASR.
Article Excerpt
From source RSS / original summary
arXiv:2606.23948v1 Announce Type: new
Abstract: Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity.
To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion.
These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.
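The abstract does not specify the probe architecture or the data split procedure, so the following is only a minimal sketch of what speaker-independent layer-wise probing can look like. It assumes per-layer feature vectors have already been extracted for each token, and it swaps in a nearest-centroid classifier as a stand-in probe; the function names (`speaker_independent_split`, `fit_centroids`, `predict`, `probe_layers`) and the toy data are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def speaker_independent_split(speakers, test_fraction=0.25, seed=0):
    """Split token indices so that no speaker appears in both train and test."""
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train_idx = [i for i, s in enumerate(speakers) if s not in held_out]
    test_idx = [i for i, s in enumerate(speakers) if s in held_out]
    return train_idx, test_idx


def fit_centroids(vectors, labels):
    """Compute one mean vector per class (a stand-in for a trained probe)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [a / counts[label] for a in total]
            for label, total in sums.items()}


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def probe_layers(layer_features, labels, speakers, test_fraction=0.25, seed=0):
    """Fit one probe per layer and return held-out-speaker accuracy per layer.

    layer_features[l][i] is the feature vector of token i at layer l.
    """
    train_idx, test_idx = speaker_independent_split(speakers, test_fraction, seed)
    accuracies = []
    for feats in layer_features:
        centroids = fit_centroids([feats[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
        correct = sum(predict(centroids, feats[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


if __name__ == "__main__":
    # Toy data (hypothetical): 4 speakers, 4 tokens each, balanced labels.
    speakers = [f"spk{k}" for k in range(4) for _ in range(4)]
    labels = ["reduced", "canonical"] * 8
    uninformative = [[0.0, 0.0] for _ in labels]
    informative = [[1.0, 0.0] if y == "reduced" else [-1.0, 0.0] for y in labels]
    accs = probe_layers([uninformative, informative], labels, speakers)
    for layer, acc in enumerate(accs):
        print(f"layer {layer}: accuracy {acc:.2f}")
```

Splitting by speaker rather than by token keeps the probe from scoring well merely by memorizing speaker identity. Comparing accuracy across layers then shows where in the network the reduced versus canonical distinction becomes recoverable. In practice the toy vectors would be replaced by pooled hidden states from each model layer, and the centroid probe by whatever classifier the study actually uses.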