Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems
Quick Answer
The study introduces a percentile-based evaluation protocol for speech-to-speech AI agents, using over 4000 hours of conversation data to assess prosody and rhythm.
Quick Take
The study introduces a percentile-based evaluation protocol for speech-to-speech AI agents, using over 4000 hours of conversation data to assess prosody and rhythm. This method improves the calibration of evaluation metrics like $F_0$ expressivity and speech rate, yielding more interpretable results compared to pooled human statistics.
Key Points
- Developed matched reference regimes for key speech metrics including $F_0$ and speech rate.
- Utilized over 4000 hours of dyadic English conversation from the Seamless Interaction dataset.
- Percentile-based evaluation protocol yields flag rates closer to the nominal 10%.
- Matched references provide interpretable deviation directions for speech outputs.
- Complements existing perceptual and user-centered evaluation methods.
Paper Resources
Article Content
From source RSS / original summaryarXiv:2606. 31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output.
Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags.
On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Quantifying Prior Dominance in Systems
The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.