SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings
Quick Take
SPARCLE is a speaker-aware grapheme representation model for text-to-speech (TTS). It is trained with a contrastive objective to align graphemes with Wav2Vec2 acoustic representations while conditioned on speaker identity, and it is intended as a replacement for grapheme-to-phoneme (G2P) systems. The authors report that it halves word error rates in extreme low-resource settings compared to standard grapheme-based models.
Key Points
- SPARCLE enhances grapheme modeling by incorporating speaker-specific acoustic variations.
- The model is trained with a contrastive objective that aligns graphemes with Wav2Vec2 acoustic representations, conditioned on speaker identity.
- It serves as a replacement for G2P systems in downstream TTS tasks.
- Word error rates are halved in extreme low-resource settings compared to standard grapheme-based models.
- Prior work shows that grapheme-based models outperform phoneme-based systems at scale but not in low-resource settings, which is the gap SPARCLE targets.
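The contrastive alignment named in the key points can be illustrated with a small sketch. This is not the authors' implementation: it assumes an InfoNCE-style loss, additive speaker conditioning, and toy three-dimensional vectors standing in for real Wav2Vec2 features. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def condition_on_speaker(grapheme_vec, speaker_vec):
    # Assumption: additive conditioning. The paper may use another mechanism.
    return [g + s for g, s in zip(grapheme_vec, speaker_vec)]

def info_nce(text_vecs, audio_vecs, temperature=0.1):
    """Mean InfoNCE loss where text_vecs[i] and audio_vecs[i] form the positive pair."""
    text = [l2_normalize(v) for v in text_vecs]
    audio = [l2_normalize(v) for v in audio_vecs]
    total = 0.0
    for i, t in enumerate(text):
        # Cosine similarity of this grapheme to every acoustic vector in the batch.
        logits = [dot(t, a) / temperature for a in audio]
        # Numerically stable log-sum-exp over positives and negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(text)

# Toy data: two graphemes, one speaker, two acoustic vectors (hypothetical values).
speaker = [0.1, 0.0, -0.1]
graphemes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
acoustic = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2]]

conditioned = [condition_on_speaker(g, speaker) for g in graphemes]
aligned_loss = info_nce(conditioned, acoustic)
shuffled_loss = info_nce(conditioned, acoustic[::-1])
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when each grapheme is paired with its own acoustic vector than when the pairs are swapped, which is the behavior a contrastive objective rewards during training.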
Article Excerpt
From source RSS / original summary
arXiv:2607.01238v1
Abstract: Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture speaker-specific acoustic variation. Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings.
In this paper, we propose SPARCLE, a speaker-aware grapheme representation model that enriches characters with their precise acoustic realizations. SPARCLE is trained with a contrastive objective to align graphemes with corresponding Wav2Vec2 acoustic representations while conditioned on speaker identity. The resulting model serves as a replacement to G2P systems for downstream text-to-speech (TTS) tasks.
We demonstrate that SPARCLE improves generation quality, reducing word error rates by half in extreme low-resource settings compared to standard grapheme-based models.
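Word error rate (WER), the metric behind the halving claim, is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below shows that standard definition. The excerpt does not describe the paper's evaluation pipeline, so the function name and example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```

Under this definition, halving WER means the synthesized speech, once transcribed by a recognizer, contains half as many word-level errors per reference word.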