Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

arXiv cs.CL·Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim


·~1 min·5/28/2026·en·2

Quick Take

This paper presents techniques for fine-grained style control in existing prompt-based TTS models. Inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz of pitch variation, and up to 1.6 syllables-per-second of speed change. Intra-utterance transitions maintain speaker similarity scores of 0.81-0.91 and reach perceptual smoothness scores of 3.48-4.48.

Key Points

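Achieves smooth inter-utterance style interpolation by computing direction vectors between contrastive style prompts in embedding space.
Identifies a strong attention bias toward early tokens in autoregressive TTS decoders, and introduces KV-cache swapping and sliding-window attention masking to mitigate it for intra-utterance transitions.
Demonstrates a 99-100% gender conversion success rate, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change.
Maintains speaker similarity scores between 0.81 and 0.91 during intra-utterance transitions.
Perceptual smoothness scores for intra-utterance transitions range from 3.48 to 4.48.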

Article Content

From source RSS / original summary

arXiv:2605.27376v1. Abstract: While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance.

In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics.
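The direction-vector idea can be illustrated with a minimal sketch. It assumes each style prompt has already been encoded to a fixed-size embedding; the toy 4-dimensional vectors, the function names, and the linear `alpha` scaling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def style_direction(emb_a, emb_b):
    """Direction vector pointing from style A to style B in embedding space."""
    return np.asarray(emb_b, dtype=float) - np.asarray(emb_a, dtype=float)

def interpolate_style(base_emb, direction, alpha):
    """Shift a base embedding along the style direction by a factor alpha."""
    return np.asarray(base_emb, dtype=float) + alpha * direction

# Hypothetical embeddings standing in for two contrastive prompts,
# e.g. "low-pitched voice" and "high-pitched voice".
low_pitch = np.array([0.0, 1.0, 0.0, 2.0])
high_pitch = np.array([1.0, 1.0, 2.0, 2.0])

direction = style_direction(low_pitch, high_pitch)

# Five evenly spaced points between the two styles (alpha = 0, 0.25, ..., 1).
steps = [interpolate_style(low_pitch, direction, a) for a in np.linspace(0.0, 1.0, 5)]
```

With `alpha = 0` the result is the first prompt's embedding and with `alpha = 1` it is the second's; intermediate values give the in-between conditioning vectors that would be passed to the TTS model in place of a single prompt embedding.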

For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change.

Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
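The abstract does not spell out the masking scheme, so the following is only a generic sketch of sliding-window causal attention masking, the standard technique the name suggests. The window size, function names, and uniform toy scores are assumptions; how the authors combine the mask with KV-cache swapping is not described in this summary and is not modeled here.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Each position sees itself and at most the (window - 1) positions before it,
    so early tokens drop out of view as generation proceeds.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def masked_softmax(scores, mask):
    """Softmax over the last axis with masked-out positions given zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = sliding_window_causal_mask(seq_len=6, window=3)

# Uniform toy scores: every visible key receives equal attention weight.
weights = masked_softmax(np.zeros((6, 6)), mask)
```

In this toy setup the last position attends only to positions 3 through 5, so the earliest tokens receive zero weight. That is the property that would limit how much the initial audio realization can dominate later generation.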


