FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

Yoonhyung Lee,  Hyunsin Park,  Jinhwan Park,  Jinkyu Lee
Qualcomm AI Research

Model Architecture

FC-TTS architecture: a two-stage log-mel generation pipeline conditioned on separate timbre and style references via FACodec representations.

Abstract

Recent advances in zero-shot text-to-speech (TTS) have enabled accurate imitation of reference speech in terms of both speaking style and speaker timbre. However, achieving disentangled control over these aspects from separate references remains a challenging task. Several studies have proposed disentangled speech representations that decompose speech into interpretable attributes (e.g., timbre, prosody, and content), providing a promising foundation for TTS with attribute control from separate references. Yet, how to effectively integrate such representations into TTS systems to achieve independent and precise control remains underexplored.

In this paper, we present FC-TTS, a zero-shot TTS framework that enables disentangled control of style and timbre by conditioning on two distinct reference utterances. Unlike existing systems, which inherit the limitations of the pre-trained disentangled representations they build on, FC-TTS introduces key design strategies (architectural choices, a training framework, and auxiliary training objectives) that improve the reliability of attribute separation and dual-reference control. Experiments show that FC-TTS achieves high-fidelity synthesis and competitive zero-shot naturalness, while uniquely supporting consistent and independent manipulation of style and timbre.

1. Zero-shot TTS

Each reference provides both timbre and style; the generated sample is synthesized by FC-TTS conditioned on that single reference.

Reference samples are shown as dataset identifiers rather than embedded audio. Users can access the original recordings via the publicly available dataset (LibriSpeech).

# | Reference | Generated (FC-TTS)
1 | LibriSpeech/test-clean/7021/85628/7021-85628-0005.flac | [audio]
2 | LibriSpeech/test-clean/8455/210777/8455-210777-0056.flac | [audio]
3 | LibriSpeech/test-clean/1089/134691/1089-134691-0006.flac | [audio]
4 | LibriSpeech/test-clean/4992/23283/4992-23283-0019.flac | [audio]
5 | LibriSpeech/test-clean/3570/5695/3570-5695-0006.flac | [audio]
6 | LibriSpeech/test-clean/6829/68771/6829-68771-0029.flac | [audio]
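Because the references are given as dataset identifiers, a reader needs to map each one to a file in a local copy of LibriSpeech. The following is a minimal sketch of that lookup, assuming the standard extracted layout `LibriSpeech/<subset>/<speaker>/<chapter>/<speaker>-<chapter>-<index>.flac`. The helper names `parse_librispeech_id` and `local_path` and the root directory are hypothetical and are not part of FC-TTS.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class LibriSpeechId(NamedTuple):
    subset: str     # e.g. "test-clean"
    speaker: str    # speaker ID directory
    chapter: str    # chapter ID directory
    utterance: str  # file stem, "<speaker>-<chapter>-<index>"


def parse_librispeech_id(identifier: str) -> LibriSpeechId:
    """Split an identifier into subset, speaker, chapter, and utterance."""
    # Tolerate a leading "LibriSpeech" root, with or without a separator.
    relative = identifier.removeprefix("LibriSpeech").lstrip("/")
    parts = PurePosixPath(relative).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    subset, speaker, chapter, filename = parts
    stem = PurePosixPath(filename).stem
    # The file stem repeats the speaker and chapter IDs; check that they agree.
    if stem.split("-")[:2] != [speaker, chapter]:
        raise ValueError(f"file name does not match its directories: {identifier!r}")
    return LibriSpeechId(subset, speaker, chapter, stem)


def local_path(root: str, identifier: str) -> PurePosixPath:
    """Build the path of the recording under a local LibriSpeech root (assumed)."""
    ident = parse_librispeech_id(identifier)
    return (
        PurePosixPath(root)
        / ident.subset
        / ident.speaker
        / ident.chapter
        / f"{ident.utterance}.flac"
    )
```

For example, `local_path("/data/LibriSpeech", "test-clean/7021/85628/7021-85628-0005.flac")` points at the first reference in the table, provided the archive was extracted under `/data` (an assumed location).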

2. Timbre Controllability

Timbre control evaluation uses RAVDESS utterances as style references and two LibriSpeech speakers (one female, one male) as timbre targets. Because no official NaturalSpeech 3 inference code is available, we simulate its timbre control capability with FACodec-based voice conversion, which directly swaps the speaker embedding in FACodec and therefore represents a practical upper bound for FACodec-based TTS systems. For each gender, samples with the same number share the same style reference. Please focus on timbre similarity to the target speaker when listening.
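The voice-conversion baseline described above can be pictured as replacing one field of a factorized representation while leaving the others untouched. The sketch below is a toy illustration only: `FactorizedSpeech` and `swap_timbre` are hypothetical names, the three fields are a simplification, and nothing here calls the actual FACodec API.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FactorizedSpeech:
    """Toy stand-in for a disentangled representation (not the FACodec API)."""
    content: Tuple[int, ...]    # what is said
    prosody: Tuple[int, ...]    # how it is said (style)
    timbre: Tuple[float, ...]   # who says it (speaker embedding)


def swap_timbre(source: FactorizedSpeech, target: FactorizedSpeech) -> FactorizedSpeech:
    """Keep the source content and prosody, take the timbre from the target."""
    return replace(source, timbre=target.timbre)


# Hypothetical values: a style reference and a timbre target.
style_ref = FactorizedSpeech(content=(3, 7, 7), prosody=(1, 4, 2), timbre=(0.1, 0.9))
timbre_ref = FactorizedSpeech(content=(5, 5), prosody=(2, 2), timbre=(0.8, 0.2))
converted = swap_timbre(style_ref, timbre_ref)
```

In this picture, `converted` carries the content and prosody of the style reference and the speaker embedding of the timbre target, which is the sense in which the baseline serves as a reference point for timbre control.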

Reference samples are shown as dataset identifiers rather than embedded audio. Users can access the original recordings via the publicly available dataset (LibriSpeech).

Timbre reference (Female)

LibriSpeech/test-clean/4507/16021/4507-16021-0005.flac

Timbre reference (Male)

LibriSpeech/test-clean/8224/274384/8224-274384-0009.flac

Female timbre control samples

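# | Prosody reference | FACodec | FC-TTS
1 | [audio] | [audio] | [audio]
2 | [audio] | [audio] | [audio]
3 | [audio] | [audio] | [audio]
4 | [audio] | [audio] | [audio]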

Male timbre control samples

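# | Prosody reference | FACodec | FC-TTS
1 | [audio] | [audio] | [audio]
2 | [audio] | [audio] | [audio]
3 | [audio] | [audio] | [audio]
4 | [audio] | [audio] | [audio]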

3. Style Controllability

Style control evaluation compares FC-TTS and F5-TTS under the same prosody reference. Samples with the same number use the same RAVDESS utterance as the prosody reference. For FC-TTS, a neutral utterance from the same speaker is additionally used as the timbre reference.

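# | Prosody reference | F5-TTS | FC-TTS
1 | [audio] | [audio] | [audio]
2 | [audio] | [audio] | [audio]
3 | [audio] | [audio] | [audio]
4 | [audio] | [audio] | [audio]
5 | [audio] | [audio] | [audio]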
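The pairing described in this section (an expressive RAVDESS utterance as the prosody reference, plus a neutral utterance from the same speaker as the FC-TTS timbre reference) can be sketched as a filename lookup. This assumes the standard RAVDESS naming scheme of seven hyphen-separated fields, where the third field is the emotion code (`01` is neutral) and the seventh is the actor ID. The helper names are hypothetical, and the page does not state how the neutral utterances were actually chosen.

```python
from typing import Iterable, NamedTuple, Optional

NEUTRAL_EMOTION = "01"  # RAVDESS emotion code for "neutral"


class RavdessFields(NamedTuple):
    emotion: str  # third field of the file name
    actor: str    # seventh field of the file name


def parse_ravdess_name(filename: str) -> RavdessFields:
    """Read the emotion and actor codes from a RAVDESS file name."""
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"not a RAVDESS-style file name: {filename!r}")
    return RavdessFields(emotion=fields[2], actor=fields[6])


def pick_neutral_timbre_reference(
    prosody_ref: str, candidates: Iterable[str]
) -> Optional[str]:
    """Return a neutral utterance by the same actor as the prosody reference.

    Candidates are scanned in sorted order so the choice is deterministic;
    None is returned when the actor has no neutral utterance in the pool.
    """
    actor = parse_ravdess_name(prosody_ref).actor
    for candidate in sorted(candidates):
        fields = parse_ravdess_name(candidate)
        if fields.actor == actor and fields.emotion == NEUTRAL_EMOTION:
            return candidate
    return None
```

F5-TTS would receive only the prosody reference, while FC-TTS would receive both the prosody reference and the file returned by `pick_neutral_timbre_reference`.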

4. Ablation Study

Audio samples correspond to the log-mel spectrograms in Figure 4 of the paper.

Figure 4 (ablation study log-mel spectrograms): baseline, single-stage generation, in-context learning, prosody variant, timbre variant, and without CCL.

Variant | Audio
Timbre reference | [audio]
Prosody reference | [audio]
Baseline | [audio]
Single-stage generation | [audio]
In-context learning | [audio]
Without CCL | [audio]