Recent advances in zero-shot text-to-speech (TTS) have enabled accurate imitation of reference speech
in terms of both speaking style and speaker timbre. However, controlling these two aspects
independently from separate references remains challenging. Several studies have proposed disentangled
speech representations that decompose speech into interpretable attributes (e.g., timbre, prosody,
and content), providing a promising foundation for TTS with attribute control from separate references.
Yet, how to effectively integrate such representations into TTS systems to achieve independent and precise
control remains underexplored.
In this paper, we present FC-TTS, a zero-shot TTS framework that enables
disentangled control of style and timbre by conditioning on two distinct reference
utterances. Unlike existing systems that inherit the limitations of pre-trained disentangled
representations, FC-TTS introduces key design strategies, including architectural choices, a training
framework, and auxiliary training objectives, which improve the reliability of attribute separation
and dual-reference control. Experiments show that FC-TTS achieves high-fidelity synthesis and
competitive zero-shot naturalness, while uniquely supporting consistent and independent manipulation
of style and timbre.