Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to let LLMs operate directly on discrete speech representations. However, these models often preserve speaker identity poorly, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma reaches sub-second end-to-end latency through a 1:2 interleaved text-audio token schedule that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Experimental results show that Chroma attains a 10.96% relative improvement in speaker similarity over the human baseline with a Real-Time Factor (RTF) of 0.43, while retaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B.
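To make the 1:2 interleaved text-audio token schedule concrete, the sketch below shows how a decoder's output stream could alternate one text token with two audio-codec tokens, so that audio chunks become available for playback while text generation is still in progress. This is a minimal illustration, not the model's actual implementation; the function name `interleave_1_to_2` and all token IDs are hypothetical, and Chroma's real vocabularies and chunk boundaries are defined by its tokenizer and codec.

```python
# Minimal sketch of a 1:2 interleaved text-audio token schedule.
# All token values below are hypothetical placeholders.

from itertools import islice
from typing import Iterable, Iterator


def interleave_1_to_2(text_tokens: Iterable[int],
                      audio_tokens: Iterable[int]) -> Iterator[int]:
    """Yield tokens in a repeating [text, audio, audio] pattern.

    For every text token emitted, two audio tokens follow, so the audio
    decoder can start synthesizing speech before the full response exists.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    while True:
        text_chunk = list(islice(text_it, 1))    # take 1 text token
        audio_chunk = list(islice(audio_it, 2))  # take 2 audio tokens
        if not text_chunk and not audio_chunk:   # both streams exhausted
            break
        yield from text_chunk
        yield from audio_chunk


# Example: 3 text tokens interleaved with 6 audio-codec tokens.
text = [101, 102, 103]            # hypothetical text-token IDs
audio = [7, 8, 9, 10, 11, 12]     # hypothetical audio-codec token IDs
print(list(interleave_1_to_2(text, audio)))
# -> [101, 7, 8, 102, 9, 10, 103, 11, 12]
```

In a streaming setup, each pair of audio tokens in this pattern could be handed to the codec decoder as soon as it is produced, which is what enables sub-second perceived latency despite autoregressive generation.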