This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving conversational flow, including full-duplex capabilities that allow simultaneous speaking and listening. The report details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of this report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.