Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a training-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.
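The semi-cascaded loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all module names (`detect_voice_activity`, `mllm_respond`, `synthesize`) and the toy transition rule are hypothetical stand-ins for the real VAD, multimodal LLM, and TTS components.

```python
# Hypothetical sketch of the unit-by-unit full-duplex pipeline:
# VAD gates the input, the (stand-in) MLLM answers each minimal
# conversational unit and predicts the transition, TTS renders audio.

from dataclasses import dataclass
from typing import List


@dataclass
class DialogueUnit:
    user_text: str            # transcribed user speech for this unit
    response: str = ""        # system reply for this unit
    transition: bool = False  # whether to hand control to the next unit


def detect_voice_activity(chunk: str) -> bool:
    """Stand-in VAD: treat any non-empty chunk as speech."""
    return bool(chunk.strip())


def mllm_respond(unit: DialogueUnit) -> DialogueUnit:
    """Stand-in MLLM: produce a reply and a toy transition prediction."""
    unit.response = f"ack: {unit.user_text}"
    unit.transition = unit.user_text.endswith("?")  # illustrative rule only
    return unit


def synthesize(text: str) -> bytes:
    """Stand-in TTS: pretend to render audio for the reply."""
    return text.encode("utf-8")


def run_dialogue(chunks: List[str]) -> List[DialogueUnit]:
    """Process each minimal conversational unit independently."""
    units = []
    for chunk in chunks:
        if not detect_voice_activity(chunk):
            continue  # stay silent when no speech is detected
        unit = mllm_respond(DialogueUnit(user_text=chunk))
        synthesize(unit.response)
        units.append(unit)
    return units
```

The point of the decomposition is visible in `run_dialogue`: each unit is handled in isolation, so the turn-taking decision reduces to a per-unit transition prediction rather than global dialogue planning.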