Empathetic interaction is a cornerstone of human-machine communication, as it requires understanding speech enriched with paralinguistic cues and generating emotional, expressive responses. However, the most powerful empathetic large speech language models (LSLMs) are increasingly closed off, leaving crucial details about their architectures, data, and development opaque to researchers. Given the critical need for transparent research into LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent, end-to-end LSLM designed to enable empathetic speech interactions. Building on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, and pre-training and fine-tuning code, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S.
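The core idea behind streaming interleaved decoding is that the model emits text tokens and speech tokens in alternating chunks, so that speech synthesis can begin before the full text response has been decoded. The sketch below illustrates only this interleaving pattern; the chunk sizes and token representation are assumptions for illustration and do not reflect the actual OpenS2S configuration.

```python
# Illustrative sketch of chunk-wise interleaving of two token streams.
# TEXT_CHUNK and SPEECH_CHUNK are hypothetical values, not OpenS2S settings.
TEXT_CHUNK = 5     # text tokens emitted per step (assumed)
SPEECH_CHUNK = 15  # speech tokens emitted per step (assumed)

def interleave(text_tokens, speech_tokens,
               text_chunk=TEXT_CHUNK, speech_chunk=SPEECH_CHUNK):
    """Merge text and speech token streams into one interleaved sequence,
    alternating fixed-size chunks so a downstream vocoder can start
    producing audio after the first speech chunk arrives."""
    out, t, s = [], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        out.extend(text_tokens[t:t + text_chunk])
        t += text_chunk
        out.extend(speech_tokens[s:s + speech_chunk])
        s += speech_chunk
    return out
```

Because the first speech chunk appears after only `text_chunk` text tokens, playback latency is bounded by the first chunk rather than the full response length, which is the property the streaming architecture targets.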