Speaker diarization is the task of labeling an audio or video recording with the identity of the speaker at each time stamp. In this work, we propose a novel machine learning framework that performs real-time multi-speaker diarization and recognition in a fully online reinforcement learning setting, without prior speaker registration or pretraining. Our framework unifies embedding extraction, clustering, and resegmentation into a single online decision-making problem. We discuss practical considerations and advanced techniques, such as offline reinforcement learning, semi-supervision, and domain adaptation, to address the challenges of limited training data and out-of-distribution environments. Our approach treats speaker diarization as a fully online learning problem of speaker recognition: the agent receives no pretraining on any training set before deployment and learns to detect speaker identity on the fly through reward feedback. This reinforcement learning paradigm yields an adaptive, lightweight, and generalizable system that is useful for multi-user teleconferences, where many people may come and go without extensive pre-registration. Lastly, we provide a desktop application that uses our proposed approach as a proof of concept. To the best of our knowledge, this is the first work to apply reinforcement learning to the speaker diarization task.
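To make the idea of learning speaker identity on the fly through reward feedback concrete, the following is a minimal sketch, not the framework proposed in this work. The abstract does not specify the learning algorithm, so an epsilon-greedy linear bandit is swapped in purely for illustration. The names `OnlineSpeakerBandit` and `simulate`, the fixed number of speaker slots, the 2-D toy "embeddings", and the 0/1 reward are all assumptions; a real system would consume learned speaker embeddings and would need to handle speakers joining and leaving.

```python
import random
from typing import List, Sequence


class OnlineSpeakerBandit:
    """Hypothetical epsilon-greedy linear bandit: one weight vector per speaker slot."""

    def __init__(self, dim: int, n_slots: int, epsilon: float = 0.2,
                 lr: float = 0.5, seed: int = 0) -> None:
        # No pretraining: every slot starts with all-zero weights.
        self.weights: List[List[float]] = [[0.0] * dim for _ in range(n_slots)]
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def _score(self, slot: int, x: Sequence[float]) -> float:
        # Predicted reward for assigning embedding x to this slot.
        return sum(w * xi for w, xi in zip(self.weights[slot], x))

    def act(self, x: Sequence[float], explore: bool = True) -> int:
        # With probability epsilon, try a random slot; otherwise pick the best one.
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        scores = [self._score(s, x) for s in range(len(self.weights))]
        return scores.index(max(scores))  # ties go to the lowest slot index

    def update(self, x: Sequence[float], slot: int, reward: float) -> None:
        # Bandit feedback: only the chosen slot is updated, by one SGD step
        # on the squared error between predicted and observed reward.
        error = reward - self._score(slot, x)
        self.weights[slot] = [w + self.lr * error * xi
                              for w, xi in zip(self.weights[slot], x)]


def simulate(steps: int = 500, seed: int = 0) -> OnlineSpeakerBandit:
    """Toy stream: two speakers alternate, each with a fixed 2-D embedding (assumed)."""
    voices = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    agent = OnlineSpeakerBandit(dim=2, n_slots=2, seed=seed)
    for t in range(steps):
        true_speaker = t % 2
        x = voices[true_speaker]
        guess = agent.act(x)
        # Assumed reward signal: 1 for a correct label, 0 otherwise.
        reward = 1.0 if guess == true_speaker else 0.0
        agent.update(x, guess, reward)
    return agent


if __name__ == "__main__":
    trained = simulate()
    for speaker, emb in ((0, [1.0, 0.0]), (1, [0.0, 1.0])):
        print("true speaker", speaker, "-> predicted", trained.act(emb, explore=False))
```

The agent here sees only the reward for the label it chose, never the correct label itself, which is what distinguishes this bandit-style setting from supervised speaker recognition.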