Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to additional acoustic and linguistic domains, languages, and modalities, these investigations have been limited to single-source speech, in which each recording has one primary speaker. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results, achieving 69% lower WER on multi-speaker ASR and 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
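To make the masked pseudo source separation objective more concrete, the following is a minimal sketch of how one training example might be assembled. It is not the paper's implementation. The function `make_example`, the `SILENCE_UNIT` label, the sample-wise sum used for mixing, and the choice to zero out masked frames directly in the waveform are all assumptions made for illustration; the abstract states only that the model must infer, for each source, the discovered units of the masked regions of a mixture.

```python
import random

# Hypothetical label for frames in which a source is silent or absent.
SILENCE_UNIT = -1


def make_example(waves, units, frame_len, mask_prob, span, rng):
    """Build one (mixture, mask, per-source targets) training example.

    waves: list of waveforms (lists of floats), one per source.
    units: list of frame-level unit sequences (lists of ints), one per source.
    All names and conventions here are illustrative assumptions.
    """
    n_samples = max(len(w) for w in waves)
    n_frames = n_samples // frame_len

    # Mix by summing the sources sample-wise (shorter sources are zero-padded).
    mixture = [0.0] * n_samples
    for w in waves:
        for i, x in enumerate(w):
            mixture[i] += x

    # Pad every unit sequence to the mixture length with the silence label.
    padded = [(u + [SILENCE_UNIT] * n_frames)[:n_frames] for u in units]

    # Choose masked frames: each frame starts a masked span with prob. mask_prob.
    masked = [False] * n_frames
    for start in range(n_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, n_frames)):
                masked[t] = True

    # Simplification: zero the masked frames in the waveform itself.
    for t, is_masked in enumerate(masked):
        if is_masked:
            for i in range(t * frame_len, (t + 1) * frame_len):
                mixture[i] = 0.0

    # Targets exist only at masked frames, one unit stream per source.
    targets = [[u[t] if masked[t] else None for t in range(n_frames)]
               for u in padded]
    return mixture, masked, targets


if __name__ == "__main__":
    rng = random.Random(0)
    waves = [[1.0] * 8, [0.5] * 4]     # two toy "speakers"
    units = [[3, 3, 7, 7], [5, 5]]     # toy discovered units per frame
    mixture, masked, targets = make_example(waves, units, 2, 0.3, 2, rng)
    print(masked)
    print(targets)
```

In this sketch, the padded silence labels are what would let a model signal how many sources are active in a frame: a stream whose target is `SILENCE_UNIT` corresponds to an absent source. A real system would also need a loss that compares predicted and target unit streams, which is omitted here.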