Self-supervised learning (SSL) models rely only on the intrinsic structure of a signal, independent of its acoustic domain, to map the input into an embedding space that retains its essential information. This implies that the utility of such representations is not limited to modeling human speech. Building on this observation, this paper explores the cross-transferability of SSL neural representations learned from human speech to the analysis of bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can distinguish the individual identities of marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively applied to the bio-acoustics domain, providing valuable insights for future investigations in this field.