Self-supervised learning (SSL) models learn from the intrinsic structure of a signal alone, regardless of its acoustic domain, and map the input to an embedding space that retains its essential information. This implies that the utility of such representations is not limited to modeling human speech. Building on this observation, this paper explores whether SSL neural representations learned from human speech transfer to the analysis of bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can distinguish the individual identities of marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be applied effectively to the bio-acoustics domain, providing insights for future investigations in this field.
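As an illustration of what caller discrimination on frozen embeddings could look like, the following is a minimal sketch, not the procedure used in this paper. It assumes that each call has already been passed through a pre-trained SSL model to obtain frame-level embeddings, that these are mean-pooled into one vector per call, and that callers are told apart by a nearest-centroid rule with cosine similarity. The pooling, the classifier, the helper names, and the synthetic data standing in for real embeddings are all assumptions made for this example.

```python
import math
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_centroids(vectors, labels):
    """One centroid per caller: the mean of that caller's call-level vectors."""
    groups = {}
    for vec, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(vec)
    return {lab: mean_pool(vecs) for lab, vecs in groups.items()}


def predict(centroids, vector):
    """Return the caller whose centroid is most similar to the vector."""
    return max(centroids, key=lambda lab: cosine(centroids[lab], vector))


def synthetic_call(rng, caller_mean, n_frames=20, noise=0.3):
    """Hypothetical stand-in for frame embeddings from a frozen SSL model."""
    return [[m + rng.gauss(0.0, noise) for m in caller_mean]
            for _ in range(n_frames)]


rng = random.Random(0)
# Hypothetical caller-specific embedding means (4 dimensions for brevity).
caller_means = {
    "caller_A": [1.0, 0.0, 0.0, 0.5],
    "caller_B": [0.0, 1.0, 0.5, 0.0],
}

train_x, train_y, test_x, test_y = [], [], [], []
for label, mean in caller_means.items():
    for i in range(30):
        call_vector = mean_pool(synthetic_call(rng, mean))
        if i < 20:
            train_x.append(call_vector)
            train_y.append(label)
        else:
            test_x.append(call_vector)
            test_y.append(label)

# No model weights are updated: only the centroids are estimated.
centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, vec) for vec in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(f"held-out caller discrimination accuracy: {accuracy:.2f}")
```

With real data, `synthetic_call` would be replaced by the frame-level outputs of a pre-trained speech model for each marmoset call. The rest of the sketch stays the same because nothing in it depends on how the embeddings were produced.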