Like humans, animals make extensive use of verbal and non-verbal forms of communication, including a wide range of audio signals. In this paper, we focus on dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech for dog bark classification tasks that parallel human-centered speech recognition tasks. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that speech embedding representations significantly improve over simpler classification baselines. Further, we find that models pre-trained on large-scale human speech acoustics can provide additional performance boosts on several tasks.