In this study, we introduce a novel cross-modal retrieval task that pairs speaker descriptions with their corresponding audio samples. Building on pre-trained speaker and text encoders, we present a simple framework based on contrastive learning. We also examine the effect of incorporating speaker labels into training. Our results show that linking speaker and text information is effective for this task in both English and Japanese, across diverse data configurations. A further visual analysis suggests a possible association between speaker clustering and retrieval performance.
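To make the contrastive setup concrete, the following is a minimal sketch of one common formulation, a symmetric InfoNCE-style loss over paired speaker and text embeddings, followed by text-to-speaker retrieval by cosine similarity. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the temperature value of 0.07, and the use of plain NumPy arrays in place of encoder outputs are all hypothetical, and the speaker-label variant mentioned above is not modeled.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(speaker_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Row i of speaker_emb and row i of text_emb are assumed to describe the
    same speaker; every other row in the batch serves as a negative.
    The temperature is an assumed value, not one taken from the source.
    """
    s = l2_normalize(speaker_emb)
    t = l2_normalize(text_emb)
    logits = (s @ t.T) / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    loss_s2t = -log_softmax(logits)[idx, idx].mean()    # speaker -> text
    loss_t2s = -log_softmax(logits.T)[idx, idx].mean()  # text -> speaker
    return 0.5 * (loss_s2t + loss_t2s)


def retrieve_speakers(query_text_emb, speaker_embs, k=1):
    """Return indices of the k speakers most similar to one text query."""
    q = l2_normalize(query_text_emb[None, :])
    s = l2_normalize(speaker_embs)
    sims = (s @ q.T).ravel()
    return np.argsort(-sims)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for the outputs of pre-trained speaker and text encoders.
    speaker = rng.normal(size=(8, 16))
    text = speaker + 0.1 * rng.normal(size=(8, 16))
    print("loss:", symmetric_contrastive_loss(speaker, text))
    print("top-3 speakers for description 2:", retrieve_speakers(text[2], speaker, k=3))
```

In a sketch of this kind, minimizing the loss pulls each description toward its own speaker embedding and pushes it away from the other speakers in the batch, which is what makes nearest-neighbor retrieval across the two modalities possible.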