This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. It focusses on two research questions: (1) Is keyword localisation possible with VGS models? (2) Can keyword localisation be done cross-lingually in a real low-resource setting? Four localisation methods are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset of spoken captions in Yoruba is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword localisation in low-resource settings.