We study the impact of visual assistance on automated audio captioning. Using multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in sound event detection, we analyze the usefulness of incorporating a variety of pretrained features. We perform experiments on a YouTube-based audiovisual dataset and evaluate the effect of this transfer learning technique using a variety of captioning metrics. We find that only one of the considered kinds of pretrained features provides consistent improvements, while the others provide no noteworthy gains. Interestingly, prior research indicates that the exact opposite holds for sound event detection, leading us to conclude that the optimal choice of visual embeddings depends strongly on the task at hand. More specifically, visual features focusing on semantics appear appropriate for automated audio captioning, while for sound event detection, temporal information seems to be more important.