Automated audio captioning aims to generate natural language descriptions for given audio clips, which requires not only detecting and classifying sounds but also summarizing the relationships between audio events. Recent work on audio captioning has introduced additional guidance to improve the accuracy of the audio events mentioned in generated sentences. However, temporal relations between audio events have received little attention, even though revealing such complex relations is a key component of summarizing audio content. This paper therefore aims to better capture temporal relationships in caption generation by using sound event detection (SED), a task that locates the timestamps of events. We investigate the best approach to integrating temporal information into a captioning model and propose a temporal tag system that transforms the timestamps into comprehensible relations. Evaluated with the proposed temporal metrics, the results suggest a substantial improvement in temporal relation generation.
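To make the idea of turning SED timestamps into relations concrete, the following is a minimal sketch of how a clip-level temporal tag could be derived from detected events. It is an illustration under stated assumptions, not the tag system proposed in this paper: the `Event` class, the functions `pair_relation` and `clip_tag`, the tag names, and the `min_overlap` threshold are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    """A detected sound event (hypothetical container for SED output)."""
    label: str
    onset: float   # start time in seconds
    offset: float  # end time in seconds


def pair_relation(a: Event, b: Event, min_overlap: float = 0.5) -> str:
    """Classify two events as 'simultaneous' or 'sequential'.

    Assumption: two events count as simultaneous when their overlap
    covers at least `min_overlap` of the shorter event's duration.
    """
    overlap = min(a.offset, b.offset) - max(a.onset, b.onset)
    shorter = min(a.offset - a.onset, b.offset - b.onset)
    if shorter > 0 and overlap / shorter >= min_overlap:
        return "simultaneous"
    return "sequential"


def clip_tag(events: list[Event], min_overlap: float = 0.5) -> str:
    """Summarize all pairwise relations in a clip as one coarse tag."""
    if len(events) < 2:
        return "single"
    relations = {
        pair_relation(a, b, min_overlap) for a, b in combinations(events, 2)
    }
    if relations == {"simultaneous"}:
        return "simultaneous"
    if relations == {"sequential"}:
        return "sequential"
    return "mixed"


rain = Event("rain", 0.0, 10.0)
thunder = Event("thunder", 4.0, 6.0)
door = Event("door slam", 12.0, 13.0)

print(clip_tag([rain, thunder]))        # overlapping events
print(clip_tag([thunder, door]))        # one event after the other
print(clip_tag([rain, thunder, door]))  # both kinds of relation
```

A coarse tag of this kind could then be supplied to a captioning model as an additional input, so that the decoder is nudged toward phrases such as "while" or "followed by". How the tag is actually defined and injected in this work is not specified in the abstract.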