We introduce Epic-Sounds, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused each sound. We identify actions that can be discriminated purely from audio by grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from video, discarding ambiguities. Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes, as well as 39.2k non-categorised segments. We train and evaluate state-of-the-art audio recognition and detection models on our dataset, covering both audio-only and audio-visual methods. We also analyse: the temporal overlap between audio events, the temporal and label correlations between audio and visual modalities, the ambiguities in annotating materials from audio-only input, the importance of audio-only labels, and the limitations of current models in understanding actions that sound. Project page: https://epic-kitchens.github.io/epic-sounds/
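To make the annotation structure concrete, below is a minimal Python sketch of how a temporally-labelled audio segment might be represented, together with a temporal intersection-over-union used to quantify overlap between audio events. The field names (`video_id`, `start`, `stop`, `label`) and the example class strings are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AudioSegment:
    """One Epic-Sounds-style annotation: a temporal extent within a
    video's audio stream plus an action class label.
    NOTE: field names are hypothetical, not the dataset's real schema."""
    video_id: str
    start: float  # segment start time, in seconds
    stop: float   # segment end time, in seconds
    label: str    # audio action class label

def temporal_iou(a: AudioSegment, b: AudioSegment) -> float:
    """Intersection-over-union of two segments' temporal extents;
    returns 0.0 if they do not overlap or belong to different videos."""
    if a.video_id != b.video_id:
        return 0.0
    inter = max(0.0, min(a.stop, b.stop) - max(a.start, b.start))
    union = (a.stop - a.start) + (b.stop - b.start) - inter
    return inter / union if union > 0 else 0.0

# Example: two overlapping audible events in the same video.
s1 = AudioSegment("P01_101", 10.0, 14.0, "chop")
s2 = AudioSegment("P01_101", 12.0, 16.0, "metal collision")
print(f"temporal IoU = {temporal_iou(s1, s2):.2f}")  # -> 0.33
```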