Video captioning aims to describe events in a video with natural language. In recent years, many works have focused on improving the performance of captioning models. However, like other text generation tasks, video captioning risks introducing factual errors that are not supported by the input video. These factual errors can seriously degrade the quality of the generated text, sometimes making it completely unusable. Although factual consistency has received much research attention in text-to-text tasks (e.g., summarization), it is less studied in vision-based text generation. In this work, we conduct a detailed human evaluation of factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences contain factual errors, indicating that factual inconsistency is a severe problem in this field. Existing evaluation metrics, however, are mainly based on n-gram matching and show little correlation with human factuality annotations. We further propose FactVC, a weakly-supervised, model-based factuality metric, which outperforms previous metrics on factuality evaluation of video captioning. The datasets and metrics will be released to promote future research on video captioning.