Video captioning automatically generates a short description of a video's content, usually in the form of a single sentence. Many methods have been proposed for this task, and the large MSR Video to Text (MSR-VTT) dataset is often used as the benchmark for evaluating them. However, we found that the human annotations in the dataset, i.e., the descriptions of the video content, are quite noisy: for example, there are many duplicate captions, and many captions contain grammatical problems. Such noise may make it harder for video captioning models to learn the underlying patterns. We cleaned the MSR-VTT annotations by removing these problems and then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning improved the performance of the models as measured by popular quantitative metrics. We also recruited subjects to evaluate the output of a model trained on the original dataset and on the cleaned dataset. This human behavior experiment demonstrated that, when trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the content of the video clips.
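As an illustration of one cleaning step mentioned above, removing duplicate captions, the following is a minimal sketch rather than the authors' actual pipeline. It assumes annotations are available as a list of `(video_id, caption)` pairs, and it assumes that two captions for the same video count as duplicates when they match after lowercasing and whitespace normalization. The function names `normalize` and `drop_duplicate_captions` and the sample data are hypothetical.

```python
from collections import defaultdict


def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(caption.lower().split())


def drop_duplicate_captions(annotations):
    """Keep the first occurrence of each caption per video, preserving order.

    `annotations` is a list of (video_id, caption) pairs. This layout is an
    assumption made for illustration; it is not the MSR-VTT file format.
    """
    seen = defaultdict(set)  # video_id -> normalized captions already kept
    cleaned = []
    for video_id, caption in annotations:
        key = normalize(caption)
        # Skip empty captions and captions already kept for this video.
        if key and key not in seen[video_id]:
            seen[video_id].add(key)
            cleaned.append((video_id, caption))
    return cleaned


# Hypothetical sample data, not taken from the dataset.
sample = [
    ("video0", "a man is singing"),
    ("video0", "A man is  singing"),
    ("video0", "a man sings on stage"),
    ("video1", "a man is singing"),
]
cleaned = drop_duplicate_captions(sample)
```

Duplicates are tracked per video, so the same sentence may still describe two different clips. Grammatical problems would need a separate step that this sketch does not cover.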