Automatically narrating videos in natural language can help people understand and manage the massive volume of videos on the Internet. Video uploaders may have varied preferences for how a description should be written in order to attract more potential followers, e.g., catching customers' attention for product videos. The Controllable Video Captioning task has therefore been proposed to generate a description conditioned on both user demand and video content. However, existing works suffer from two shortcomings: 1) the control signal is fixed and can only express single-grained control; 2) the video description cannot be further edited to meet dynamic user demands. In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests. Inspired by human writing-revision habits, we design the user command as an {operation, position, attribute} triplet to cover multi-grained user requirements, which can express coarse-grained control (e.g., expand the description) as well as fine-grained control (e.g., add specified details at a specified position) in a unified format. To facilitate the VDEdit task, we first automatically construct a large-scale open-domain benchmark dataset, VATEX-EDIT, describing diverse human activities. Considering real-life application scenarios, we further manually collect an e-commerce benchmark dataset called EMMAD-EDIT. We propose a unified framework that converts the {operation, position, attribute} triplet into a textual control sequence to handle multi-grained editing commands. For VDEdit evaluation, we adopt comprehensive metrics to measure three aspects of model performance: caption quality, caption-command consistency, and caption-video alignment.
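As an illustration of how an {operation, position, attribute} triplet might be flattened into a textual control sequence, consider the minimal sketch below. It is not the authors' implementation: the operation vocabulary, the marker tokens (`<op>`, `<pos>`, `<attr>`, `<cap>`), the use of a word index for the position, and the names `EditCommand` and `to_control_sequence` are all assumptions made for this example. The abstract specifies only that the triplet is converted into text and that it covers both coarse- and fine-grained control.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical operation vocabulary; the abstract does not enumerate the real one.
OPERATIONS = {"add", "delete", "expand", "shorten"}


@dataclass(frozen=True)
class EditCommand:
    """An {operation, position, attribute} triplet (illustrative only)."""
    operation: str
    position: Optional[int] = None   # assumed: word index in the caption; None = unspecified
    attribute: Optional[str] = None  # assumed: free-text detail; None = unspecified


def to_control_sequence(cmd: EditCommand, caption: str) -> str:
    """Serialize a command and the existing caption into one text string.

    Unspecified fields are simply omitted, so a coarse-grained command
    (operation only) and a fine-grained command (all three fields)
    share a single format.
    """
    if cmd.operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {cmd.operation!r}")
    parts = [f"<op> {cmd.operation}"]
    if cmd.position is not None:
        parts.append(f"<pos> {cmd.position}")
    if cmd.attribute is not None:
        parts.append(f"<attr> {cmd.attribute}")
    parts.append(f"<cap> {caption}")
    return " ".join(parts)


caption = "a man rides a bike"
coarse = to_control_sequence(EditCommand("expand"), caption)
fine = to_control_sequence(EditCommand("add", position=3, attribute="red"), caption)
print(coarse)
print(fine)
```

In this sketch the coarse-grained command carries only the operation, while the fine-grained one also fixes the position and attribute; both yield a plain string that a text-conditioned captioning model could consume alongside the video features.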