TVPR: Text-to-Video Person Retrieval and a New Benchmark

Most existing methods for text-based person retrieval focus on text-to-image person retrieval. However, isolated frames carry no dynamic information, so performance suffers when the person is occluded in a frame or when the textual description includes variable motion details. In this paper, we propose a new task, Text-to-Video Person Retrieval (TVPR), which aims to overcome the limitations of isolated frames. Since no dataset or benchmark describes person videos with natural language, we construct a large-scale cross-modal person video dataset with detailed natural language annotations, such as a person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. To address this task, we propose a Text-to-Video Person Retrieval Network (TVPRN). Specifically, TVPRN obtains video representations by fusing the visual and motion representations of person videos, which lets it handle temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ pre-trained BERT to obtain caption representations and use the relationship between caption and video representations to retrieve the most relevant person videos. To evaluate the effectiveness of TVPRN, we conduct extensive experiments on the TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be made publicly available to benefit future research.
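
The retrieval step described above (fuse visual and motion features into one video representation, then match it against a caption representation) can be illustrated with a minimal sketch. The abstract does not specify the fusion operator or the matching function, so everything here is an assumption made for illustration: mean pooling over frames, fusion by concatenation, cosine similarity for ranking, and a caption vector that has already been projected into the same space as the fused video vector. The function names (`fuse_video`, `rank_videos`) are hypothetical and are not taken from TVPRN.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frames):
    """Average per-frame feature vectors over time (assumed temporal pooling)."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def fuse_video(visual_frames, motion_frames):
    """Assumed fusion: concatenate pooled visual and pooled motion features."""
    visual = l2_normalize(mean_pool(visual_frames))
    motion = l2_normalize(mean_pool(motion_frames))
    return l2_normalize(visual + motion)  # list concatenation, then normalize


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(caption_vec, gallery):
    """Return video ids sorted from most to least similar to the caption."""
    scores = {vid: cosine(caption_vec, vec) for vid, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy gallery with 2-d visual and 2-d motion features per frame (made-up values).
gallery = {
    "video_a": fuse_video([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]),
    "video_b": fuse_video([[0.0, 1.0]], [[1.0, 0.0]]),
}
# Stand-in for a caption embedding already mapped into the joint 4-d space.
caption = [1.0, 0.0, 0.0, 1.0]
ranking = rank_videos(caption, gallery)
```

In a trained model the caption vector would come from a text encoder such as BERT followed by a learned projection, and the fusion would likely be learned as well; the sketch only shows how a fused video representation lets motion cues contribute to the ranking alongside appearance.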
