TVPR: Text-to-Video Person Retrieval and a New Benchmark

Most existing methods for text-based person retrieval focus on text-to-image person retrieval. However, isolated frames carry no dynamic information, so performance suffers when the person is occluded in individual frames or when the textual description includes variable motion details. In this paper, we propose a new task, Text-to-Video Person Retrieval (TVPR), which aims to overcome the limitations of isolated frames. Since no dataset or benchmark describes person videos in natural language, we construct a large-scale cross-modal person video dataset with detailed natural language annotations covering a person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. For this task, we propose a Text-to-Video Person Retrieval Network (TVPRN). Specifically, TVPRN acquires video representations by fusing visual and motion representations of person videos, which allows it to deal with temporal occlusion and with the absence of variable motion details in isolated frames. Meanwhile, we employ a pre-trained BERT to obtain caption representations, and use the relationship between caption and video representations to retrieve the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, we conduct extensive experiments on the TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be made publicly available to benefit future research.
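
The retrieval step described above (fuse visual and motion representations, then compare the result against a caption representation) can be illustrated with a minimal sketch. This is not the TVPRN implementation: the abstract does not state the fusion operator or the similarity measure, so the sketch assumes mean pooling over frames, concatenation as the fusion, and cosine similarity for ranking. The names `mean_pool`, `fuse`, `cosine`, and `rank_videos` are hypothetical, and the caption vector stands in for a BERT output that is assumed to already lie in the same space as the fused video vector.

```python
import math


def mean_pool(frames):
    """Average a list of per-frame feature vectors over time."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def fuse(visual_frames, motion_frames):
    """Assumed fusion: concatenate time-averaged visual and motion features."""
    return mean_pool(visual_frames) + mean_pool(motion_frames)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_videos(caption_vec, videos):
    """Return video ids sorted by descending similarity to the caption vector."""
    scored = [
        (cosine(caption_vec, fuse(visual, motion)), video_id)
        for video_id, (visual, motion) in videos.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored]


# Toy data: two frames per video, 2-d visual and 2-d motion features (made up).
videos = {
    "standing": ([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]),
    "walking": ([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]),
}
# Placeholder for a caption embedding projected into the fused 4-d space.
caption_vec = [1.0, 0.0, 0.0, 1.0]

ranking = rank_videos(caption_vec, videos)
```

In this toy setup both videos share the same visual features and differ only in their motion features, so the ranking is decided by the motion component, which is the kind of cue that a single isolated frame cannot provide.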
