In visible-infrared video person re-identification (re-ID), two things are key to cross-modal pedestrian identity matching: extracting features that are unaffected by changes in complex scenes (modality, camera view, pedestrian pose, background, etc.), and mining and exploiting motion information. To this end, this paper proposes a new visible-infrared video person re-ID method from a novel perspective, i.e., adversarial self-attack defense and spatial-temporal relation mining. In this work, changes in view, posture, and background, together with the modality discrepancy, are regarded as the main factors that perturb person identity features. The interference information of this kind contained in the training samples is used as an adversarial perturbation, which attacks the re-ID model during training so that the model becomes more robust to these unfavorable factors. The attack is introduced by activating the interference information already contained in the input samples, without generating adversarial samples, and is therefore called adversarial self-attack. This design allows adversarial attack and defense to be integrated into one framework. This paper further proposes a spatial-temporal information-guided feature representation network to exploit the information in video sequences. The network can not only extract the information contained in the video-frame sequences but also use the relations among local information in space to guide the extraction of more robust features. The proposed method exhibits compelling performance on large-scale cross-modality video datasets. The source code of the proposed method will be released at https://github.com/lhf12278/xxx.