In visible-infrared video person re-identification (re-ID), the keys to cross-modal pedestrian identity matching are extracting features that are unaffected by changes in complex scenes (such as modality, camera view, pedestrian pose, and background) and mining and exploiting motion information. To this end, this paper proposes a new visible-infrared video person re-ID method from a novel perspective, namely adversarial self-attack defense and spatial-temporal relation mining. In this work, changes in view, posture, and background, together with the modality discrepancy, are regarded as the main factors that perturb person identity features. The interference information of this kind contained in the training samples is used as an adversarial perturbation, which attacks the re-ID model during training so that the model becomes more robust to these unfavorable factors. The attack is introduced by activating the interference information already contained in the input samples, without generating adversarial samples, and can thus be called an adversarial self-attack. This design allows adversarial attack and defense to be integrated into one framework. The paper further proposes a spatial-temporal information-guided feature representation network to exploit the information in video sequences. The network not only extracts the information contained in the video-frame sequences but also uses the relations among local information in space to guide the extraction of more robust features. The proposed method exhibits compelling performance on large-scale cross-modality video datasets. The source code of the proposed method will be released at https://github.com/lhf12278/xxx.