Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification

Advanced deep Convolutional Neural Networks (CNNs) have shown great success in video-based person Re-Identification (Re-ID). However, they usually focus on the most obvious regions of a person and have limited global representation ability. Recently, Transformers have been shown to improve performance by exploring inter-patch relations with a global view. In this work, we combine the strengths of both and propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID. First, we couple CNNs and Transformers to extract two kinds of visual features and experimentally verify their complementarity. Then, in the spatial domain, we propose a Complementary Content Attention (CCA) that takes advantage of the coupled structure and guides the independent features toward spatial complementary learning. In the temporal domain, we propose a Hierarchical Temporal Aggregation (HTA) that progressively captures inter-frame dependencies and encodes temporal information. In addition, a gated attention delivers the aggregated temporal information to the CNN and Transformer branches for temporal complementary learning. Finally, we introduce a self-distillation training strategy that transfers the superior spatial-temporal knowledge to the backbone networks for higher accuracy and efficiency. In this way, two kinds of typical features from the same videos are integrated mechanically for more informative representations. Extensive experiments on four public Re-ID benchmarks demonstrate that our framework attains better performance than most state-of-the-art methods.
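
To make the two core ideas more concrete, the following is a minimal, dependency-free sketch and not the DCCT implementation. It assumes that spatial complementary learning can be approximated by scaled dot-product attention from one branch's global feature to the other branch's local tokens, and that hierarchical temporal aggregation can be approximated by averaging adjacent frame features pairwise until one vector remains. The function names `cross_attend` and `aggregate_hierarchically`, and both simplifications, are hypothetical and do not come from the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Attend from one branch's global feature (query) to the other
    branch's local tokens and return the attention-weighted sum.

    Assumption: plain scaled dot-product attention without learned
    projections; the abstract does not specify the exact form of CCA.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def aggregate_hierarchically(frames):
    """Merge a non-empty list of per-frame features level by level.

    Assumption: each level averages adjacent frames pairwise; an odd
    frame at the end of a level is carried up unchanged. The abstract
    only states that HTA aggregates progressively.
    """
    level = [list(f) for f in frames]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level) - 1, 2):
            merged.append([(a + b) / 2 for a, b in zip(level[i], level[i + 1])])
        if len(level) % 2 == 1:
            merged.append(level[-1])
        level = merged
    return level[0]


if __name__ == "__main__":
    # Hypothetical toy inputs: one 2-D global CNN feature per frame and
    # two 2-D Transformer patch tokens per frame, for a 4-frame clip.
    cnn_global = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
    vit_tokens = [[[1.0, 0.0], [0.0, 1.0]]] * 4
    per_frame = [cross_attend(q, toks) for q, toks in zip(cnn_global, vit_tokens)]
    print(aggregate_hierarchically(per_frame))
```

In this sketch the CNN feature acts as the query, so each frame's output is a mixture of Transformer tokens weighted by their similarity to the CNN view; swapping the roles gives the other direction of the coupling. A trained model would use learned projections and a gating step, both omitted here.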
