Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To handle noisy, low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shift, and further refine retrieval using improved optimization and k-reciprocal re-ranking. On the DetReIDX stress-test benchmark, our approach achieves mAP scores of 46.69 in the aerial-to-ground (A2G) setting, 41.23 in the ground-to-aerial (G2A) setting, and 22.98 in the aerial-to-aerial (A2A) setting, corresponding to an overall mAP of 35.73. These results indicate that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
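To make the temporal attention pooling idea concrete, the following is a minimal sketch of one common formulation: each frame embedding in a tracklet is scored against a scoring vector, the scores are normalized with a softmax, and the tracklet descriptor is the resulting weighted average. The abstract does not specify the exact form of the mechanism, so the single-vector scoring, the `temperature` parameter, and the names `temporal_attention_pool` and `query` are illustrative assumptions rather than the authors' implementation. In a trained model the scoring vector would be learned; here it is passed in by hand.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of per-frame scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(
    frames: Sequence[Sequence[float]],
    query: Sequence[float],
    temperature: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Pool T frame embeddings (each of dimension D) into one D-dim descriptor.

    `query` is a hypothetical scoring vector (learned in practice): frames
    whose embeddings align with it receive larger weights, so frames with
    low scores contribute little to the pooled descriptor.
    """
    if not frames:
        raise ValueError("tracklet must contain at least one frame")
    dim = len(frames[0])
    if len(query) != dim or any(len(f) != dim for f in frames):
        raise ValueError("all frames and the query must share one dimension")

    # One scalar score per frame: dot product with the scoring vector.
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores, temperature)

    # Weighted average of the frame embeddings along the time axis.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights
```

With a zero scoring vector all frames receive equal weight and the result reduces to plain average pooling, which is a convenient sanity check; a scoring vector that strongly favors some frames pushes the descriptor toward those frames.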