Most person re-identification (ReID) methods treat tracklet quality as an afterthought, and the majority of research proposes architectural modifications to foundational models. This focus overlooks the limitation imposed by low-quality tracklets, which poses challenges when ReID systems are deployed in difficult real-world scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines and adapts them to the video-based person ReID setting. To the best of our knowledge, this work is the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results show performance competitive with the baseline: 37.52% mAP in the aerial-to-ground scenario and 29.16% mAP in the ground-to-aerial scenario. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.