Unsupervised Multi-view Pedestrian Detection

With the growth of video surveillance, multiple visual sensors are deployed to localize pedestrians accurately in a specific area, which facilitates applications such as intelligent safety and new retailing. However, previous methods rely on supervision from human-annotated pedestrian positions in every video frame and camera view, which is a heavy burden on top of the necessary camera calibration and synchronization. Therefore, we propose an Unsupervised Multi-view Pedestrian Detection approach (UMPD) that eliminates the need for annotations when learning a multi-view pedestrian detector. 1) First, Semantic-aware Iterative Segmentation (SIS) extracts discriminative visual representations of the input images from different camera views via an unsupervised pretrained model, then converts them into 2D segments of pedestrians, based on our proposed iterative Principal Component Analysis and the zero-shot semantic classes from vision-language pretrained models. 2) Second, Vertical-aware Differential Rendering (VDR) not only learns the densities and colors of 3D voxels from the SIS masks, images, and camera poses, but also constrains the voxels to be vertical with respect to the ground plane, following the physical characteristics of pedestrians. 3) Third, the densities of 3D voxels learned by VDR are projected onto the bird's-eye view as the final detection results. Extensive experiments on popular multi-view pedestrian detection benchmarks, i.e., Wildtrack and MultiviewX, show that UMPD, the first unsupervised method to the best of our knowledge, performs competitively with previous state-of-the-art supervised techniques. Code will be available.
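The third step, projecting learned voxel densities onto the ground plane, can be illustrated with a minimal sketch. The abstract does not state how densities are aggregated along the vertical axis or how detections are read off the resulting map, so the max-reduction, the density threshold, and the local-maximum window below are assumptions made for illustration only, and the function names `project_to_bev` and `detect_peaks` are hypothetical rather than taken from the UMPD code.

```python
import numpy as np


def project_to_bev(density, reduce="max"):
    """Collapse an (X, Y, Z) voxel density grid along the vertical axis Z.

    The reduction operator is an assumption; the abstract only says the
    densities are projected onto the ground plane.
    """
    if reduce == "max":
        return density.max(axis=2)
    if reduce == "mean":
        return density.mean(axis=2)
    raise ValueError(f"unknown reduction: {reduce}")


def detect_peaks(bev, threshold=0.5, radius=1):
    """Return (x, y) ground cells that pass the threshold and are local maxima.

    A cell is kept if its value is at least `threshold` and no cell within
    a (2 * radius + 1)-wide square window is larger. Ties on a flat plateau
    are all kept; a real pipeline would likely add non-maximum suppression.
    """
    peaks = []
    n_x, n_y = bev.shape
    for x in range(n_x):
        for y in range(n_y):
            value = bev[x, y]
            if value < threshold:
                continue
            window = bev[max(0, x - radius):x + radius + 1,
                         max(0, y - radius):y + radius + 1]
            if value >= window.max():
                peaks.append((x, y))
    return peaks


# Toy example: an 8 x 8 ground grid with 4 vertical voxel layers.
density = np.zeros((8, 8, 4))
density[2, 3, :] = 0.9   # a tall, dense column (pedestrian-like)
density[6, 1, :2] = 0.7  # a shorter dense column
density[2, 4, 0] = 0.3   # weak response, below the threshold

bev = project_to_bev(density)
peaks = detect_peaks(bev)
```

In this toy grid, the two dense columns survive as detections while the weak neighboring response is filtered out. With the `mean` reduction, short columns are penalized relative to tall ones, which is one simple way a vertical prior could influence the ground-plane map.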
