ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection

Object detection is a key technique in a number of Computer Vision applications, but it often requires large amounts of annotated data to achieve decent results. Moreover, for pedestrian detection specifically, the collected data might contain personally identifiable information (PII), whose use is highly restricted in many countries. This label-intensive and privacy-sensitive task has recently led to increasing interest in training detection models on synthetically generated pedestrian datasets collected with a photo-realistic video game engine. The engine can generate unlimited amounts of data with precise and consistent annotations, which offers potential for significant gains in real-world applications. However, training on synthetic data introduces a synthetic-to-real domain shift that degrades the final performance. To close the gap between real and synthetic data, we propose to use a Generative Adversarial Network (GAN), which performs parameterized unpaired image-to-image translation to generate more realistic images. The key benefit of using the GAN is its intrinsic preference for low-level changes over geometric ones, which means the annotations of a given synthetic image remain accurate even after domain translation is performed, thus eliminating the need to label real data. We evaluate the proposed method extensively, training on the MOTSynth dataset and testing on the MOT17 and MOT20 detection datasets; the experimental results demonstrate its effectiveness. Our approach not only produces visually plausible samples but also requires no labels from the real domain, making it applicable to a variety of downstream tasks.
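The annotation-preserving property described above can be illustrated with a toy sketch. It is not the paper's model: `translate_appearance` is a hypothetical stand-in for a trained GAN generator, and its `gain` and `bias` arguments are assumed illustration parameters. The only point it makes is that a translation which changes pixel values but not pixel positions lets the synthetic bounding boxes be reused unchanged.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of RGB pixels
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) in pixels


def translate_appearance(image: Image, gain, bias) -> Image:
    """Hypothetical stand-in for a trained generator (assumption).

    Applies a per-channel affine change to pixel values only. No pixel is
    moved, so the geometry of every object in the image is untouched.
    """
    def clamp(value: float) -> int:
        return max(0, min(255, int(round(value))))

    return [
        [tuple(clamp(c * g + b) for c, g, b in zip(pixel, gain, bias))
         for pixel in row]
        for row in image
    ]


def build_training_pair(image: Image, boxes: List[Box], gain, bias):
    """Translate the image and carry the synthetic labels over as-is."""
    translated = translate_appearance(image, gain, bias)
    return translated, list(boxes)


# A tiny 4x6 synthetic "frame" with one annotated pedestrian box.
synthetic_image: Image = [[(100, 120, 140)] * 6 for _ in range(4)]
synthetic_boxes: List[Box] = [(1, 0, 3, 3)]

translated_image, labels = build_training_pair(
    synthetic_image, synthetic_boxes, gain=(0.9, 1.0, 1.1), bias=(5, 0, -5)
)
```

A real generator is a learned network and only tends toward such low-level changes, so in practice the labels stay approximately valid rather than valid by construction as they are here.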
