Ensuring the reliability of autonomous driving perception systems requires extensive environment-based testing, yet carrying out such testing in the real world is often impractical. Synthetic datasets have therefore emerged as a promising alternative, offering advantages such as cost-effectiveness, bias-free labeling, and controllable scenarios. However, the domain gap between synthetic and real-world datasets remains a major obstacle to model generalization. To address this challenge from a data-centric perspective, this paper introduces a profile extraction and discovery framework that characterizes the style profiles underlying both synthetic and real image datasets. We propose Style Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. To extract style embeddings, our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation. Furthermore, we establish a benchmark using publicly available datasets. Experiments across a variety of datasets and sim-to-real methods show that our method can quantify the synthetic-to-real gap. This work provides a standardized, profiling-based quality control paradigm that enables systematic diagnosis and targeted enhancement of synthetic datasets, advancing the future development of data-driven autonomous driving systems.
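As an illustration of the Gram matrix-based style extraction mentioned in the abstract, the following is a minimal sketch of how a style descriptor might be computed from a single convolutional feature map. The abstract does not specify the backbone network, the layers used, the normalization, or how SEDD itself is defined, so the function names `gram_matrix` and `style_vector`, the division by `H * W`, and the random feature map standing in for real network activations are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel Gram matrix of a feature map shaped (C, H, W)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # Dividing by H*W is one common normalization; it is assumed here,
    # not taken from the paper.
    return flat @ flat.T / (h * w)


def style_vector(features: np.ndarray) -> np.ndarray:
    """Flatten the upper triangle of the (symmetric) Gram matrix."""
    g = gram_matrix(features)
    rows, cols = np.triu_indices(g.shape[0])
    return g[rows, cols]


# Hypothetical stand-in for activations from a pretrained CNN layer.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 16, 16))

g = gram_matrix(fmap)
v = style_vector(fmap)
print(g.shape, v.shape)
```

In a pipeline like the one the abstract outlines, vectors of this kind would presumably be passed to a learned embedding trained with a metric-learning objective, and the distributions of the resulting embeddings for synthetic and real datasets would then be compared.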