Autonomous driving has developed rapidly and shown promising performance, driven by recent advances in hardware and deep learning techniques. High-quality datasets are fundamental to developing reliable autonomous driving algorithms. Previous dataset surveys either covered only a limited number of datasets or lacked a detailed investigation of dataset characteristics. To this end, we present an exhaustive study of 265 autonomous driving datasets from multiple perspectives, including sensor modalities, data size, tasks, and contextual conditions. We introduce a novel metric to evaluate the impact of datasets, which can also serve as a guide for creating new datasets. In addition, we analyze the annotation processes, existing labeling tools, and annotation quality of datasets, showing the importance of establishing a standard annotation pipeline. We also thoroughly analyze the impact of geographical and adversarial environmental conditions on the performance of autonomous driving systems. Moreover, we present the data distribution of several vital datasets and discuss their pros and cons accordingly. Finally, we discuss the current challenges and future development trends of autonomous driving datasets.