Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects YOLOv8 object detection. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view can improve detection. In nuScenes, mAP$_{50}$ rises from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on the other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, removing $4.1\%$ to $8.6\%$ of labels leaves mAP$_{50}$ near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV perception performance, and they motivate a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD
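To make the idea of pruning redundant multisource labels concrete, here is a minimal Python sketch. It is not the method used in this work, whose redundancy model is not described in the abstract. It assumes that each 2D label carries an identifier for the physical object it depicts, so that the same object seen by two overlapping cameras can be matched, and it uses a hypothetical rule that keeps only the view with the largest bounding box. The names `Label`, `box_area`, and `prune_redundant` are illustrative and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    camera: str        # camera that produced the label, e.g. "CAM_FRONT"
    instance_id: str   # assumed identifier of the physical object, shared across cameras
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels


def box_area(box):
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def prune_redundant(labels):
    """Keep one label per object (the largest view) and report the removed fraction."""
    best = {}
    for label in labels:
        current = best.get(label.instance_id)
        # Hypothetical rule: a strictly larger box replaces the current choice.
        if current is None or box_area(label.box) > box_area(current.box):
            best[label.instance_id] = label
    kept = [label for label in labels if best[label.instance_id] is label]
    removed_fraction = 1 - len(kept) / len(labels) if labels else 0.0
    return kept, removed_fraction


# Toy frame: one car appears in two overlapping cameras, one pedestrian in a single camera.
labels = [
    Label("CAM_FRONT", "car_1", (1500, 400, 1600, 500)),
    Label("CAM_FRONT_RIGHT", "car_1", (0, 400, 40, 500)),
    Label("CAM_FRONT", "ped_7", (700, 420, 740, 520)),
]
kept, removed_fraction = prune_redundant(labels)
```

In this toy frame the truncated view of the car in the second camera is dropped, so one of the three labels is removed. Reported figures such as the share of labels removed in AV2 are fractions of this kind, although the actual selection criterion may differ from the box-area rule assumed here.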
