Data scaling is fundamental to modern deep learning and grows increasingly critical as autonomous driving shifts toward end-to-end learning. Real-world driving data is expensive to annotate and biased in its scene coverage, which makes co-training on real data and near-unlimited synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and introduces distribution shift, and optimizing the data mixture under practical training budgets remains a critical yet under-explored problem. We therefore argue that the training data mixture needs explicit guidance on which scene types to include and in what quantities. In this work, we approximately formulate data mixing as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback. We propose AutoScale, a fully automated closed-loop data engine that unifies scene representation, data mixture optimization and retrieval, and model training and evaluation. Specifically, we propose a Graph Regularized AutoEncoder (Graph-RAE) for driving scene representation, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.
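The idea of iteratively reweighting scene clusters from evaluation feedback can be illustrated with a small sketch. This is not the authors' Cluster-GA, whose details the abstract does not give. It is a generic gradient-ascent loop over cluster mixture weights, written under several assumptions: the weights are parameterized by a softmax over per-cluster logits, the gradient is estimated by finite differences, and the function `toy_closed_loop_score` (with its hidden `target` mixture) is a hypothetical stand-in for the expensive "train on this mixture, then evaluate in closed loop" step. All names below are illustrative.

```python
import math


def softmax(logits):
    """Map unconstrained logits to mixture weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_closed_loop_score(weights, target=(0.6, 0.3, 0.1)):
    """Hypothetical stand-in for training plus closed-loop evaluation.

    Assumption: the score peaks when the mixture matches a hidden target.
    A real system would train a model on the mixture and measure a driving metric.
    """
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def estimate_gradient(logits, score_fn, eps=1e-3):
    """Central finite-difference estimate of d(score)/d(logit) per cluster."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((score_fn(softmax(up)) - score_fn(softmax(down))) / (2 * eps))
    return grad


def optimize_mixture(num_clusters, score_fn, steps=500, lr=2.0):
    """Gradient ascent on cluster logits, starting from a uniform mixture."""
    logits = [0.0] * num_clusters
    for _ in range(steps):
        grad = estimate_gradient(logits, score_fn)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return softmax(logits)


def allocate_budget(weights, budget):
    """Turn mixture weights into per-cluster sample counts that sum to budget.

    Uses largest-remainder rounding so no samples are lost or added.
    """
    raw = [w * budget for w in weights]
    counts = [int(r) for r in raw]
    leftover = budget - sum(counts)
    by_remainder = sorted(range(len(raw)), key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


weights = optimize_mixture(3, toy_closed_loop_score)
counts = allocate_budget(weights, budget=1000)
print("cluster weights:", [round(w, 3) for w in weights])
print("samples per cluster:", counts)
```

In this sketch the per-cluster counts would then drive retrieval: each cluster contributes only as many synthetic samples as its learned weight allows under the fixed budget. In practice every score evaluation implies a training run, so a real engine would need a far cheaper gradient estimate than the finite differences used here.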