The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics exhibited only moderate or weak correlations. In addition, SDQM provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM
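The claim that a dataset-quality metric correlates with mAP can be checked, in principle, by scoring several candidate datasets with the metric, training a detector on each, and correlating the two sets of numbers. The following is a minimal sketch of that check, not the SDQM implementation. It assumes the Pearson coefficient (the abstract does not say which correlation measure was used), and the function name `pearson_r` and the example values are hypothetical placeholders rather than results from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences.

    Assumption: Pearson is used here only for illustration; the source
    does not specify the correlation measure.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two paired observations")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Unnormalised covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values, NOT taken from the paper: one quality score and one
    # measured mAP per candidate synthetic dataset.
    metric_scores = [0.20, 0.35, 0.50, 0.65, 0.80]
    map_scores = [0.31, 0.38, 0.47, 0.52, 0.61]
    print(f"Pearson r = {pearson_r(metric_scores, map_scores):.3f}")
```

A coefficient near 1 would indicate that the metric ranks datasets in much the same order as the trained detector's mAP does, which is the property that would let the metric stand in for full training runs.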