SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM
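The correlation claim above can be illustrated with a minimal sketch of how a dataset-quality score might be compared against detector mAP across several candidate datasets. The abstract does not say which correlation statistic was used, so Pearson's r is an assumption here, and the function name `pearson_r` and all numbers are hypothetical placeholders rather than results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT taken from the paper:
# one quality score and one mAP value per candidate synthetic dataset.
metric_scores = [0.20, 0.35, 0.50, 0.65, 0.80]
map_scores = [0.31, 0.38, 0.52, 0.60, 0.74]

r = pearson_r(metric_scores, map_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would mean the quality score ranks datasets in much the same order as the trained detector's mAP, which is the property that would let such a score stand in for repeated full training runs.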

