A Real-Calibrated Synthetic-First Data Engine

Modern computer vision systems increasingly hit performance limits in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often yields unstable performance gains because of dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation with multi-stage curation and filtering in a unified pipeline, with optional support for uncertainty-driven selection and human verification. Rather than introducing new generative algorithms, our approach focuses on systematic dataset construction to improve the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline in which the generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. In an empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when it is used alongside real anchors as augmentation that adds almost no human annotation cost, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.
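
The modular design described in the abstract, with independently replaceable filtering and selection stages applied to synthetic samples while real anchors are kept, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the `quality` and `uncertainty` scores, the threshold values, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    image_id: str
    source: str         # "real" or "synthetic"
    quality: float      # hypothetical curation score in [0, 1]
    uncertainty: float  # hypothetical model uncertainty in [0, 1]


# A stage is any function that maps a list of samples to a list of samples,
# so stages can be swapped or reordered without touching the others.
Stage = Callable[[List[Sample]], List[Sample]]


def quality_filter(min_quality: float) -> Stage:
    """Drop synthetic samples below a quality threshold; keep all real anchors."""
    def run(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples
                if s.source == "real" or s.quality >= min_quality]
    return run


def uncertainty_select(top_k: int) -> Stage:
    """Keep real anchors plus the top_k most uncertain synthetic samples."""
    def run(samples: List[Sample]) -> List[Sample]:
        real = [s for s in samples if s.source == "real"]
        synthetic = sorted(
            (s for s in samples if s.source == "synthetic"),
            key=lambda s: s.uncertainty,
            reverse=True,
        )
        return real + synthetic[:top_k]
    return run


def run_pipeline(samples: List[Sample], stages: List[Stage]) -> List[Sample]:
    """Apply each configured stage in order."""
    for stage in stages:
        samples = stage(samples)
    return samples


# Toy data: one real anchor and four synthetic candidates.
pool = [
    Sample("r1", "real", 1.0, 0.1),
    Sample("s1", "synthetic", 0.9, 0.2),
    Sample("s2", "synthetic", 0.3, 0.9),
    Sample("s3", "synthetic", 0.8, 0.7),
    Sample("s4", "synthetic", 0.7, 0.5),
]

kept = run_pipeline(pool, [quality_filter(0.6), uncertainty_select(2)])
print([s.image_id for s in kept])  # ['r1', 's3', 's4']
```

Because each stage shares one signature, a human-verification step or a different selection rule could be added to the stage list without changing `run_pipeline`, which is one plausible reading of "independently configured and replaced".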
