Modern computer vision systems increasingly hit performance limits in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. Controllable diffusion models make synthetic image generation scalable, but applying synthetic augmentation directly often yields unstable performance gains because of dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation with multi-stage curation and filtering in a unified pipeline, with optional support for uncertainty-driven selection and human verification. Rather than introducing new generative algorithms, our approach focuses on systematic dataset construction to improve the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline in which the generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. In an empirical evaluation centered on human pose estimation, synthetic data improves a real-data baseline when used as augmentation alongside real anchors at near-zero human annotation cost, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for augmentation in low-data regimes.
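To illustrate what "independently configured and replaced" components could look like in practice, the following is a minimal sketch, not the framework's actual implementation. All names here (`Pipeline`, `generate`, `make_quality_filter`, `make_uncertainty_selector`) and the sample fields (`id`, `quality`, `uncertainty`) are hypothetical, and the generation stage is a toy stand-in that emits fixed scores rather than calling a diffusion model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A sample is a plain dict here; the field names are hypothetical.
Sample = Dict[str, float]
# Every stage maps a list of samples to a list of samples,
# so any stage can be swapped without touching the others.
Stage = Callable[[List[Sample]], List[Sample]]


def generate(_: List[Sample]) -> List[Sample]:
    """Toy stand-in for controllable diffusion generation."""
    scores = [(0.9, 0.2), (0.3, 0.8), (0.7, 0.6), (0.8, 0.1)]
    return [
        {"id": float(i), "quality": q, "uncertainty": u}
        for i, (q, u) in enumerate(scores)
    ]


def make_quality_filter(threshold: float) -> Stage:
    """Curation stage: keep samples whose quality meets the threshold."""
    def stage(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if s["quality"] >= threshold]
    return stage


def make_uncertainty_selector(k: int) -> Stage:
    """Optional selection stage: keep the k most uncertain samples."""
    def stage(samples: List[Sample]) -> List[Sample]:
        ranked = sorted(samples, key=lambda s: s["uncertainty"], reverse=True)
        return ranked[:k]
    return stage


@dataclass
class Pipeline:
    """Runs the configured stages in order, feeding each one's output to the next."""
    stages: List[Stage] = field(default_factory=list)

    def run(self) -> List[Sample]:
        samples: List[Sample] = []
        for stage in self.stages:
            samples = stage(samples)
        return samples


pipeline = Pipeline([generate, make_quality_filter(0.5), make_uncertainty_selector(2)])
selected = pipeline.run()
print([int(s["id"]) for s in selected])
```

Because every stage shares one signature, dropping the optional selector or changing the filter threshold only changes the list passed to `Pipeline`; a human-verification stage could be added the same way under these assumptions.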