Scaling data volume and diversity is critical for generalizing embodied intelligence. While synthetic data generation offers a scalable alternative to expensive physical data acquisition, existing pipelines remain fragmented and task-specific. This fragmentation causes significant engineering inefficiency and system instability, and it cannot support the sustained, high-throughput data generation required for foundation model training. To address these challenges, we present Nimbus, a unified synthetic data generation framework designed to integrate heterogeneous navigation and manipulation pipelines. Nimbus introduces a modular four-layer architecture featuring a decoupled execution model that separates trajectory planning, rendering, and storage into asynchronous stages. By implementing dynamic pipeline scheduling, global load balancing, distributed fault tolerance, and backend-specific rendering optimizations, the system maximizes utilization of CPU, GPU, and I/O resources. Our evaluation demonstrates that Nimbus achieves a 2-3× improvement in end-to-end throughput over unoptimized baselines and ensures robust, long-term operation in large-scale distributed environments. This framework serves as the production backbone for the InternData suite, enabling seamless cross-domain data synthesis.
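To make the decoupled execution model concrete, the following is a minimal sketch of three asynchronous stages (trajectory planning, rendering, storage) connected by bounded queues. It is an illustration under stated assumptions, not Nimbus's actual implementation: the names `run_stage`, `plan`, `render`, and `run_pipeline`, the toy trajectory and frame formats, and the use of Python threads are all hypothetical choices made for this example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the task stream


def run_stage(work, inbox, outbox):
    """Apply `work` to each item from `inbox`; forward results to `outbox` if given."""
    while True:
        item = inbox.get()
        if item is _DONE:
            if outbox is not None:
                outbox.put(_DONE)  # propagate shutdown to the next stage
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def plan(task_id):
    # Hypothetical planner: a "trajectory" is just three integer waypoints.
    return {"task": task_id, "waypoints": [task_id + i for i in range(3)]}


def render(trajectory):
    # Hypothetical renderer: one placeholder frame label per waypoint.
    frames = [f"frame@{w}" for w in trajectory["waypoints"]]
    return {"task": trajectory["task"], "frames": frames}


def run_pipeline(task_ids, buffer_size=4):
    """Run planning, rendering, and storage as concurrent stages."""
    stored = []  # stands in for a storage backend
    to_plan = queue.Queue()
    # Bounded queues give backpressure: a fast stage blocks instead of
    # building an unbounded backlog in front of a slower one.
    to_render = queue.Queue(maxsize=buffer_size)
    to_store = queue.Queue(maxsize=buffer_size)

    stages = [
        threading.Thread(target=run_stage, args=(plan, to_plan, to_render)),
        threading.Thread(target=run_stage, args=(render, to_render, to_store)),
        threading.Thread(target=run_stage, args=(stored.append, to_store, None)),
    ]
    for stage in stages:
        stage.start()
    for task_id in task_ids:
        to_plan.put(task_id)
    to_plan.put(_DONE)
    for stage in stages:
        stage.join()
    return stored


if __name__ == "__main__":
    for record in run_pipeline([0, 10]):
        print(record["task"], record["frames"])
```

Because each stage only communicates through its queues, a slow renderer does not stall planning until the buffer fills, and each stage could in principle be scaled or replaced independently. A production system would presumably use processes or distributed workers rather than threads; that choice is outside what this sketch shows.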