Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLMs), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, our model, cadrille, sets a new state of the art on three challenging datasets, including a real-world one. Code is available at https://github.com/col14m/cadrille.
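To make the online RL stage more concrete, the following minimal sketch illustrates the group-relative normalization at the core of GRPO-style training: several candidate outputs are sampled for the same input, each is scored programmatically, and each score is normalized against its own group. The function name `group_relative_advantages`, the `eps` constant, and the reward values are illustrative assumptions; the abstract does not specify the reward function or the exact training loop, so this should not be read as the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group it was sampled in.

    Assumption: rewards are scalar scores, one per candidate sampled
    for the same input. The use of the population standard deviation
    and the value of eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Candidates above the group mean get a positive advantage,
    # candidates below it get a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four candidate CAD programs generated for one input.
rewards = [0.2, 0.9, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the group, no separate value network is needed to provide a baseline; a group in which every candidate receives the same score yields zero advantage for all of them.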