While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains challenging because they lack explicit physical grounding and cannot perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner, Model Predictive Path Integral control (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then refined in real time via online system identification, while the LLM iteratively adjusts the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurring tasks. Because slow LLM inference runs separately from reactive execution, this hierarchical architecture maintains real-time control stability under dynamic contact requirements. We validate CoRAL in simulation and on real-world hardware across challenging, novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art Vision-Language-Action (VLA) and foundation-model-based planner baselines, boosting success rates by over 50% on average in unseen contact-rich scenarios, and that it handles sim-to-real gaps through its adaptive physical understanding.
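To make the "LLM as cost designer" idea concrete, the following is a minimal sketch of an MPPI update driven by a weighted cost function. It is not CoRAL's implementation. Everything in it is an assumption made for illustration: the 1-D pushed-block dynamics in `step`, the hypothetical `weights` dictionary standing in for an LLM-synthesized objective, the function names, and the simplified update rule (the control-cost cross term of full MPPI is omitted). The `mass` and `mu` arguments mark where VLM priors refined by online system identification would enter.

```python
import math
import random


def step(state, force, mass, mu, dt=0.05, g=9.81):
    """Toy 1-D block pushed by a horizontal force, with Coulomb friction (assumed model)."""
    pos, vel = state
    friction = mu * mass * g
    if abs(vel) < 1e-9:
        # At rest: friction cancels the push up to its limit.
        net = 0.0 if abs(force) <= friction else force - math.copysign(friction, force)
    else:
        net = force - math.copysign(friction, vel)
    new_vel = vel + dt * net / mass
    if vel * new_vel < 0.0:
        new_vel = 0.0  # stop at a zero crossing instead of reversing within one step
    return (pos + dt * new_vel, new_vel)


def trajectory_cost(states, controls, weights, goal):
    """Weighted sum of cost terms; `weights` is a hypothetical stand-in for an LLM-designed cost."""
    cost = 0.0
    for (pos, _vel), u in zip(states, controls):
        cost += weights["goal"] * (pos - goal) ** 2
        cost += weights["effort"] * u ** 2
    cost += weights["terminal_speed"] * states[-1][1] ** 2
    return cost


def mppi_update(state, nominal, weights, goal, mass, mu, rng,
                n_samples=200, sigma=2.0, lam=1.0):
    """One simplified MPPI iteration: sample, roll out, reweight, average the noise."""
    noises, costs = [], []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in nominal]
        controls = [u + e for u, e in zip(nominal, eps)]
        s, states = state, []
        for u in controls:
            s = step(s, u, mass, mu)
            states.append(s)
        noises.append(eps)
        costs.append(trajectory_cost(states, controls, weights, goal))
    best = min(costs)  # subtract the minimum cost for numerical stability
    w = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(w)
    return [u + sum(wi * eps[t] for wi, eps in zip(w, noises)) / total
            for t, u in enumerate(nominal)]


def run(goal=0.5, mass=1.0, mu=0.3, horizon=15, steps=80, seed=0):
    """Receding-horizon loop: replan, apply the first control, shift the plan."""
    rng = random.Random(seed)
    weights = {"goal": 10.0, "effort": 0.01, "terminal_speed": 1.0}  # assumed values
    state, nominal = (0.0, 0.0), [0.0] * horizon
    for _ in range(steps):
        nominal = mppi_update(state, nominal, weights, goal, mass, mu, rng)
        state = step(state, nominal[0], mass, mu)
        nominal = nominal[1:] + [0.0]
    return state


if __name__ == "__main__":
    print(run())
```

In this sketch, changing the planner's behavior means editing only `weights` or the terms inside `trajectory_cost`, while the sampling loop stays untouched. That separation is what lets a slow reasoning module revise the objective without sitting inside the fast control loop.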