Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a reasoning teacher, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities to embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance while reducing inference latency by up to 89.3\% relative to state-of-the-art reasoning VLAs, and maintains effective long-horizon planning, few-shot adaptation, and failure recovery.
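To make the two training signals in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: (1) distilling a teacher's lengthy CoT embeddings into a short sequence of latent reasoning tokens, and (2) a Bradley-Terry-style preference objective that scores a preferred manipulation trajectory above a rejected one. All module names, dimensions, and the specific loss choices here are illustrative assumptions rather than details confirmed by the paper.

```python
# Hypothetical sketch of latent-CoT distillation with a preference-guided objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentReasoner(nn.Module):
    """Student head: compresses multimodal context into K latent CoT tokens."""
    def __init__(self, d_model=512, num_latents=8):
        super().__init__()
        # Learned latent queries stand in for the compact latent CoT.
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, context):                      # context: (B, T, d_model)
        q = self.latents.unsqueeze(0).expand(context.size(0), -1, -1)
        z, _ = self.attn(q, context, context)        # (B, K, d_model) latent CoT
        return z

def distill_loss(student_latents, teacher_cot_emb):
    """Align pooled student latents with the teacher's pooled CoT embedding."""
    s = F.normalize(student_latents.mean(dim=1), dim=-1)
    t = F.normalize(teacher_cot_emb.mean(dim=1), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()        # cosine distance

def preference_loss(score_preferred, score_rejected, beta=1.0):
    """Bradley-Terry objective: the preferred trajectory should score higher."""
    return -F.logsigmoid(beta * (score_preferred - score_rejected)).mean()

# Toy usage with random stand-ins for fused vision-language features and
# teacher CoT embeddings (64 teacher tokens compressed to 8 latent tokens).
B, T, D = 4, 32, 512
reasoner = LatentReasoner(d_model=D)
context = torch.randn(B, T, D)
teacher_cot = torch.randn(B, 64, D)

z = reasoner(context)
loss = distill_loss(z, teacher_cot) + preference_loss(
    score_preferred=torch.randn(B), score_rejected=torch.randn(B))
loss.backward()
print(f"combined loss: {loss.item():.4f}")
```

Under these assumptions, the latency savings come from the student attending over only K latent tokens at inference time instead of decoding a long verbalized trace, while the preference term ties the quality of those latents to downstream trajectory alignment.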