MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

Real-world fine manipulation, particularly bimanual manipulation, typically requires low-latency control and stable visual localization. Collecting large-scale data is costly, however, and policies trained on limited demonstrations are prone to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; and vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce MSACT, which builds on ACT with a pretrained ResNet visual prior and adds a multistage spatial attention module. The module extracts stable, task-relevant 2D attention points as a local spatial modality for action prediction and jointly predicts future attention sequences with a temporal alignment loss. To maintain consistent object tracking, we introduce a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving the stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.
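As a rough illustration of the two ideas named above, the sketch below reduces a 2D score map to a single attention point and then penalizes disagreement between predicted and later-observed points. It is a minimal sketch under stated assumptions, not the method's actual implementation: the abstract does not specify how attention points are computed or how the alignment loss is defined, so the soft-argmax readout, the squared-distance penalty, and the names `soft_argmax_2d` and `alignment_loss` are all hypothetical choices made for illustration.

```python
import math


def soft_argmax_2d(heatmap, temperature=1.0):
    """Expected (x, y) location of a 2D score map, normalized to [0, 1].

    Assumption: a softmax-weighted average of pixel coordinates stands in
    for the attention-point readout; the source does not state the readout.
    """
    h, w = len(heatmap), len(heatmap[0])
    flat = [v / temperature for row in heatmap for v in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    x = y = 0.0
    for idx, e in enumerate(exps):
        r, c = divmod(idx, w)
        p = e / total
        x += p * (c / (w - 1) if w > 1 else 0.0)
        y += p * (r / (h - 1) if h > 1 else 0.0)
    return x, y


def alignment_loss(predicted_points, future_points):
    """Mean squared distance between predicted and later-observed points.

    Assumption: a plain squared-distance penalty stands in for the temporal
    alignment objective, whose exact form the source does not give.
    """
    if len(predicted_points) != len(future_points):
        raise ValueError("sequences must have the same length")
    if not predicted_points:
        raise ValueError("sequences must be non-empty")
    total = 0.0
    for (px, py), (fx, fy) in zip(predicted_points, future_points):
        total += (px - fx) ** 2 + (py - fy) ** 2
    return total / len(predicted_points)


# Usage: points read out from two hypothetical future frames are compared
# with the points predicted for those time steps.
frame_t1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 50.0], [0.0, 0.0, 0.0]]
frame_t2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
observed = [soft_argmax_2d(frame_t1), soft_argmax_2d(frame_t2)]
predicted = [(1.0, 0.5), (0.5, 0.5)]
loss = alignment_loss(predicted, observed)
```

Because the future points come from the frames themselves, a loss of this kind needs no keypoint labels, which matches the self-supervised framing in the abstract.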
