We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT), a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on a new fine-tuning method that we propose, called up-training. Up-training converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (including massive billion-parameter vision-language-action models, or VLAs) into efficient linear-attention counterparts while maintaining high quality. We demonstrate the effectiveness of SARA-RT by speeding up (a) the class of recently introduced RT-2 models, the first VLA robotic policies pre-trained on internet-scale data, and (b) Point Cloud Transformer (PCT) robotic policies operating on large point clouds. We complement our results with a rigorous mathematical analysis that provides deeper insight into the phenomenon of SARA.
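To illustrate the gap between quadratic and linear attention that the abstract refers to, the following is a minimal NumPy sketch of generic kernelized linear attention. It is not the SARA-RT mechanism itself: the feature map `feature_map` (here elu(x) + 1) and all function names are assumptions chosen for illustration, and the abstract does not specify which feature map or normalization the method uses.

```python
import numpy as np


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1 (an illustrative assumption)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def softmax_attention(q, k, v):
    """Standard attention: materializes the (n, n) score matrix, O(n^2 d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v):
    """Kernelized attention computed as phi(Q) (phi(K)^T V), O(n d^2)."""
    q_f, k_f = feature_map(q), feature_map(k)
    kv = k_f.T @ v           # (d, d_v) summary of keys and values
    z = k_f.sum(axis=0)      # (d,) summary used for row normalization
    return (q_f @ kv) / (q_f @ z)[:, None]


def linear_attention_explicit(q, k, v):
    """Same output via the explicit (n, n) kernel matrix; for checking only."""
    a = feature_map(q) @ feature_map(k).T
    return (a / a.sum(axis=-1, keepdims=True)) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(16, 4))
k = rng.normal(size=(16, 4))
v = rng.normal(size=(16, 3))
out = linear_attention(q, k, v)
```

Because matrix multiplication is associative, `linear_attention` never forms the n-by-n attention matrix, so its cost grows linearly with the sequence length n. Its output generally differs from that of `softmax_attention`, which is why swapping the attention in a trained policy would be followed by further fine-tuning; the abstract calls this step up-training.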