Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-action module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on modest hardware with no cloud services. On our proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield a 90% overall success rate; sub-action inference completes in under 3.8 s, and end-to-end execution takes 30-60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
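The two-stage flow can be illustrated with a minimal toy sketch. A keyword lookup stands in for the learned Instruct2Act parser, and a logging dispatcher stands in for DATRN trajectory generation; all names (`parse_instruction`, `execute_plan`, `TASK_TEMPLATES`) are illustrative assumptions, not the authors' API.

```python
# Toy sketch of the two-stage pipeline: instruction -> atomic action
# sequence -> per-sub-action execution. A rule-based lookup replaces
# the Instruct2Act BiLSTM; real execution would involve YOLOv8
# grounding and DATRN trajectory generation.

ATOMIC_ACTIONS = ("reach", "grasp", "move", "place", "pour", "wipe", "give")

# Keyword-to-sub-action templates standing in for the learned parser,
# matching the four evaluated tasks.
TASK_TEMPLATES = {
    "pick-place": ["reach", "grasp", "move", "place"],
    "pick-pour":  ["reach", "grasp", "move", "pour", "place"],
    "wipe":       ["reach", "grasp", "wipe", "place"],
    "pick-give":  ["reach", "grasp", "move", "give"],
}

def parse_instruction(text: str) -> list[str]:
    """Stage (i): map a free-form instruction to an ordered sequence
    of atomic actions (stand-in for Instruct2Act)."""
    text = text.lower()
    if "pour" in text:
        return TASK_TEMPLATES["pick-pour"]
    if "wipe" in text:
        return TASK_TEMPLATES["wipe"]
    if "give" in text or "hand" in text:
        return TASK_TEMPLATES["pick-give"]
    return TASK_TEMPLATES["pick-place"]

def execute_plan(instruction: str) -> list[str]:
    """Stage (ii): dispatch each sub-action in order. In the real
    system each step would produce a DATRN control trajectory; here
    we only log the dispatch order."""
    log = []
    for action in parse_instruction(instruction):
        assert action in ATOMIC_ACTIONS
        log.append(f"execute {action}")
    return log

print(execute_plan("pour the water into the cup"))
```

The fixed atomic-action vocabulary is what makes the pipeline deterministic: every instruction resolves to a finite, ordered sequence of known primitives, each of which the trajectory generator can handle independently.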