Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-action module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on modest hardware with no cloud services. On our proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield a 90% overall success rate; sub-action inference completes in under 3.8 s, and end-to-end execution takes 30-60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
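The two-stage flow can be illustrated with a minimal toy sketch. A keyword lookup stands in for the learned Instruct2Act parser, and a logging dispatcher stands in for DATRN trajectory generation; all names (`parse_instruction`, `execute_plan`, `TASK_TEMPLATES`) are illustrative assumptions, not the authors' API.

```python
# Toy sketch of the two-stage pipeline: instruction -> atomic action
# sequence -> per-sub-action execution. A rule-based lookup replaces
# the Instruct2Act BiLSTM; real execution would involve YOLOv8
# grounding and DATRN trajectory generation.

ATOMIC_ACTIONS = ("reach", "grasp", "move", "place", "pour", "wipe", "give")

# Keyword-to-sub-action templates standing in for the learned parser,
# matching the four evaluated tasks.
TASK_TEMPLATES = {
    "pick-place": ["reach", "grasp", "move", "place"],
    "pick-pour":  ["reach", "grasp", "move", "pour", "place"],
    "wipe":       ["reach", "grasp", "wipe", "place"],
    "pick-give":  ["reach", "grasp", "move", "give"],
}

def parse_instruction(text: str) -> list[str]:
    """Stage (i): map a free-form instruction to an ordered sequence
    of atomic actions (stand-in for Instruct2Act)."""
    text = text.lower()
    if "pour" in text:
        return TASK_TEMPLATES["pick-pour"]
    if "wipe" in text:
        return TASK_TEMPLATES["wipe"]
    if "give" in text or "hand" in text:
        return TASK_TEMPLATES["pick-give"]
    return TASK_TEMPLATES["pick-place"]

def execute_plan(instruction: str) -> list[str]:
    """Stage (ii): dispatch each sub-action in order. In the real
    system each step would produce a DATRN control trajectory; here
    we only log the dispatch order."""
    log = []
    for action in parse_instruction(instruction):
        assert action in ATOMIC_ACTIONS
        log.append(f"execute {action}")
    return log

print(execute_plan("pour the water into the cup"))
```

The fixed atomic-action vocabulary is what makes the pipeline deterministic: every instruction resolves to a finite, ordered sequence of known primitives, each of which the trajectory generator can handle independently.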