Control of Microrobots with Reinforcement Learning under On-Device Compute Constraints

Robust locomotion over terrain is a core capability of autonomous microrobots. This paper explores an edge ML approach to microrobot locomotion that enables on-device, lower-latency control under compute, memory, and power constraints. We learn locomotion for a sub-centimeter quadrupedal microrobot via reinforcement learning (RL) and deploy the resulting controller on an ultra-small system-on-chip (SoC), SC$μ$M-3C, which features an ARM Cortex-M0 microcontroller running at 5 MHz. We train a compact FP32 multilayer perceptron (MLP) policy with two hidden layers ($[128, 64]$) in a massively parallel GPU simulation and improve its robustness through domain randomization over simulation parameters. We then study integer (Int8) quantization, both per-tensor and per-feature, to enable higher inference update rates on our resource-limited hardware, and we relate hardware power budgets to the achievable update frequency through a cycles-per-update model of inference on the Cortex-M0. We propose a resource-aware gait scheduling viewpoint: given a device power budget, we select the gait mode (trot, intermediate, or gallop) that maximizes expected RL reward at the corresponding feasible update frequency. Finally, we deploy our MLP policy on a real-world large-scale robot on uneven terrain and note qualitatively that domain-randomized training can improve out-of-distribution stability. We do not claim empirical zero-shot transfer to the real-world large-scale robot in this work.
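The link between power budget, update frequency, and gait choice described above can be illustrated with a minimal sketch. Only the hidden-layer sizes ($[128, 64]$), the 5 MHz clock, and the three gait names come from the abstract. Everything else is an assumption made for illustration: the input and output dimensions, the cycles per multiply-accumulate, the energy per cycle, the linear power model, and the reward table are hypothetical placeholders, not measured values from this work.

```python
def mlp_macs(layer_sizes):
    """Multiply-accumulate count for one forward pass of a dense MLP."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def feasible_update_hz(power_w, energy_per_cycle_j, cycles_per_update, clock_hz):
    """Highest update rate allowed by both the power budget and the clock.

    Assumed model: average power = energy per cycle * cycles per update * rate.
    """
    power_limited = power_w / (energy_per_cycle_j * cycles_per_update)
    clock_limited = clock_hz / cycles_per_update
    return min(power_limited, clock_limited)


def select_gait(update_hz, reward_table):
    """Return (gait, reward) with the highest expected reward at a feasible rate.

    reward_table maps gait -> {evaluated_rate_hz: expected_reward}.
    Returns None if no gait was evaluated at or below update_hz.
    """
    best = None
    for gait, by_rate in reward_table.items():
        usable = [reward for rate, reward in by_rate.items() if rate <= update_hz]
        if usable and (best is None or max(usable) > best[1]):
            best = (gait, max(usable))
    return best


# Hidden layers [128, 64] are from the abstract; 48 inputs and 12 outputs are hypothetical.
LAYERS = [48, 128, 64, 12]
CLOCK_HZ = 5_000_000        # 5 MHz Cortex-M0 clock, from the abstract
CYCLES_PER_MAC = 4          # hypothetical cost of one Int8 multiply-accumulate
ENERGY_PER_CYCLE_J = 1e-10  # hypothetical energy per clock cycle (100 pJ)

# Hypothetical expected rewards per gait at a few evaluated update rates (Hz).
REWARDS = {
    "trot": {10: 0.60, 25: 0.70},
    "intermediate": {25: 0.65, 50: 0.80},
    "gallop": {50: 0.75, 80: 0.90},
}

cycles_per_update = mlp_macs(LAYERS) * CYCLES_PER_MAC
rate_hz = feasible_update_hz(200e-6, ENERGY_PER_CYCLE_J, cycles_per_update, CLOCK_HZ)
choice = select_gait(rate_hz, REWARDS)
```

With these placeholder numbers, a 200 µW budget is power-limited to an update rate well below the clock-limited ceiling, so only the gaits evaluated at lower rates are candidates. Raising the budget moves the feasible rate toward the clock limit and makes the faster gaits selectable.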
