RADIUM: Predicting and Repairing End-to-End Robot Failures using Gradient-Accelerated Sampling

Before autonomous systems can be deployed in safety-critical applications, we must be able to understand and verify their safety. For cases where the risk or cost of real-world testing is prohibitive, we propose a simulation-based framework for a) predicting ways in which an autonomous system is likely to fail and b) automatically adjusting the system's design and control policy to preemptively mitigate those failures. Existing tools for failure prediction struggle to search over high-dimensional environmental parameters, cannot efficiently handle end-to-end testing for systems with vision in the loop, and provide little guidance on how to mitigate failures once they are discovered. We approach this problem through the lens of approximate Bayesian inference and use differentiable simulation and rendering for efficient failure case prediction and repair. For cases where a differentiable simulator is not available, we provide a gradient-free version of our algorithm, and we include a theoretical and empirical evaluation of the trade-offs between gradient-based and gradient-free methods. We apply our approach to a range of robotics and control problems, including optimizing search patterns for robot swarms, UAV formation control, and robust network control. Compared to optimization-based falsification methods, our method predicts a more diverse, representative set of failure modes. We also find that our use of differentiable simulation yields solutions that have up to 10x lower cost and require up to 2x fewer iterations to converge relative to gradient-free techniques. In hardware experiments, we find that repairing control policies using our method leads to a 5x robustness improvement. Accompanying code and video can be found at https://mit-realm.github.io/radium/
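To make the contrast between gradient-based and gradient-free sampling concrete, the following is a minimal sketch on a toy problem, not the algorithm from this work. It assumes a scalar environment parameter `x`, a standard-normal prior, and a hypothetical closed-form "failure severity" term standing in for a simulator rollout. The gradient-based sampler shown is the Metropolis-adjusted Langevin algorithm (MALA) and the gradient-free one is random-walk Metropolis. All function names and constants are illustrative assumptions.

```python
import math
import random


def log_density(x):
    """Unnormalized log density over a scalar environment parameter x.

    Assumed form: standard-normal prior plus a toy severity term peaking
    at x = 2 (a hypothetical stand-in for a simulated failure cost).
    """
    log_prior = -0.5 * x * x
    severity = -2.0 * (x - 2.0) ** 2
    return log_prior + severity


def grad_log_density(x):
    """Analytic gradient of log_density.

    In practice a differentiable simulator would supply this via autodiff.
    """
    return -x - 4.0 * (x - 2.0)


def accept(log_ratio, rng):
    """Metropolis accept test; 1 - random() lies in (0, 1], so log is safe."""
    return math.log(1.0 - rng.random()) < log_ratio


def mala_step(x, step, rng):
    """One MALA step: the proposal drifts along the gradient."""
    mean_fwd = x + 0.5 * step * grad_log_density(x)
    prop = mean_fwd + math.sqrt(step) * rng.gauss(0.0, 1.0)
    mean_rev = prop + 0.5 * step * grad_log_density(prop)
    # Log proposal densities q(prop | x) and q(x | prop), up to a shared constant.
    log_q_fwd = -((prop - mean_fwd) ** 2) / (2.0 * step)
    log_q_rev = -((x - mean_rev) ** 2) / (2.0 * step)
    log_ratio = log_density(prop) - log_density(x) + log_q_rev - log_q_fwd
    return prop if accept(log_ratio, rng) else x


def rwm_step(x, step, rng):
    """One random-walk Metropolis step: the proposal ignores the gradient."""
    prop = x + math.sqrt(step) * rng.gauss(0.0, 1.0)
    log_ratio = log_density(prop) - log_density(x)
    return prop if accept(log_ratio, rng) else x


def run_chain(step_fn, x0, n_steps, step, seed):
    """Run a Markov chain from x0 and return every visited state."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = step_fn(x, step, rng)
        samples.append(x)
    return samples


if __name__ == "__main__":
    for name, fn in [("MALA", mala_step), ("random walk", rwm_step)]:
        chain = run_chain(fn, x0=-3.0, n_steps=5000, step=0.2, seed=0)
        kept = chain[500:]  # discard burn-in
        print(name, "posterior mean estimate:", sum(kept) / len(kept))
```

For this toy density the target is Gaussian with mean 1.6, so both chains should settle near that value. The only difference between the two step functions is whether the proposal uses gradient information, which is the trade-off the abstract refers to.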
