Rethink Repeatable Measures of Robot Performance with Statistical Query

A general standardized testing algorithm designed to evaluate a specific aspect of a robot's performance is commonly expected to meet several key requirements. Beyond accuracy (i.e., closeness to a typically unknown ground-truth reference) and efficiency (i.e., feasibility within acceptable testing costs and equipment constraints), one particularly important attribute is repeatability. Repeatability refers to the ability to consistently obtain the same testing outcome when similar testing algorithms are executed on the same subject robot by different stakeholders, at different times or locations. However, achieving repeatable testing has become increasingly challenging as the components involved grow more complex, intelligent, diverse, and, most importantly, stochastic. While related efforts have addressed repeatability at ethical, hardware, and procedural levels, this study focuses on repeatable testing at the algorithmic level. In particular, we target a widely adopted class of testing algorithms in standardized evaluation: statistical query (SQ) algorithms (i.e., algorithms that estimate the expected value of a bounded function over a distribution using sampled data). We propose a lightweight, parameterized, and adaptive modification applicable to any SQ routine, whether based on Monte Carlo sampling, importance sampling, or adaptive importance sampling, that makes it provably repeatable, with guaranteed bounds on both accuracy and efficiency. We demonstrate the effectiveness of the proposed approach across three representative scenarios: (i) established and widely adopted standardized testing of manipulators, (ii) emerging intelligent testing algorithms for operational risk assessment in automated vehicles, and (iii) still-developing use cases involving command tracking performance evaluation of humanoid robots in locomotion tasks.

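The abstract does not spell out the proposed modification, so the following is only a generic illustration of what "repeatable" can mean for an SQ routine, not the authors' method. It pairs a plain Monte Carlo estimate of a [0, 1]-bounded query with a well-known device: rounding the estimate to a grid whose random offset is derived from a seed shared by all testers. The names `sq_estimate`, `round_to_shared_grid`, `width`, and `shared_seed`, as well as the pass/fail query with a 0.3 success rate, are hypothetical and chosen for illustration.

```python
import math
import random


def sq_estimate(query, sampler, n, rng):
    """Plain Monte Carlo statistical query: sample mean of a [0, 1]-bounded function."""
    return sum(query(sampler(rng)) for _ in range(n)) / n


def round_to_shared_grid(estimate, width, shared_seed):
    """Snap an estimate to the midpoint of its cell on a randomly shifted grid.

    The grid offset depends only on shared_seed (assumed to be agreed on by
    all testers beforehand), never on the samples of an individual run.
    The rounding moves the estimate by at most width / 2.
    """
    offset = random.Random(shared_seed).uniform(0.0, width)
    cell = math.floor((estimate - offset) / width)
    return offset + (cell + 0.5) * width


if __name__ == "__main__":
    # Hypothetical pass/fail trial: a test run "succeeds" with probability 0.3.
    def sampler(rng):
        return rng.random()

    def query(x):
        return 1.0 if x < 0.3 else 0.0

    shared_seed = 2024  # assumed to be published with the test protocol
    for run_seed in (1, 2):  # two independent testers with private randomness
        raw = sq_estimate(query, sampler, 20_000, random.Random(run_seed))
        reported = round_to_shared_grid(raw, width=0.05, shared_seed=shared_seed)
        print(f"run {run_seed}: raw={raw:.4f} reported={reported:.4f}")
```

Two runs report exactly the same value unless a grid boundary happens to fall between their raw estimates. Over the random choice of offset, that happens with probability roughly equal to the gap between the raw estimates divided by `width`. A wider grid therefore buys repeatability at the cost of accuracy, and more samples shrink the gap at the cost of efficiency, which mirrors the three-way trade-off the abstract describes.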