Reinforcement-learning (RL) agents often struggle when deployed from simulation to the real world. A dominant strategy for reducing the sim-to-real gap is domain randomization (DR), which trains the policy across many simulators produced by sampling dynamics parameters; however, standard DR ignores offline data already available from the real system. We study offline domain randomization (ODR), which first fits a distribution over simulator parameters to an offline dataset. While a growing body of empirical work reports substantial gains with algorithms such as DROPO, the theoretical foundations of ODR remain largely unexplored. In this work, we cast ODR as maximum-likelihood estimation over a parametric simulator family and provide statistical guarantees: under mild regularity and identifiability conditions, the estimator is weakly consistent (it converges in probability to the true dynamics as the dataset grows), and it becomes strongly consistent (i.e., it converges almost surely to the true dynamics) when an additional uniform Lipschitz continuity assumption holds. We examine the practicality of these assumptions and outline relaxations that justify ODR's applicability in a broader range of settings. Taken together, our results place ODR on a principled footing and clarify when offline data can soundly guide the choice of a randomization distribution for downstream offline RL.
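To make the maximum-likelihood view of ODR concrete, the following is a minimal sketch (not the paper's implementation) of the estimation step for a hypothetical one-dimensional Gaussian simulator family s' ~ N(theta * s, sigma^2): given an offline dataset of real-system transitions, the MLE of the dynamics parameter theta is recovered, and it concentrates around the true parameter as the dataset grows, mirroring the consistency guarantee. All names and the choice of simulator family here are illustrative assumptions.

```python
import numpy as np

# Illustrative ODR maximum-likelihood step (assumed toy setting, not the
# paper's algorithm): fit a dynamics parameter theta to offline
# (state, next_state) transitions under the Gaussian simulator family
#   s' ~ N(theta * s, sigma^2).

rng = np.random.default_rng(0)

theta_true, sigma = 0.8, 0.1           # unknown real-system dynamics
n = 500                                # offline dataset size
s = rng.normal(size=n)                 # offline states
s_next = theta_true * s + sigma * rng.normal(size=n)  # offline transitions

def nll(theta):
    """Negative log-likelihood of theta under the Gaussian family
    (up to an additive constant independent of theta)."""
    resid = s_next - theta * s
    return 0.5 * np.sum(resid ** 2) / sigma ** 2

# For this linear-Gaussian family the MLE has a closed form
# (ordinary least squares); consistency means theta_hat -> theta_true
# in probability as n grows.
theta_hat = float(np.dot(s, s_next) / np.dot(s, s))
print(theta_hat)
```

In practice the fitted quantity is a full distribution over simulator parameters rather than a point estimate, and the likelihood is typically maximized numerically; the closed form here exists only because of the toy linear-Gaussian assumption.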