Monte Carlo (MC) methods are the most widely used way to estimate the performance of a policy. Given a policy of interest, MC methods estimate its performance by repeatedly running the policy to collect samples and averaging the outcomes. Samples collected during this process are called online samples. To obtain an accurate estimate, MC methods require a large number of online samples. When online samples are expensive, e.g., in online recommendation and inventory management, we want to reduce the number of online samples while achieving the same estimation accuracy. To this end, we use off-policy MC methods, which evaluate the policy of interest by running a different policy, called the behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than that of the ordinary MC estimator. Importantly, this tailored behavior policy can be learned efficiently from existing offline data, i.e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples than the ordinary MC method to evaluate the performance of a policy. Moreover, our off-policy MC estimator is always unbiased.
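To make the contrast between ordinary and off-policy MC concrete, the following is a minimal sketch on a one-step problem (a three-armed bandit), not the method proposed in this work. It assumes deterministic rewards that are known in advance, so the behavior policy can be written down directly as proportional to the target probability times the absolute reward; this is the textbook variance-minimizing choice for importance sampling and stands in for a behavior policy that would, in practice, have to be learned from offline data. All function names, probabilities, and rewards are hypothetical.

```python
import random


def ordinary_mc(target, rewards, n, rng):
    """Ordinary MC: sample n actions from the target policy and average the rewards."""
    actions = rng.choices(range(len(target)), weights=target, k=n)
    return sum(rewards[a] for a in actions) / n


def off_policy_mc(target, behavior, rewards, n, rng):
    """Off-policy MC: sample from the behavior policy and reweight each reward
    by the importance ratio target[a] / behavior[a], which keeps the estimate
    unbiased as long as behavior[a] > 0 wherever target[a] * rewards[a] != 0."""
    actions = rng.choices(range(len(behavior)), weights=behavior, k=n)
    return sum(target[a] / behavior[a] * rewards[a] for a in actions) / n


def tailored_behavior(target, rewards):
    """Illustrative assumption: behavior(a) proportional to target(a) * |reward(a)|.
    This requires known rewards, which is not available in a real evaluation task."""
    weights = [p * abs(r) for p, r in zip(target, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def empirical_variance(xs):
    """Population variance of a list of estimates."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, 0.3, 0.2]      # hypothetical target policy
    rewards = [1.0, 2.0, 10.0]    # hypothetical deterministic rewards
    behavior = tailored_behavior(target, rewards)

    # Repeat each estimator many times with 50 online samples per estimate.
    mc = [ordinary_mc(target, rewards, 50, rng) for _ in range(200)]
    op = [off_policy_mc(target, behavior, rewards, 50, rng) for _ in range(200)]
    print("ordinary MC variance:  ", empirical_variance(mc))
    print("off-policy MC variance:", empirical_variance(op))
```

In this idealized setting every importance-weighted sample equals the true value of the target policy (0.5·1 + 0.3·2 + 0.2·10 = 3.1), so the off-policy estimator has zero variance up to floating-point rounding, while the ordinary MC estimate still fluctuates from run to run. With unknown or stochastic rewards the reduction would be partial rather than total.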