Monte Carlo (MC) methods are the most widely used methods for estimating the performance of a policy. Given a policy of interest, MC methods estimate its performance by repeatedly running the policy to collect samples and averaging the outcomes. Samples collected in this way are called online samples. An accurate estimate requires a large number of online samples. When online samples are expensive, e.g., in online recommendation and inventory management, we want to reduce the number of online samples needed to reach the same estimation accuracy. To this end, we use off-policy MC methods, which evaluate the policy of interest by running a different policy, called the behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than that of the ordinary MC estimator. Importantly, this tailored behavior policy can be learned efficiently from existing offline data, i.e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples than the ordinary MC method to evaluate the performance of a policy. Moreover, our off-policy MC estimator is always unbiased.
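To make the contrast between ordinary and off-policy MC concrete, the following is a minimal sketch on a toy one-step problem. It is not the method described above: the numbers in `TARGET` and `REWARD`, the helper names, and the choice of a behavior policy proportional to pi(a) * |r(a)| are all illustrative assumptions. The sketch uses plain importance sampling, where each sample is reweighted by pi(a) / mu(a), and compares the exact single-sample variance under the target policy with that under the reweighted behavior policy. With deterministic, positive rewards this particular behavior policy drives the variance to zero; in practice rewards are unknown and would have to be estimated, for instance from logged data.

```python
import random

# Toy one-step problem (hypothetical numbers): three actions, deterministic rewards.
TARGET = [0.7, 0.2, 0.1]    # target policy pi(a), the policy of interest
REWARD = [1.0, 5.0, 10.0]   # reward r(a) received after taking action a


def tailored_behavior(pi, r):
    """Illustrative behavior policy mu(a) proportional to pi(a) * |r(a)|."""
    weights = [p * abs(x) for p, x in zip(pi, r)]
    total = sum(weights)
    return [w / total for w in weights]


def exact_mean_and_variance(pi, mu, r):
    """Exact mean and variance of one sample (pi(a) / mu(a)) * r(a) with a ~ mu."""
    mean = sum(m * (p / m) * x for p, m, x in zip(pi, mu, r))
    second_moment = sum(m * ((p / m) * x) ** 2 for p, m, x in zip(pi, mu, r))
    return mean, second_moment - mean ** 2


def mc_estimate(pi, mu, r, n, rng):
    """Average of n importance-weighted samples collected by running mu.

    Passing mu == pi recovers the ordinary (on-policy) MC estimator.
    """
    actions = rng.choices(range(len(mu)), weights=mu, k=n)
    return sum(pi[a] / mu[a] * r[a] for a in actions) / n


BEHAVIOR = tailored_behavior(TARGET, REWARD)
on_mean, on_var = exact_mean_and_variance(TARGET, TARGET, REWARD)
off_mean, off_var = exact_mean_and_variance(TARGET, BEHAVIOR, REWARD)

if __name__ == "__main__":
    rng = random.Random(0)
    print("on-policy : mean", on_mean, "variance", on_var)
    print("off-policy: mean", off_mean, "variance", off_var)
    print("on-policy estimate, 100 samples :", mc_estimate(TARGET, TARGET, REWARD, 100, rng))
    print("off-policy estimate, 100 samples:", mc_estimate(TARGET, BEHAVIOR, REWARD, 100, rng))
```

Both estimators have the same expectation, which is the unbiasedness property; only the variance differs. The importance weight is well defined here because the behavior policy assigns positive probability to every action the target policy can take.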