In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. This is the first work to focus on optimal data collection strategies for policy evaluation under heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the mean squared error (MSE) of the estimated value of the target policy. We then use this formulation to derive the optimal allocation of samples per action during data collection. Next, we introduce a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal design, and we derive its regret with respect to the optimal design. Finally, we empirically validate that SPEED leads to policy evaluation with MSE comparable to that of the oracle strategy and significantly lower than that obtained by simply running the target policy.
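To make the estimation step concrete, the following is a minimal sketch of a weighted least squares estimate of a target policy's value in a linear bandit with heteroscedastic noise. It is an illustration only and is not the SPEED algorithm. The function name `wls_policy_value`, its arguments, and the assumption that the per-sample noise variances are known are all hypothetical choices made for this example.

```python
import numpy as np

def wls_policy_value(X, r, sigma2, pi, actions):
    """Weighted least squares estimate of a target policy's value.

    Assumptions (illustrative, not taken from the paper):
      X       : (n, d) feature vectors of the actions that were sampled
      r       : (n,)   observed rewards, modeled as x^T theta + noise
      sigma2  : (n,)   noise variance of each sample, assumed known
      pi      : (A,)   target policy probabilities over the A actions
      actions : (A, d) feature vector of every action
    """
    w = 1.0 / sigma2                      # inverse-variance weights
    gram = X.T @ (w[:, None] * X)         # sum_i w_i x_i x_i^T
    rhs = X.T @ (w * r)                   # sum_i w_i x_i r_i
    theta_hat = np.linalg.solve(gram, rhs)
    value_hat = float(pi @ (actions @ theta_hat))  # sum_a pi(a) x_a^T theta_hat
    return value_hat, theta_hat
```

In this sketch, samples with larger noise variance receive smaller weight, which is why the allocation of samples across actions affects the MSE of the resulting value estimate.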