In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first to focus on optimal data collection strategies for policy evaluation under heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the mean squared error (MSE) of the estimated value of the target policy. We then use this formulation to derive the optimal allocation of samples per action during data collection. Next, we introduce a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal design, and we derive its regret with respect to the optimal design. Finally, we empirically validate that SPEED leads to policy evaluation with MSE comparable to the oracle strategy and significantly lower than simply running the target policy.
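To make the weighted least squares idea concrete, the following is a minimal sketch and not the SPEED algorithm itself. It assumes a finite action set with known per-action noise variances, which is a simplification of the heteroscedastic setting described above. The function names `wls_policy_value` and `value_variance`, and all argument names, are hypothetical and chosen only for illustration.

```python
import numpy as np

def wls_policy_value(X, actions, rewards, noise_var, pi):
    """Weighted least squares estimate of the target policy's value.

    X: (K, d) feature vector of each action (assumed known).
    actions: (n,) indices of the actions that were sampled.
    rewards: (n,) observed rewards.
    noise_var: (K,) noise variance of each action (assumed known here).
    pi: (K,) target policy, a probability vector over actions.
    """
    feats = X[actions]                      # (n, d) features of sampled actions
    inv_var = 1.0 / noise_var[actions]      # weight each sample by 1 / variance
    A = (feats * inv_var[:, None]).T @ feats
    b = feats.T @ (inv_var * rewards)
    theta_hat = np.linalg.solve(A, b)       # WLS estimate of the reward parameter
    return float(pi @ X @ theta_hat)        # estimated value of the target policy

def value_variance(X, counts, noise_var, pi):
    """Variance of the WLS value estimate when action a is sampled counts[a] times."""
    A = (X * (counts / noise_var)[:, None]).T @ X
    w = X.T @ pi                            # policy-weighted feature vector
    return float(w @ np.linalg.solve(A, w))

# Toy example: two actions with orthogonal features and different noise levels.
X = np.eye(2)
noise_var = np.array([1.0, 4.0])
pi = np.array([0.5, 0.5])
estimate = wls_policy_value(
    X, np.array([0, 1, 0, 1]), np.array([1.0, 2.0, 1.0, 2.0]), noise_var, pi
)
variance = value_variance(X, np.array([2.0, 2.0]), noise_var, pi)
```

Under these assumptions, an allocation of samples across actions can be compared by the value of `value_variance`: a smaller value corresponds to a lower-variance estimate of the target policy's value for the same sampling budget.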