In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. We focus on the linear bandit setting with heteroscedastic reward noise. This is the first work to study optimal data collection for policy evaluation under heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the mean squared error (MSE) of the estimated value of the target policy. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with MSE comparable to that of the oracle strategy and significantly lower than that of simply running the target policy.
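To make the setting concrete, the following is a minimal sketch of weighted least squares for policy evaluation under heteroscedastic noise. It is not the SPEED algorithm and not necessarily the paper's policy-weighted formulation: it uses standard inverse-variance weighting, assumes the per-arm noise variances are known, and takes the behavior policy as a plain input rather than deriving an optimal one. All function and parameter names (`wls_estimate`, `policy_value`, `collect_and_evaluate`, `noise_vars`, and so on) are hypothetical and chosen for illustration only.

```python
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def wls_estimate(features, rewards, variances):
    """Inverse-variance weighted least squares estimate of theta.

    Assumption: each reward is x^T theta + noise, with known noise variance.
    """
    d = len(features[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, r, var in zip(features, rewards, variances):
        w = 1.0 / var  # noisier samples get less weight
        for i in range(d):
            b[i] += w * x[i] * r
            for j in range(d):
                A[i][j] += w * x[i] * x[j]
    return solve_linear(A, b)


def policy_value(target_policy, arm_features, theta):
    """Expected reward of a policy: sum_a pi(a) * x_a^T theta."""
    return sum(
        p * sum(xi * ti for xi, ti in zip(x, theta))
        for p, x in zip(target_policy, arm_features)
    )


def collect_and_evaluate(arm_features, theta_true, noise_vars,
                         behavior_policy, target_policy, n, seed=0):
    """Sample n rewards with the behavior policy, then estimate the
    target policy's value from the weighted least squares fit."""
    rng = random.Random(seed)
    arms = range(len(arm_features))
    xs, rs, vs = [], [], []
    for _ in range(n):
        a = rng.choices(arms, weights=behavior_policy, k=1)[0]
        x = arm_features[a]
        mean = sum(xi * ti for xi, ti in zip(x, theta_true))
        xs.append(x)
        rs.append(mean + rng.gauss(0.0, noise_vars[a] ** 0.5))
        vs.append(noise_vars[a])
    theta_hat = wls_estimate(xs, rs, vs)
    return policy_value(target_policy, arm_features, theta_hat)
```

Passing the target policy itself as `behavior_policy` corresponds to the baseline of simply running the target policy; comparing it against other behavior policies over many seeds gives an empirical view of how the data collection strategy affects the error of the value estimate.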