Making online decisions can be challenging when features are sparse and orthogonal to historical ones, especially when the optimal policy is learned through collaborative filtering. We formulate the problem as a matrix completion bandit (MCB), where the expected reward under each arm is characterized by an unknown low-rank matrix. The $\epsilon$-greedy bandit and the online gradient descent algorithm are explored. Policy learning and regret performance are studied under a specific schedule for exploration probabilities and step sizes. A faster decaying exploration probability yields smaller regret but learns the optimal policy less accurately. We investigate an online debiasing method based on inverse propensity weighting (IPW) and a general framework for online policy inference. The IPW-based estimators are asymptotically normal under mild arm-optimality conditions. Numerical simulations corroborate our theoretical findings. Our methods are applied to the San Francisco parking pricing project data, revealing intriguing discoveries and outperforming the benchmark policy.
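As a rough illustration of the setup the abstract describes, the sketch below simulates an $\epsilon$-greedy policy whose low-rank reward estimates are updated by online gradient descent, and accumulates an IPW-corrected estimate alongside it. It is a minimal sketch under stated assumptions, not the paper's algorithm. The schedule $\epsilon_t = \epsilon_0 t^{-\gamma}$, the uniform sampling of matrix entries as contexts, the Gaussian noise, the constant step size, the particular form of the IPW correction, and every function and parameter name (`exploration_prob`, `propensity`, `run_mcb`) are assumptions made for illustration only.

```python
import numpy as np


def exploration_prob(t, eps0=0.5, gamma=0.5):
    """Assumed schedule eps_t = min(1, eps0 * t**(-gamma)); larger gamma decays faster."""
    return min(1.0, eps0 * t ** (-gamma))


def propensity(arm, greedy_arm, eps, n_arms):
    """Probability that epsilon-greedy selects `arm` given the current greedy arm."""
    return eps / n_arms + (1.0 - eps) * (arm == greedy_arm)


def run_mcb(M, T=2000, rank=2, step=0.05, gamma=0.5, noise=0.1, seed=0):
    """Simulate epsilon-greedy with per-entry online gradient descent.

    M has shape (n_arms, d1, d2); M[k, i, j] is the expected reward of arm k
    at context (i, j). Returns the factors, an IPW-corrected estimate of M,
    and the cumulative regret path.
    """
    rng = np.random.default_rng(seed)
    n_arms, d1, d2 = M.shape
    U = 0.1 * rng.standard_normal((n_arms, d1, rank))
    V = 0.1 * rng.standard_normal((n_arms, d2, rank))
    debiased = np.zeros(M.shape)
    regret = np.zeros(T)
    total = 0.0
    for t in range(1, T + 1):
        # Assumption: the context is a uniformly sampled matrix entry.
        i, j = rng.integers(d1), rng.integers(d2)
        preds = np.array([U[k, i] @ V[k, j] for k in range(n_arms)])
        greedy = int(np.argmax(preds))
        eps = exploration_prob(t, gamma=gamma)
        arm = int(rng.integers(n_arms)) if rng.random() < eps else greedy
        pi = propensity(arm, greedy, eps, n_arms)
        y = M[arm, i, j] + noise * rng.standard_normal()
        resid = y - preds[arm]

        # Assumed IPW correction: current estimate plus a reweighted residual
        # at the observed entry of the pulled arm (other arms get no residual).
        debiased += U @ V.transpose(0, 2, 1)
        debiased[arm, i, j] += d1 * d2 * resid / pi

        # Gradient step on the squared loss 0.5 * (pred - y)**2 for the pulled arm.
        u_old = U[arm, i].copy()
        U[arm, i] += step * resid * V[arm, j]
        V[arm, j] += step * resid * u_old

        total += M[:, i, j].max() - M[arm, i, j]
        regret[t - 1] = total
    return U, V, debiased / T, regret


# Toy rank-2 problem with two arms and rewards in [0, 1].
rng = np.random.default_rng(1)
A = rng.uniform(size=(2, 8, 2))
B = rng.uniform(size=(2, 6, 2))
M = A @ B.transpose(0, 2, 1) / 2.0
U, V, M_ipw, regret = run_mcb(M)
```

Rerunning `run_mcb` with a larger `gamma` is one way to probe the trade-off stated in the abstract: exploration dies out sooner, so the propensities of the non-greedy arm shrink and the IPW weights `1 / pi` grow.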