This paper investigates regret minimization, statistical inference, and their interplay in high-dimensional online decision-making under the sparse linear contextual bandit model. We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters, and we introduce an inference framework based on a debiasing method using inverse propensity weighting. Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference, indicating an unavoidable trade-off between exploration and exploitation. If a diverse covariate condition holds, we demonstrate that a pure-greedy (i.e., exploration-free) bandit algorithm, combined with a debiased estimator based on average weighting, can simultaneously achieve the optimal $O(\log T)$ regret and $O(T^{1/2})$-consistent inference. We also show that a simple sample mean estimator provides valid inference for the optimal policy's value. Numerical simulations and experiments on Warfarin dosing data validate the effectiveness of our methods.
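The decision-making loop described above can be sketched in a minimal form: an $\varepsilon$-greedy arm choice with a decaying exploration rate, paired with a hard-thresholded least-squares update that keeps only the $s$ largest coefficients. All names, the exploration schedule, and the data-generating setup below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_threshold(beta, s):
    """Keep the s largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-s:]
    out[keep] = beta[keep]
    return out

def eps_greedy_sparse_bandit(T=500, d=20, s=3, K=2, eps0=0.5):
    # Hypothetical setup: K arms, each with an s-sparse parameter vector.
    beta_true = np.zeros((K, d))
    for k in range(K):
        supp = rng.choice(d, s, replace=False)
        beta_true[k, supp] = rng.normal(0.0, 1.0, s)

    X_hist = [[] for _ in range(K)]   # contexts observed per arm
    y_hist = [[] for _ in range(K)]   # rewards observed per arm
    beta_hat = np.zeros((K, d))
    regret = 0.0

    for t in range(1, T + 1):
        x = rng.normal(0.0, 1.0, d)
        eps_t = min(1.0, eps0 / np.sqrt(t))  # decaying exploration rate
        if rng.random() < eps_t:
            a = int(rng.integers(K))          # explore: uniform arm
        else:
            a = int(np.argmax(beta_hat @ x))  # exploit: greedy arm
        reward = x @ beta_true[a] + rng.normal(0.0, 0.1)
        regret += np.max(beta_true @ x) - x @ beta_true[a]

        # Hard-thresholded least squares for the pulled arm's parameter.
        X_hist[a].append(x)
        y_hist[a].append(reward)
        ls, *_ = np.linalg.lstsq(np.array(X_hist[a]),
                                 np.array(y_hist[a]), rcond=None)
        beta_hat[a] = hard_threshold(ls, s)

    return regret, beta_hat
```

The decaying rate $\varepsilon_t \propto t^{-1/2}$ is one common choice; the pure-greedy variant discussed in the abstract corresponds to setting `eps0 = 0`.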