We propose a linear contextual bandit algorithm with an $O(\sqrt{dT\log T})$ regret bound, where $d$ is the dimension of the contexts and $T$ is the time horizon. The algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the outcome of the randomization, the estimator takes contributions either from the contexts of all arms or only from the selected contexts. We establish a self-normalized bound for this estimator, which allows a novel decomposition of the cumulative regret into \textit{additive} dimension-dependent terms instead of multiplicative ones. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of the proposed algorithm matches the lower bound up to logarithmic factors. Numerical experiments support the theoretical guarantees and show that the proposed method outperforms existing linear bandit algorithms.