The logistic bandit is a ubiquitous framework for modeling users' choices, e.g., click vs. no click in an advertisement recommender system. We observe that prior works overlook or neglect the dependence on $S \geq \lVert \theta_\star \rVert_2$, where $\theta_\star \in \mathbb{R}^d$ is the unknown parameter vector, which is particularly problematic when $S$ is large, e.g., $S \geq d$. In this work, we improve the dependence on $S$ via a novel approach called {\it regret-to-confidence set conversion (R2CS)}, which allows us to construct a convex confidence set based only on the \textit{existence} of an online learning algorithm with a regret guarantee. Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility and the dependence on other factors such as $d$ and $T$. We apply our new confidence set to the regret analysis of logistic bandits with a new martingale concentration step that circumvents an additional factor of $S$. We then extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS. While we apply R2CS to the (multinomial) logistic model, R2CS is a generic approach for developing confidence sets that can be used for various models, which may be of independent interest.