In online exploration systems where users with fixed preferences arrive repeatedly, it has recently been shown that O(1) regret, i.e., bounded regret, can be achieved when the system is modeled as a linear contextual bandit. This result may be of interest for recommender systems, where item popularity is often short-lived, because exploration may be completed quickly, before potential long-run non-stationarities come into play. However, in practice, exact knowledge of the linear model is difficult to justify. Furthermore, the potential existence of unobservable covariates, uneven user arrival rates, the interpretation of the necessary rank condition, and users opting out of private data tracking all need to be addressed for practical recommender system applications. In this work, we conduct a theoretical study that addresses all of these issues while still achieving bounded regret. Aside from proof techniques, the key differentiating assumption we make is the presence of effective Synthetic Control Methods (SCM), which are shown to be a practical relaxation of the assumption of exact linear model knowledge. We verify our theoretical bounded regret result using a minimal simulation experiment.