We study a stochastic contextual linear bandit problem in which the agent observes a noisy, corrupted version of the true context through a noise channel with an unknown noise parameter. Our objective is to design an action policy that approximates that of an oracle, which has access to the reward model, the channel parameter, and the predictive distribution of the true context given the observed noisy context. In a Bayesian framework, we introduce a Thompson sampling algorithm for Gaussian bandits with Gaussian context noise. Using an information-theoretic analysis, we characterize the Bayesian regret of our algorithm with respect to the oracle's action policy. We also extend the problem to a setting where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower Bayesian regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.