We study a stochastic contextual linear bandit problem in which the agent observes a noisy, corrupted version of the true context through a noise channel with an unknown noise parameter. Our objective is to design an action policy that approximates that of an oracle, which has access to the reward model, the channel parameter, and the predictive distribution of the true context given the observed noisy context. In a Bayesian framework, we introduce a Thompson sampling algorithm for Gaussian bandits with Gaussian context noise. Adopting an information-theoretic analysis, we analyze the Bayesian regret of our algorithm with respect to the oracle's action policy. We also extend this problem to a scenario where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower Bayesian regret. Finally, we empirically evaluate the performance of the proposed algorithms against baselines.
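To make the setting concrete, the following is a minimal sketch of Thompson sampling for a Gaussian bandit with Gaussian context noise. It is not the algorithm proposed in this work: for simplicity it assumes the channel parameters (`sigma_c2`, `sigma_n2`) are known, whereas the problem above treats the noise parameter as unknown. It also assumes a separate parameter vector per action and isotropic covariances. All names (`posterior_mean_context`, `LinearGaussianTS`, `run`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def posterior_mean_context(noisy_context, sigma_c2, sigma_n2):
    """E[c | c_hat] when c ~ N(0, sigma_c2 I) and c_hat = c + n, n ~ N(0, sigma_n2 I).

    Assumption: the channel parameters are known here (unknown in the paper's setting).
    """
    shrink = sigma_c2 / (sigma_c2 + sigma_n2)
    return shrink * noisy_context


class LinearGaussianTS:
    """Thompson sampling with an independent Gaussian posterior per action."""

    def __init__(self, n_actions, dim, prior_var=1.0, reward_var=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.reward_var = reward_var
        # Posterior stored in information form: precision matrix and vector b.
        self.precision = np.stack([np.eye(dim) / prior_var for _ in range(n_actions)])
        self.b = np.zeros((n_actions, dim))

    def select(self, feature):
        scores = []
        for prec, b in zip(self.precision, self.b):
            cov = np.linalg.inv(prec)
            cov = (cov + cov.T) / 2.0  # symmetrize against round-off
            theta = self.rng.multivariate_normal(cov @ b, cov)  # posterior sample
            scores.append(theta @ feature)
        return int(np.argmax(scores))

    def update(self, action, feature, reward):
        # Conjugate Bayesian linear-regression update for the chosen action.
        self.precision[action] += np.outer(feature, feature) / self.reward_var
        self.b[action] += feature * reward / self.reward_var


def run(horizon=200, n_actions=3, dim=2, sigma_c2=1.0, sigma_n2=0.5, seed=0):
    """Simulate the agent and return cumulative regret against the oracle."""
    rng = np.random.default_rng(seed)
    theta_true = rng.normal(size=(n_actions, dim))
    agent = LinearGaussianTS(n_actions, dim, seed=seed + 1)
    total_regret = 0.0
    for _ in range(horizon):
        c = rng.normal(scale=np.sqrt(sigma_c2), size=dim)              # true context
        c_hat = c + rng.normal(scale=np.sqrt(sigma_n2), size=dim)      # noisy context
        feature = posterior_mean_context(c_hat, sigma_c2, sigma_n2)
        action = agent.select(feature)
        reward = theta_true[action] @ c + rng.normal()                 # depends on true c
        agent.update(action, feature, reward)
        # The oracle knows theta_true and acts on the predictive mean of c.
        oracle_values = theta_true @ feature
        total_regret += oracle_values.max() - oracle_values[action]
    return total_regret
```

Calling `run(horizon=500)` returns the cumulative regret of the sketched agent relative to an oracle that knows the reward parameters and acts on the predictive mean of the true context. Regressing rewards on the denoised context treats the residual context uncertainty as additional reward noise, which is a simplification adopted only for this sketch.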