An agent in a nonstationary contextual bandit problem must balance exploration against the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive way to transform a nonstationary problem into a stationary one that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. To address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach combines features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although providing theoretical performance guarantees for our new approach is very difficult, we prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
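As a rough illustration of the combination described in the abstract, the following minimal sketch pairs a history encoder with linear posterior sampling (Thompson sampling on a Bayesian linear reward model). It is not the authors' implementation. The names `encode_history` and `LinearPosteriorSampling`, the Gaussian prior and noise variances, and the untrained tanh recurrence standing in for a trained recurrent neural network are all assumptions made for this example.

```python
import numpy as np


def encode_history(history, W_h, W_x):
    """Hypothetical stand-in for a recurrent feature extractor.

    Runs an untrained tanh recurrence over the raw history and returns the
    final hidden state. A trained recurrent network would take its place.
    """
    h = np.zeros(W_h.shape[0])
    for x in history:
        h = np.tanh(W_h @ h + W_x @ x)
    return h


class LinearPosteriorSampling:
    """Thompson sampling with a Gaussian posterior over linear reward weights."""

    def __init__(self, dim, prior_var=1.0, noise_var=1.0, seed=0):
        self.noise_var = noise_var                # assumed known reward noise variance
        self.precision = np.eye(dim) / prior_var  # posterior precision matrix
        self.b = np.zeros(dim)                    # precision-weighted sum of feature * reward
        self.rng = np.random.default_rng(seed)

    def posterior(self):
        """Return the posterior mean and covariance of the weights."""
        cov = np.linalg.inv(self.precision)
        cov = (cov + cov.T) / 2.0                 # symmetrize against round-off
        return cov @ self.b, cov

    def select(self, arm_features):
        """Sample weights from the posterior and pick the best arm under them.

        arm_features has shape (n_arms, dim), one feature vector per arm.
        """
        mean, cov = self.posterior()
        theta = self.rng.multivariate_normal(mean, cov)
        return int(np.argmax(arm_features @ theta))

    def update(self, x, reward):
        """Bayesian linear regression update for one (features, reward) pair."""
        x = np.asarray(x, dtype=float)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += x * reward / self.noise_var
```

In a full loop under these assumptions, each round would encode the raw interaction history, build one feature vector per arm from the encoding, call `select`, observe the reward, and call `update` with the chosen arm's features. How the recurrent network itself is trained is outside the scope of this sketch.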