Real-world decision-making requires grappling with a perpetual lack of data as environments change; intelligent agents must comprehend uncertainty and actively gather information to resolve it. We propose a new framework for learning bandit algorithms from massive historical data, which we demonstrate in a cold-start recommendation problem. First, we use historical data to pretrain an autoregressive model to predict a sequence of repeated feedback/rewards (e.g., responses to news articles shown to different users over time). In learning to make accurate predictions, the model implicitly learns an informed prior based on rich action features (e.g., article headlines) and how to sharpen beliefs as more rewards are gathered (e.g., clicks as each article is recommended). At decision time, we autoregressively sample (impute) an imagined sequence of rewards for each action, and choose the action with the largest average imputed reward. Far from a heuristic, our approach is an implementation of Thompson sampling (with a learned prior), a prominent active exploration algorithm. We prove our pretraining loss directly controls online decision-making performance, and we demonstrate our framework on a news recommendation task where we fine-tune a pretrained language model end-to-end on news article headline text to improve performance.
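To make the decision-time procedure concrete, the following is a minimal sketch of Thompson sampling via autoregressive reward imputation. It assumes a hypothetical pretrained model exposing `model.sample_next_reward(features, rewards)`, hypothetical `action.features` and `action.observed_rewards` attributes, and an illustrative imputation horizon; none of these names are the paper's actual API.

```python
# Minimal sketch of the decision-time procedure described above.
# Assumptions (not from the source): a pretrained autoregressive model
# exposing model.sample_next_reward(features, rewards) -> float, which
# samples one imputed reward conditioned on an action's features and the
# rewards observed/imputed so far; actions carrying .features and
# .observed_rewards attributes; a fixed imputation horizon.
import numpy as np

def select_action(model, actions, horizon=50):
    """Thompson sampling via autoregressive generation of missing rewards.

    For each action, autoregressively impute an imagined continuation of
    its reward sequence, then choose the action whose imputed sequence
    has the largest mean.
    """
    best_action, best_mean = None, -np.inf
    for action in actions:
        # Start from rewards actually observed for this action so far
        # (empty in the cold-start setting).
        rewards = list(action.observed_rewards)
        # Autoregressively sample (impute) the remaining rewards, each
        # draw conditioned on the action's features and the sequence so far.
        while len(rewards) < horizon:
            rewards.append(model.sample_next_reward(action.features, rewards))
        mean_reward = float(np.mean(rewards))
        if mean_reward > best_mean:
            best_action, best_mean = action, mean_reward
    return best_action
```

Because each imputed sequence is a random draw from the model's learned posterior over reward sequences, taking the argmax of the imputed means induces exploration in proportion to the model's uncertainty, which is what makes this an implementation of Thompson sampling rather than a greedy heuristic.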