In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance: prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantitatively analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum from observing the immediate rewards before acting to observing all the rewards before the interaction starts.