In many real-world applications, it is hard to provide a reward signal at each step of a Reinforcement Learning (RL) process, and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work studied RL-ABF only in tabular settings, where the number of states is assumed to be small. In this paper, we extend RL-ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with an ensemble of Q-functions, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble.
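To make the feedback model concrete, below is a minimal sketch of the RL-ABF interaction protocol: per-step rewards exist but stay hidden, and the agent receives only their sum once the episode ends. The wrapper name `AggregateFeedbackEnv` and the underlying `env` interface are hypothetical illustrations, not artifacts of the paper.

```python
class AggregateFeedbackEnv:
    """Hypothetical wrapper illustrating aggregate bandit feedback:
    per-step rewards are accumulated internally and only their sum
    is revealed to the agent when the episode ends."""

    def __init__(self, env, horizon):
        self.env = env          # any episodic env with reset() and step(a) -> (s', r)
        self.horizon = horizon  # episode length H
        self._return = 0.0      # hidden running sum of rewards
        self._t = 0

    def reset(self):
        self._return, self._t = 0.0, 0
        return self.env.reset()

    def step(self, action):
        next_state, reward = self.env.step(action)
        self._return += reward  # the individual reward is never shown to the agent
        self._t += 1
        if self._t == self.horizon:
            # episode over: the agent now observes the aggregate return
            return next_state, self._return, True
        return next_state, None, False  # no per-step reward signal
```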
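For intuition on the second algorithm, the sketch below shows a textbook exponential-weights (Hedge) update over K ensemble members. This is only the generic aggregation scheme the word "hedging" refers to; the paper's actual scheme over the Q-function ensemble differs in its details.

```python
import numpy as np

class HedgeEnsemble:
    """Generic exponential-weights (Hedge) aggregation over K value
    estimates; an illustrative sketch, not the paper's construction."""

    def __init__(self, num_members, eta):
        self.weights = np.ones(num_members) / num_members  # uniform prior over members
        self.eta = eta  # learning rate

    def aggregate(self, member_values):
        # weighted combination of the ensemble's value estimates
        return self.weights @ member_values

    def update(self, member_losses):
        # multiplicative-weights step; renormalize to keep a distribution
        self.weights *= np.exp(-self.eta * member_losses)
        self.weights /= self.weights.sum()
```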