A common setting in multitask reinforcement learning (RL) demands that an agent rapidly adapt to various stationary reward functions randomly sampled from a fixed distribution. In such situations, the successor representation (SR) is a popular framework that supports rapid policy evaluation by decoupling a policy's expected discounted, cumulative state occupancies from a specific reward function. However, in the natural world, sequential tasks are rarely independent, and instead reflect shifting priorities based on the availability and subjective perception of rewarding stimuli. Reflecting this disjunction, in this paper we study the phenomenon of diminishing marginal utility and introduce a novel state representation, the $\lambda$ representation ($\lambda$R), which, surprisingly, is required for policy evaluation in this setting and which generalizes the SR as well as several other state representations from the literature. We establish the $\lambda$R's formal properties and examine its normative advantages in the context of machine learning, as well as its usefulness for studying natural behaviors, particularly foraging.
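To make the decoupling performed by the SR concrete, the following is a minimal sketch for a tabular setting, not the paper's implementation. It assumes a fixed policy whose induced state-transition matrix `P` is known, uses the standard identity $M^\pi = (I - \gamma P^\pi)^{-1}$ with $V^\pi = M^\pi r$, and approximates $M^\pi$ by fixed-point iteration. The three-state chain, the reward vectors, and all function names are hypothetical and chosen only for illustration. It covers the SR only, not the $\lambda$R, which the abstract does not define.

```python
# Minimal sketch: the standard tabular successor representation (SR).
# All numbers and names below are hypothetical, chosen only for illustration.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def successor_representation(P, gamma, n_iters=1000):
    """Iterate M <- I + gamma * P M, which converges to (I - gamma P)^{-1} for gamma < 1."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(n_iters):
        PM = matmul(P, M)
        M = [[identity[i][j] + gamma * PM[i][j] for j in range(n)] for i in range(n)]
    return M

def evaluate(M, r):
    """Policy evaluation for a reward vector r: V = M r."""
    return [sum(m * ri for m, ri in zip(row, r)) for row in M]

# Assumed transition matrix of a 3-state chain under some fixed policy.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
gamma = 0.9

# The SR is computed once, independently of any reward function.
M = successor_representation(P, gamma)

# Two different stationary reward functions reuse the same SR.
reward_a = [1.0, 0.0, 0.0]
reward_b = [0.0, 0.0, 2.0]
values_a = evaluate(M, reward_a)
values_b = evaluate(M, reward_b)
```

Once `M` is available, evaluating the policy under a new stationary reward function costs a single matrix-vector product. This is the rapid-evaluation property the abstract attributes to the SR, and it relies on rewards that do not change as states are revisited.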