Learning agents that excel at sequential decision-making tasks must continuously resolve the trade-off between exploration and exploitation for optimal learning. However, such online interactions with the environment may be prohibitively expensive and subject to constraints, such as a limited budget for agent-environment interactions and restricted exploration in certain regions of the state space. Examples include selecting candidates for medical trials and training agents in complex navigation environments. This problem motivates the study of active reinforcement learning strategies that collect minimal additional experience trajectories by reusing existing offline data previously collected by some unknown behavior policy. In this work, we propose a representation-aware, uncertainty-based active trajectory collection method that intelligently chooses interaction strategies while accounting for the distribution of the existing offline data. Through extensive experiments, we demonstrate that our method reduces additional online interaction with the environment by up to 75% over competitive baselines across various continuous control environments.