Consider this scenario: an agent navigates a latent graph by performing actions that take it from one node to another. The chosen action determines the probability distribution over the next visited node. At each node, the agent receives an observation, but this observation is not unique to the node and therefore does not identify it, which makes the problem aliased. The purpose of this work is to provide a policy that approximately maximizes exploration efficiency (i.e., how well the graph is recovered for a given exploration budget). In the unaliased case, we show improved performance over state-of-the-art reinforcement learning baselines. For the aliased case, we are not aware of suitable baselines and instead show faster recovery than a random policy for a wide variety of topologies, and exponentially faster recovery than a random policy for challenging topologies. We dub the algorithm eFeX (from eFficient eXploration).
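To make the setting concrete, the following is a minimal sketch of an aliased latent-graph environment explored by a random policy. It is not the eFeX algorithm. The class `AliasedGraphEnv`, the helpers `make_ring_env` and `random_policy_rollout`, the four-node ring topology, the 0.9/0.1 transition probabilities, and the count of visited hidden nodes used as a coverage proxy are all hypothetical choices made for illustration.

```python
import random


class AliasedGraphEnv:
    """Toy latent graph: hidden nodes, stochastic transitions, non-unique observations."""

    def __init__(self, transitions, observations, start=0, seed=0):
        # transitions[node][action] -> list of (next_node, probability) pairs
        self.transitions = transitions
        # observations[node] -> observation label; several nodes may share a label
        self.observations = observations
        self.node = start  # hidden state, never shown to the agent
        self.rng = random.Random(seed)

    def step(self, action):
        """Sample the next hidden node for the chosen action and return its observation."""
        outcomes = self.transitions[self.node][action]
        nodes = [n for n, _ in outcomes]
        probs = [p for _, p in outcomes]
        self.node = self.rng.choices(nodes, weights=probs, k=1)[0]
        return self.observations[self.node]


def make_ring_env(seed=0):
    """Hypothetical 4-node ring: action 0 moves clockwise, action 1 counter-clockwise.

    Each move succeeds with probability 0.9 (assumed); otherwise the agent stays put.
    """
    n = 4
    transitions = {}
    for node in range(n):
        transitions[node] = {
            0: [((node + 1) % n, 0.9), (node, 0.1)],
            1: [((node - 1) % n, 0.9), (node, 0.1)],
        }
    # Nodes 0 and 2 emit observation 0, nodes 1 and 3 emit observation 1 (aliasing).
    observations = [0, 1, 0, 1]
    return AliasedGraphEnv(transitions, observations, start=0, seed=seed)


def random_policy_rollout(env, num_actions, budget, seed=0):
    """Run a uniformly random policy for `budget` steps.

    Returns the (action, observation) trajectory the agent sees, plus the set of
    hidden nodes visited, which is available here only for evaluation.
    """
    rng = random.Random(seed)
    trajectory = []
    visited = {env.node}
    for _ in range(budget):
        action = rng.randrange(num_actions)
        obs = env.step(action)
        trajectory.append((action, obs))
        visited.add(env.node)
    return trajectory, visited


env = make_ring_env(seed=0)
trajectory, visited = random_policy_rollout(env, num_actions=2, budget=50, seed=1)
print(len(trajectory), "steps;", len(visited), "of 4 hidden nodes visited")
```

The agent only ever sees the `(action, observation)` pairs in `trajectory`. Because two observation labels cover four hidden nodes, a single observation cannot tell the agent which node it is in, which is the aliasing described above.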