Active search for recovering objects of interest through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments against exploitation of prior observations in the search space. Prior work has proposed myopic, greedy approaches based on information gain and Thompson sampling, in which agents actively decide query or search locations when the number of targets is unknown. Decision-making algorithms in such partially observable environments have also shown that agents capable of lookahead over a finite horizon outperform myopic policies for active search. Unfortunately, lookahead algorithms typically rely on building a computationally expensive search tree that is simulated and updated based on the agent's observations and a model of the environment dynamics. Instead, in this work, we leverage the sequence-modeling abilities of diffusion models to sample lookahead action sequences that balance the exploration-exploitation trade-off for active search without building an exhaustive search tree. We identify the optimism bias that arises when prior diffusion-based reinforcement learning approaches are applied to the active search setting, and we propose mitigating solutions for efficient cost-aware decision making with both single- and multi-agent teams. Our proposed algorithm outperforms standard offline reinforcement learning baselines in terms of full recovery rate and is computationally more efficient than tree search for cost-aware active decision making.