Agents trained with DQN rely on an observation at each timestep to choose the next action. In real-world applications, however, observations can change or be missing entirely, for example when a light bulb fails or the wallpaper in a room is replaced. Such events alter the observation, but the underlying optimal policy stays the same. We therefore want the agent to keep taking actions until it again receives a (recognized) observation. To achieve this, we combine a neural network architecture that uses hidden representations of the observations with a novel n-step loss function. Our implementation withstands location-based stretches of blindness longer than those it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.
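For orientation, the standard n-step return that n-step variants of DQN build on can be sketched as below. This is the textbook target only, not the novel n-step loss introduced in this work, whose exact form is not given in this abstract. The function name `n_step_target` and its parameters are illustrative assumptions.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Textbook n-step TD target (an illustrative sketch, not the authors' loss).

    Computes r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * bootstrap_value,
    where bootstrap_value is typically max_a Q(s_n, a) from a target network.
    """
    target = bootstrap_value
    # Fold from the last reward backwards so each step applies one more discount.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


if __name__ == "__main__":
    # Three rewards of 1.0, a bootstrap value of 10.0, and a discount of 0.5.
    print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

In a setting with missing observations, a multi-step target of this kind lets the learning signal span several timesteps rather than depending on a single next observation. How the authors adapt it is left to the full paper.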