Real-time reinforcement learning (RL) introduces several challenges. First, policies are constrained to a fixed number of actions per second due to hardware limitations. Second, the environment may change while the network is still computing an action, leading to observational delay. The first issue can be partly addressed with pipelining, which increases throughput and can yield better policies. The second issue, however, remains: if each neuron operates in parallel with an execution time of $\tau$, an $N$-layer feed-forward network incurs an observation delay of $\tau N$. Reducing the number of layers decreases this delay, but at the cost of the network's expressivity. In this work, we explore the trade-off between minimizing delay and preserving the network's expressivity. We present a theoretically motivated solution that combines temporal skip connections with history-augmented observations. We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across a range of neuron execution times, reinforcement learning algorithms, and environments, including four MuJoCo tasks and all MinAtar games. Moreover, we demonstrate that parallel neuron computation can accelerate inference by 6-350% on standard hardware. Our investigation into temporal skip connections and parallel computation paves the way for more efficient RL agents in real-time settings.
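The delay relationship above can be made concrete with a minimal sketch. All names here are hypothetical illustrations, not part of the paper's implementation: it only shows the arithmetic that in a fully pipelined network the freshest observation must traverse all $N$ layers (delay $\tau N$), whereas a temporal skip connection that feeds the newest observation directly into the final layer lets it influence the action after a single layer's execution time $\tau$.

```python
def observation_delay(num_layers: int, tau_ms: float) -> float:
    """Pipelined N-layer feed-forward network: each action is based on
    an observation that is N * tau old, since the input must pass
    through every layer in sequence."""
    return num_layers * tau_ms


def delay_with_temporal_skip(tau_ms: float) -> float:
    """With a temporal skip connection from the input to the final
    layer, the newest observation reaches the output after only one
    layer's execution time (the deeper path still carries older,
    fully processed information)."""
    return 1 * tau_ms


# Example: a 4-layer network with a 5 ms per-neuron execution time.
print(observation_delay(4, 5.0))       # 20.0 ms of observation delay
print(delay_with_temporal_skip(5.0))   # 5.0 ms for the skip path
```

The sketch also makes the trade-off visible: shrinking `num_layers` reduces the delay of the deep path but removes layers, while the skip connection keeps the depth (and thus expressivity) and only shortens the path for the most recent observation.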