State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.
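To make the streaming setting concrete, the sketch below shows one purely online actor-critic update from a single transition, with no replay buffer, no batch update, and no target network. This is a minimal illustration only: the PyTorch Gaussian-policy architecture, network sizes, learning rates, and TD(0) update rule are all assumptions for exposition, not the S2AC or SDAC algorithms proposed here.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and architectures; chosen only for illustration.
obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent Gaussian std
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.SGD(list(actor.parameters()) + [log_std], lr=1e-4)
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)
gamma = 0.99

def streaming_update(obs, act, reward, next_obs, done):
    """One purely online update from a single transition:
    no replay buffer, no batching, no target network."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    act_t = torch.as_tensor(act, dtype=torch.float32)
    next_t = torch.as_tensor(next_obs, dtype=torch.float32)

    # TD(0) target bootstraps from the live critic (no frozen target copy).
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - float(done)) * critic(next_t)
        advantage = td_target - critic(obs_t)

    # Critic: squared TD error on this single transition.
    critic_loss = (td_target - critic(obs_t)).pow(2).squeeze()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: advantage-weighted log-likelihood of the taken action.
    dist = torch.distributions.Normal(actor(obs_t), log_std.exp())
    actor_loss = -(dist.log_prob(act_t).sum() * advantage.squeeze())
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In an on-device finetuning loop of this kind, `streaming_update` would be called once per environment step, so memory and compute stay constant regardless of how long the agent runs.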