State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.
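To make the streaming setting concrete, the sketch below shows one purely online actor-critic update from a single transition, with no replay buffer, no batch update, and no target network. This is a minimal illustration only: the PyTorch Gaussian-policy architecture, network sizes, learning rates, and TD(0) update rule are all assumptions for exposition, not the S2AC or SDAC algorithms proposed here.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and architectures; chosen only for illustration.
obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent Gaussian std
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.SGD(list(actor.parameters()) + [log_std], lr=1e-4)
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)
gamma = 0.99

def streaming_update(obs, act, reward, next_obs, done):
    """One purely online update from a single transition:
    no replay buffer, no batching, no target network."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    act_t = torch.as_tensor(act, dtype=torch.float32)
    next_t = torch.as_tensor(next_obs, dtype=torch.float32)

    # TD(0) target bootstraps from the live critic (no frozen target copy).
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - float(done)) * critic(next_t)
        advantage = td_target - critic(obs_t)

    # Critic: squared TD error on this single transition.
    critic_loss = (td_target - critic(obs_t)).pow(2).squeeze()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: advantage-weighted log-likelihood of the taken action.
    dist = torch.distributions.Normal(actor(obs_t), log_std.exp())
    actor_loss = -(dist.log_prob(act_t).sum() * advantage.squeeze())
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In an on-device finetuning loop of this kind, `streaming_update` would be called once per environment step, so memory and compute stay constant regardless of how long the agent runs.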