Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods require large datasets and can be unstable, while vision-language-action (VLA) model-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce \textbf{SigEnt-SAC}, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization from pushing the policy toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100\% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; the results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.