We focus on an unloading problem, typical of the logistics sector, modeled as a sequential pick-and-place task. In this type of task, modern machine learning techniques have been shown to work better than classic systems because they adapt more readily to stochasticity and cope better with large uncertainties. More specifically, supervised and imitation learning have achieved outstanding results in this regard, with the shortcoming of requiring some form of supervision, which is not obtainable in all settings. Reinforcement learning (RL), on the other hand, requires a much milder form of supervision but remains impractical due to its inefficiency. In this paper, we propose and theoretically motivate a novel Unsupervised Reward Shaping algorithm from expert observations, which relaxes the level of supervision required by the agent and improves RL performance in our task.