Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
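To make the agent-environment interface concrete, the sketch below shows one plausible shape for it: the action is the cylinder's commanded rotation, the reward is signed drag (so maximization and minimization differ only in sign), and the observation is the high-dimensional flow feedback, which can be switched off to emulate the feedback-free training condition. All names here (`WaterChannelEnv`, `set_cylinder_rotation`, `read_drag_sensor`, `read_flow_field`) are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


# --- Placeholder hardware interface (illustrative stubs) --------------------
def set_cylinder_rotation(rate):
    """Stub: on the real rig this would command the cylinder's motor."""
    pass


def read_drag_sensor():
    """Stub: on the real rig this would return the measured drag force."""
    return float(rng.normal())


def read_flow_field(n_sensors=64):
    """Stub: on the real rig this would return high-dimensional flow data."""
    return rng.normal(size=n_sensors)


class WaterChannelEnv:
    """Hypothetical gym-style wrapper around the water-channel experiment."""

    def __init__(self, maximize_drag=False, use_flow_feedback=True):
        # Reward is signed drag: +drag for maximization, -drag for minimization.
        self.sign = 1.0 if maximize_drag else -1.0
        self.use_flow_feedback = use_flow_feedback

    def step(self, action):
        # Action: commanded rotation rate of the spinning cylinder.
        set_cylinder_rotation(action)

        # The objective is captured directly in the reward.
        reward = self.sign * read_drag_sensor()

        # Observation: rich flow feedback when enabled; otherwise an empty
        # vector, forcing the agent to learn without flow information.
        obs = read_flow_field() if self.use_flow_feedback else np.zeros(0)
        return obs, reward


# Minimal interaction loop: a random policy taking a few steps.
env = WaterChannelEnv(maximize_drag=False, use_flow_feedback=True)
for _ in range(3):
    obs, reward = env.step(action=float(rng.uniform(-1.0, 1.0)))
    print(f"reward={reward:+.3f}, obs_dim={obs.shape[0]}")
```

Note that in this framing the replay experiment from the abstract corresponds to re-issuing a recorded action sequence while ignoring `obs`, and the feedback-free training condition corresponds to `use_flow_feedback=False`.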


