A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes the problem concrete. Three faults arise there: a mass-conservation projection couples the agents' outputs and erases the per-agent credit the policy gradient needs; a memoryless policy cannot resolve the slow near-wall cycle it acts on; and a pressure-gradient reward pays for nominal drag reduction by pumping power through the wall. Two degenerate controllers achieve large drag reductions while total dissipation rises, so the reported figure can mask a more wasteful flow. We trace each fault to its cause and fix it with, respectively, a differentiable projection that restores credit, a recurrent policy with a widened sensing stencil, and a reward scored on the true wall power. The corrected controller acts on the flow within a closed energy budget and earns a conservative $17\%$ drag reduction under honest accounting.
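Two of these faults can be illustrated with a minimal sketch, under assumptions that are not taken from the text. It assumes the mass-conservation projection is a plain mean subtraction over the agents' wall-normal velocities, and that the flow is driven at constant flow rate so that pumping power is proportional to drag. The function names and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np


def project_zero_net_flux(raw):
    """Subtract the mean so the applied wall velocities sum to zero.

    ASSUMPTION: mean subtraction stands in for the projection; the
    actual projection used by the controller may differ.
    """
    raw = np.asarray(raw, dtype=float)
    return raw - raw.mean()


def projection_jacobian(n):
    """Jacobian d(applied_i)/d(raw_j) = delta_ij - 1/n of the mean subtraction.

    The non-zero off-diagonal entries show that every applied action
    depends on every agent's raw output, which blurs per-agent credit.
    """
    return np.eye(n) - np.full((n, n), 1.0 / n)


def drag_reduction(p_pump_ref, p_pump_ctrl):
    """Fractional drop in pumping power (equal to drag at constant flow rate)."""
    return (p_pump_ref - p_pump_ctrl) / p_pump_ref


def net_power_saving(p_pump_ref, p_pump_ctrl, p_wall_ctrl):
    """Fractional saving once the wall actuation power is charged as a cost."""
    return (p_pump_ref - (p_pump_ctrl + p_wall_ctrl)) / p_pump_ref


# Hypothetical raw outputs of four agents (illustrative values only).
applied = project_zero_net_flux([0.3, -0.1, 0.2, 0.0])
jac = projection_jacobian(4)

# Hypothetical power budget in units of the uncontrolled pumping power.
dr = drag_reduction(1.0, 0.6)
net = net_power_saving(1.0, 0.6, 0.5)
```

With these illustrative numbers the nominal drag reduction is positive while the net saving is negative, which is the kind of gap a reward scored on the true wall power is meant to expose.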