Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics, each trained on its own disjoint reward signal. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels, from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) than the unified-critic policy. Notably, adding anti-gaming reward mechanisms on top of the dual-critic architecture provides no further improvement over the architectural change alone (60.9% with them vs. 65.2% without). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when a pre-trained manipulation policy is refined with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. Overall, the findings indicate that critic architecture is a primary, and often overlooked, design choice in multi-objective humanoid RL, with greater impact on reaching efficiency than reward engineering.
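To make the unified-versus-dual distinction concrete, the following is a minimal sketch of how the two designs might differ when computing policy-gradient advantages. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the dual critics' outputs are combined, so the use of generalized advantage estimation (GAE), per-objective normalization, the weighted sum, and every name here (`gae`, `normalize`, `unified_advantage`, `dual_advantage`, `w_loco`, `w_manip`) are hypothetical.

```python
from typing import List, Sequence


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one episode.

    `values` must hold one more entry than `rewards`; the extra
    final entry is the bootstrap value of the last state.
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def normalize(xs: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / (std + eps) for x in xs]


def unified_advantage(r_loco: Sequence[float], r_manip: Sequence[float],
                      v_total: Sequence[float], **kw: float) -> List[float]:
    """Unified critic: one value estimate for the summed reward."""
    total = [a + b for a, b in zip(r_loco, r_manip)]
    return normalize(gae(total, v_total, **kw))


def dual_advantage(r_loco: Sequence[float], r_manip: Sequence[float],
                   v_loco: Sequence[float], v_manip: Sequence[float],
                   w_loco: float = 1.0, w_manip: float = 1.0,
                   **kw: float) -> List[float]:
    """Dual critics: each reward stream gets its own value estimate.

    Assumption: advantages are normalized per objective and then
    combined with a weighted sum before the policy update.
    """
    a_loco = normalize(gae(r_loco, v_loco, **kw))
    a_manip = normalize(gae(r_manip, v_manip, **kw))
    return [w_loco * l + w_manip * m for l, m in zip(a_loco, a_manip)]
```

In this sketch, normalizing each objective's advantages separately keeps a small-magnitude manipulation signal from being swamped by a larger locomotion reward, which is one plausible reading of the "competing locomotion gradients" concern. Whether the compared policies use this exact combination rule is not stated in the abstract.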