We evaluate whether factor-wise auxiliary dynamics supervision produces useful latent structure or improved robustness in simulated humanoid locomotion. DynaMITE -- a transformer encoder with a factored 24-d latent trained with per-factor auxiliary losses during proximal policy optimization (PPO) -- is compared against Long Short-Term Memory (LSTM), plain Transformer, and Multilayer Perceptron (MLP) baselines on a Unitree G1 humanoid across four Isaac Lab tasks. The supervised latent shows no evidence of decodable or functionally separable factor structure: probe R^2 ≈ 0 for all five dynamics factors, clamping any latent subspace changes reward by less than 0.05, and standard disentanglement metrics (MIG, DCI, SAP) are near zero. An unsupervised LSTM hidden state achieves higher probe R^2 (up to 0.10). A 2x2 factorial ablation (n = 10 seeds) isolates the contributions of the tanh bottleneck and the auxiliary losses. The auxiliary losses show no measurable effect on either in-distribution (ID) reward (+0.03, p = 0.732) or severe out-of-distribution (OOD) reward (+0.03, p = 0.669). The bottleneck shows a small, directionally consistent advantage in both regimes that does not reach statistical significance (ID: +0.16, p = 0.207; OOD: +0.10, p = 0.208). The bottleneck advantage persists under severe combined perturbation but does not amplify, which suggests a training-time representation benefit rather than a robustness mechanism. LSTM achieves the best nominal reward on all four tasks (p < 0.03); DynaMITE degrades less under combined-shift stress (2.3% vs. 16.7%), but the ablation attributes this difference to the bottleneck compression, not the auxiliary supervision. For locomotion practitioners: auxiliary dynamics supervision does not produce an interpretable estimator and does not measurably improve reward or robustness beyond what the bottleneck alone provides; recurrent baselines remain the stronger choice for nominal performance.
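To make the 2x2 factorial ablation concrete, the following is a minimal sketch of how main effects and an interaction term can be read off the four cells (bottleneck on/off crossed with auxiliary losses on/off). It assumes a main effect is defined as the difference of marginal cell means; the abstract does not state the estimator or the significance test used, so the function name `main_effects`, the cell encoding, and the toy rewards are all hypothetical and are not results from this work.

```python
from statistics import mean


def main_effects(cells):
    """Estimate 2x2 factorial effects from per-seed rewards.

    `cells` maps (bottleneck, aux) -> list of per-seed rewards, where each
    flag is 0 (off) or 1 (on). This encoding is an assumption made for
    illustration only.
    """
    m = {key: mean(values) for key, values in cells.items()}
    # Main effect: mean of the two cells with the factor on
    # minus mean of the two cells with the factor off.
    bottleneck = (m[(1, 0)] + m[(1, 1)]) / 2 - (m[(0, 0)] + m[(0, 1)]) / 2
    aux = (m[(0, 1)] + m[(1, 1)]) / 2 - (m[(0, 0)] + m[(1, 0)]) / 2
    # Interaction: does the aux effect differ with and without the bottleneck?
    interaction = (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])
    return {"bottleneck": bottleneck, "aux": aux, "interaction": interaction}


# Placeholder rewards (two seeds per cell), NOT values from the paper.
toy = {
    (0, 0): [1.0, 1.2],  # no bottleneck, no aux losses
    (0, 1): [1.1, 1.1],  # no bottleneck, aux losses
    (1, 0): [1.3, 1.3],  # bottleneck, no aux losses
    (1, 1): [1.2, 1.4],  # bottleneck, aux losses
}
effects = main_effects(toy)
```

In this toy setup the bottleneck accounts for the whole difference between cells while the auxiliary-loss effect and the interaction are zero, which mirrors the qualitative pattern reported above. A per-seed significance test would still be needed to attach p-values to each effect.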