Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer: by treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that this approach misses an entire stable regime and fails to locate the point at which the qualitative change actually occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean stability condition is $$\beta\,\|\Pi W \Pi\|_{\mathcal T\to\mathcal T} < 2,$$ rather than the previously used condition, which certifies stability only while the system remains safely over-regularized. Our theorem thus covers the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to systems that are reward-responsive yet still globally predictable, and it identifies where the model genuinely undergoes a phase transition.
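To make the stated condition concrete, the following minimal sketch evaluates the quantity $\beta\,\|\Pi W \Pi\|$ numerically and runs a softmax feedback iteration from two different starting points. Several pieces are assumptions for illustration and are not taken from the text: $\Pi$ is taken to be the orthogonal projector onto sum-zero vectors (one natural reading of the tangent space $\mathcal T$ of the probability simplex), the system is taken to be the discrete-time map $p \mapsto \mathrm{softmax}(\beta(Wp + b))$, and the names `tangent_projector`, `threshold_quantity`, and `iterate` are hypothetical helpers.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def tangent_projector(n):
    """Orthogonal projector onto sum-zero vectors.

    Assumption: this is what Pi denotes, with T the tangent space
    of the probability simplex.
    """
    return np.eye(n) - np.ones((n, n)) / n


def threshold_quantity(W, beta):
    """Return beta * ||Pi W Pi||_2 (spectral norm), to be compared with 2."""
    P = tangent_projector(W.shape[0])
    return beta * np.linalg.norm(P @ W @ P, 2)


def iterate(W, b, beta, p0, steps=500):
    """Assumed dynamics: repeat p <- softmax(beta * (W p + b))."""
    p = p0
    for _ in range(steps):
        p = softmax(beta * (W @ p + b))
    return p


# Random affine logit system, with beta chosen so the quantity equals 1.5:
# above the unit-scale level 1, but still below the stated bound 2.
rng = np.random.default_rng(0)
n = 5
W = rng.standard_normal((n, n))
b = rng.standard_normal(n)
beta = 1.5 / threshold_quantity(W, 1.0)
q = threshold_quantity(W, beta)

# Two very different initial distributions: uniform and a vertex.
p_a = iterate(W, b, beta, np.full(n, 1.0 / n))
p_b = iterate(W, b, beta, np.eye(n)[0])
same_limit = np.allclose(p_a, p_b, atol=1e-10)
```

In this sketch both runs reach the same distribution, which is what a unique, globally attracting fixed point would look like for the assumed discrete-time map. It illustrates the condition on one random instance and is not a check of the theorem itself.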