Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this view can be made literal rather than metaphorical. We study the special case in which a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighting of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring a single reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
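As a minimal sketch of the correspondence invoked above (the symbols $\pi_0$, $r$, $\beta$, $e$, and $c$ are illustrative notation, not the paper's): the KL-regularized soft update and the Bayesian posterior take the forms
$$
\pi^*(a) \;\propto\; \pi_0(a)\, e^{r(a)/\beta},
\qquad
p(a \mid e) \;\propto\; p(a)\, p(e \mid a).
$$
Matching the two with $\pi_0(\cdot) = p(\cdot)$ forces $r(a) = \beta \log p(e \mid a) + c(e)$ for some baseline $c$ depending only on the evidence, so the update pins down reward differences $r(a) - r(a')$ within a context while the absolute level, the context-specific baseline $c(e)$, stays free. For two pieces of evidence, the chain rule $p(e_1, e_2 \mid a) = p(e_1 \mid a)\, p(e_2 \mid a, e_1) = p(e_2 \mid a)\, p(e_1 \mid a, e_2)$ is one way to read the coherence constraint tying together the reward descriptions obtained under different conditioning orders.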