We study sequential decision-making when the agent's internal model class is misspecified. Within the infinite-horizon Berk-Nash framework, stable behavior arises as a fixed point: the agent acts optimally relative to a subjective model, while that model is statistically consistent with the long-run data endogenously generated by the policy itself. We provide a rigorous characterization of this equilibrium via coupled linear programs and a bilevel optimization formulation. To address the intrinsic non-smoothness of standard best-response correspondences, we introduce entropy regularization, establishing the existence of a unique soft Bellman fixed point and a smooth objective. Exploiting this regularity, we develop an online learning scheme that casts model selection as an adversarial bandit problem using an EXP3-type update, augmented by a novel conjecture-set zooming mechanism that adaptively refines the parameter space. Numerical results demonstrate effective exploration-exploitation trade-offs, convergence to the KL-minimizing model, and sublinear regret.
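The entropy-regularized (soft) Bellman fixed point mentioned above can be illustrated on a small synthetic MDP. This is a minimal sketch under illustrative assumptions: the random rewards `R`, transition kernel `P`, temperature `tau`, and discount `gamma` are stand-ins, not the paper's actual model; it only demonstrates that the soft operator is a sup-norm contraction with a unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tau = 4, 3, 0.9, 0.5          # illustrative sizes/parameters

R = rng.uniform(0, 1, size=(nS, nA))          # reward table R[s, a]
P = rng.dirichlet(np.ones(nS), size=(nS, nA)) # transitions P[s, a, s']

def soft_bellman(V):
    # Q(s,a) = R(s,a) + gamma * E_{s'}[V(s')]; entropy regularization
    # replaces max_a with the soft-max tau * log sum_a exp(Q/tau).
    Q = R + gamma * (P @ V)
    return tau * np.log(np.exp(Q / tau).sum(axis=1))

V = np.zeros(nS)
for _ in range(500):
    V_next = soft_bellman(V)
    if np.max(np.abs(V_next - V)) < 1e-10:    # fixed point reached
        break
    V = V_next

# The soft operator is a gamma-contraction, so iteration converges
# to the unique soft Bellman fixed point.
print(np.max(np.abs(soft_bellman(V) - V)) < 1e-8)  # True
```

Because the soft-max is smooth in the model parameters, the resulting objective inherits the regularity that the abstract exploits for the online learning scheme.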
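The EXP3-type model-selection step can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the finite conjecture set, the Bernoulli loss model, the learning rate `eta`, and the exploration mix `gamma_mix` are all assumptions introduced here, and the conjecture-set zooming refinement is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 2000                    # number of candidate models, rounds
eta, gamma_mix = 0.05, 0.1        # illustrative EXP3 parameters
# Hypothetical per-model success probabilities; model 2 has lowest loss.
true_probs = np.array([0.3, 0.5, 0.9, 0.4, 0.2])

weights = np.ones(K)
for t in range(T):
    # Mix exponential weights with uniform exploration (standard EXP3).
    probs = (1 - gamma_mix) * weights / weights.sum() + gamma_mix / K
    k = rng.choice(K, p=probs)                     # sample a model
    loss = 1.0 - rng.binomial(1, true_probs[k])    # bandit feedback only
    est = loss / probs[k]                          # importance-weighted loss
    weights[k] *= np.exp(-eta * est)
    weights /= weights.max()                       # rescale for stability

print(int(np.argmax(weights)))  # index of the currently favored model
```

Sampling models in proportion to exponential weights trades off exploration against exploitation; the importance-weighted loss estimate keeps the update unbiased under bandit feedback.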