Task-oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT uses a bandit-inspired meta-controller that dynamically switches between a fast intuitive inference module (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, and human evaluations confirm that its decisions are well aligned with expert judgment. Code is available at https://github.com/carsonz/DyBBT.
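To make the switching idea concrete, the following is a minimal sketch of how a count-based meta-controller over a discretized cognitive state might look. It is not taken from the DyBBT code base: the names `discretize` and `MetaController`, the bin count, the UCB-style bonus formula, and the threshold rule are all assumptions chosen for illustration. The only elements drawn from the abstract are the three state dimensions, the use of visitation counts, and the choice between System 1 and System 2.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical discretized cognitive state:
# (dialog progress bin, user uncertainty bin, slot dependency bin)
CognitiveState = Tuple[int, int, int]


def discretize(progress: float, uncertainty: float, dependency: float,
               bins: int = 5) -> CognitiveState:
    """Map three scores in [0, 1] to integer bins (assumed representation)."""
    def to_bin(x: float) -> int:
        x = min(max(x, 0.0), 1.0)            # clamp to [0, 1]
        return min(int(x * bins), bins - 1)  # keep 1.0 in the last bin
    return (to_bin(progress), to_bin(uncertainty), to_bin(dependency))


class MetaController:
    """Illustrative switch: rarely visited states trigger System 2."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold                       # assumed hyperparameter
        self.counts: Dict[CognitiveState, int] = defaultdict(int)
        self.t = 0                                       # total decisions so far

    def bonus(self, state: CognitiveState) -> float:
        # UCB-style exploration bonus: large for states with few visits.
        return math.sqrt(math.log(self.t + 1) / (self.counts[state] + 1))

    def choose(self, state: CognitiveState) -> str:
        self.t += 1
        use_system2 = self.bonus(state) > self.threshold
        self.counts[state] += 1                          # record the visit
        return "system2" if use_system2 else "system1"
```

Under these assumptions, a state that has been visited many times falls back to the fast System 1 path, while a newly encountered state is routed to the slower System 2 reasoner. The threshold trades deliberation cost against exploration and would need tuning in any real system.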