Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and then separately fitting a policy to that model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching the implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory costs of ensembling Large Language Models (LLMs), POETS uses an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits a strong cumulative regret bound of $\mathcal{O}(\sqrt{T \gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay and in small-dataset regimes.
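The claim that a KL-regularized policy implicitly encodes a reward can be illustrated with the standard closed form for a single-step objective: the maximizer of $\mathbb{E}_{\pi}[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ satisfies $\pi^*(a) \propto \pi_{\mathrm{ref}}(a)\exp(r(a)/\beta)$, so $r(a) = \beta \log\big(\pi^*(a)/\pi_{\mathrm{ref}}(a)\big)$ up to an additive constant. The sketch below checks this numerically on a toy problem. It is not the POETS training procedure, which the abstract does not spell out; the finite action set, the names `kl_regularized_policy` and `implicit_reward`, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def kl_regularized_policy(reward, ref_probs, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite action set."""
    logits = np.log(ref_probs) + reward / beta
    logits -= logits.max()  # subtract the max for a numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum()

def implicit_reward(policy_probs, ref_probs, beta):
    """Reward encoded by a policy, identified only up to an additive constant."""
    return beta * (np.log(policy_probs) - np.log(ref_probs))

# Toy problem: 4 candidate actions with hypothetical rewards and reference policy.
reward = np.array([0.2, 1.0, -0.5, 0.3])
ref_probs = np.array([0.4, 0.1, 0.3, 0.2])
beta = 0.5  # KL regularization strength (assumed value)

policy = kl_regularized_policy(reward, ref_probs, beta)
recovered = implicit_reward(policy, ref_probs, beta)

# The recovered reward differs from the true reward by the same constant
# for every action (the negative scaled log-normalizer).
offset = recovered - reward
```

Because the offset is constant across actions, reward differences between actions are recovered exactly, which is what allows a policy's log-ratio to the reference to stand in for a reward model.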