Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework built on a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with an ensemble size of only $\Theta(d \log T)$, significantly improving on prior methods. Crucially, this efficiency holds across both compact and finite action sets, with either time-invariant or time-varying contexts, without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments on linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff relative to state-of-the-art methods.
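To make the architecture description concrete, the sketch below shows one plausible instantiation for the linear-bandit case, written only from what the abstract states: a shared factor, random linear combinations of ensemble directions, and incremental updates. The Gaussian reward perturbations, the Sherman-Morrison update, the unit-sphere mixing weights, and the class name `EnsemblePlusPlusLinear` are all illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of an Ensemble++-style agent for linear bandits.
# Concrete choices here (perturbation scheme, mixing distribution,
# Sherman-Morrison updates) are assumptions for illustration only.
import numpy as np

class EnsemblePlusPlusLinear:
    def __init__(self, d, m, lam=1.0, noise_std=1.0, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.d, self.m = d, m
        self.A_inv = np.eye(d) / lam           # shared factor: one inverse Gram matrix
        self.b = np.zeros(d)                   # shared running sum of x_t * r_t
        self.B = self.rng.normal(size=(d, m))  # per-head perturbation accumulators (assumed prior)
        self.noise_std = noise_std

    def sample_theta(self):
        # Random linear combination of the m ensemble directions
        # (unit-sphere weights are an assumed choice).
        w = self.rng.normal(size=self.m)
        w /= np.linalg.norm(w) + 1e-12
        mean = self.A_inv @ self.b
        return mean + self.A_inv @ (self.B @ w)

    def act(self, X):
        # X: (n_actions, d) feature matrix; greedy arm under the sampled parameter.
        return int(np.argmax(X @ self.sample_theta()))

    def update(self, x, r):
        # Incremental Sherman-Morrison update of the shared inverse Gram matrix.
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += r * x
        # Each head accumulates an independent Gaussian perturbation of the reward.
        z = self.rng.normal(scale=self.noise_std, size=self.m)
        self.B += np.outer(x, z)

if __name__ == "__main__":
    # Toy usage: 5 fixed arms in d=8 with an unknown parameter theta_star.
    rng = np.random.default_rng(1)
    d, n_arms, m = 8, 5, 32
    theta_star = rng.normal(size=d)
    X = rng.normal(size=(n_arms, d))
    agent = EnsemblePlusPlusLinear(d, m, rng=rng)
    for t in range(2000):
        a = agent.act(X)
        r = X[a] @ theta_star + rng.normal(scale=0.1)
        agent.update(X[a], r)
```

Under these assumptions, the shared inverse Gram matrix makes the per-step cost $O(d^2 + dm)$ rather than the $O(m d^2)$ of maintaining $m$ independent regressors, which illustrates the regret-computation tradeoff the abstract refers to.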