Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this purpose because they approximate Bayesian posterior predictive distributions in a single forward pass. However, PFNs predict noisy future rewards, while Thompson sampling requires uncertainty over the latent mean reward function. We propose PFN-TS, a Thompson sampling algorithm that converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. The method estimates posterior variance from a geometric grid of $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence used in previous predictive-sequence approaches, and reuses TabICL's cached representations across rounds. We prove consistency of the subsampled variance estimator and give a Bayesian regret bound that decomposes PFN-TS regret into exact posterior-sampling regret under the PFN prior plus approximation terms. Empirically, PFN-TS achieves the best average rank across nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation. Code is available at https://anonymous.4open.science/r/PFN_TS-36ED/.
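To make the two computational ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (i) one way to build a geometric grid of $O(\log n)$ dataset prefix lengths and (ii) a generic Thompson step that draws one mean-reward sample per arm and acts greedily. The function names `geometric_prefix_grid` and `thompson_step`, the growth factor `base=2.0`, and the Gaussian form of the mean-reward sample are all assumptions made for illustration; the per-arm `means` and `variances` stand in for whatever the PFN and the subsampled variance estimator would supply.

```python
import math
import random


def geometric_prefix_grid(n, base=2.0, n_min=1):
    """Return increasing prefix lengths from n_min up to n, spaced geometrically.

    Hypothetical helper: the grid has O(log n) entries and always ends at n,
    so the full dataset is included as the last prefix.
    """
    if n < n_min:
        return []
    sizes = []
    size = float(n_min)
    while size < n:
        k = int(size)
        # Skip duplicates that can arise from rounding when base is close to 1.
        if not sizes or k > sizes[-1]:
            sizes.append(k)
        size *= base
    if not sizes or sizes[-1] != n:
        sizes.append(n)
    return sizes


def thompson_step(means, variances, rng):
    """Draw one mean-reward sample per arm and return the greedy arm index.

    Assumption: each arm's mean reward is sampled from a Gaussian with the
    given mean and variance. In a PFN-based method these would come from the
    model's predictions and an estimated posterior variance.
    """
    samples = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means, variances)]
    return max(range(len(samples)), key=samples.__getitem__)


if __name__ == "__main__":
    grid = geometric_prefix_grid(1000)
    print(grid)  # prefix lengths at which a predictive would be evaluated
    rng = random.Random(0)
    print(thompson_step([0.1, 0.9, 0.5], [0.04, 0.04, 0.04], rng))
```

With `base=2.0`, a dataset of 1000 observations yields 11 prefix lengths instead of 1000, which is where the $O(\log n)$ versus $O(n)$ saving in the abstract comes from. How the predictives at those prefixes are combined into a variance estimate is specific to the paper and is not reproduced here.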