Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation

Large Language Models (LLMs) have revolutionized recommender systems by leveraging their generative capabilities to model complex user preferences, giving rise to LLM-based recommendation (LLM4Rec). However, existing LLM4Rec methods primarily rely on token-level objectives, making it difficult to optimize the list-level, non-differentiable metrics (e.g., NDCG, fairness) that define actual recommendation quality. While Best-of-N (BoN) sampling directly optimizes these metrics at inference time, its high computational cost hinders real-world deployment. To address this, BoN Alignment aims to distill the search capability into the model itself, yet current approaches suffer from two critical limitations: (1) Indiscriminate Supervision, where the static reference fails to distinguish the relative quality of candidates exceeding its empirical range, leading to a loss of ranking guidance; and (2) Gradient Decay, where the effective supervision signal rapidly diminishes as the evolving policy improves, resulting in inefficient optimization. To overcome these challenges, we propose BLADE (Bayesian List-wise Alignment via Dynamic Estimation). Unlike static approaches, BLADE introduces a Bayesian framework that continuously updates the target distribution by fusing historical priors with dynamic evidence from the model's current rollouts. This mechanism constructs a self-evolving target that adapts to the model's growing capabilities, ensuring the training signal remains informative throughout the learning process. Extensive experiments on three real-world datasets demonstrate that BLADE significantly outperforms state-of-the-art baselines. Crucially, it breaks the static performance upper bound, achieving sustained gains in both ranking accuracy (Recall, NDCG) and complex list-wise metrics (Fairness, Diversity). The code is available at https://github.com/RegionCh/BLADE.
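
As a rough illustration of the ideas in the abstract, the sketch below shows (a) Best-of-N selection under a list-level metric and (b) why a static reference can stop discriminating between strong candidates. It is not taken from the BLADE code base. All names (`ndcg_at_k`, `best_of_n`, `ecdf`, `fused_cdf`, `prior_weight`) and all numbers are hypothetical, NDCG against known relevant items stands in for whatever reward is actually used, and the fusion rule (a pseudo-count-weighted mixture of empirical CDFs, which is the posterior mean under a Dirichlet-process prior) is an assumption chosen for simplicity, not necessarily the estimator the paper uses.

```python
import math
from bisect import bisect_right


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance (a non-differentiable list-level metric)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


def best_of_n(candidates, relevant, k):
    """Best-of-N: score N sampled lists and keep the highest-scoring one."""
    return max(candidates, key=lambda ranked: ndcg_at_k(ranked, relevant, k))


def ecdf(samples, x):
    """Empirical CDF: fraction of samples with value <= x."""
    return bisect_right(sorted(samples), x) / len(samples)


def fused_cdf(prior_samples, rollout_samples, x, prior_weight):
    """Assumed fusion rule: prior CDF weighted by a pseudo-count,
    rollout CDF weighted by the number of rollouts observed."""
    n = len(rollout_samples)
    return (prior_weight * ecdf(prior_samples, x)
            + n * ecdf(rollout_samples, x)) / (prior_weight + n)


# (a) Best-of-N over three hypothetical sampled recommendation lists.
candidates = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
best = best_of_n(candidates, relevant={"c"}, k=3)

# (b) Two candidates whose rewards both exceed the static reference range.
static_ref = [0.2, 0.3, 0.4, 0.5]  # rewards of samples from the fixed reference
rollouts = [0.5, 0.6, 0.7, 0.8]    # rewards of the current policy's rollouts
reward_a, reward_b = 0.6, 0.8

static_a = ecdf(static_ref, reward_a)
static_b = ecdf(static_ref, reward_b)
fused_a = fused_cdf(static_ref, rollouts, reward_a, prior_weight=4.0)
fused_b = fused_cdf(static_ref, rollouts, reward_b, prior_weight=4.0)

print(best)                # the list that ranks the relevant item first
print(static_a, static_b)  # both saturate at the top of the static range
print(fused_a, fused_b)    # the fused target still separates them
```

In this toy setting, both candidates sit at the top of the static reference, so it gives them the same quantile and no ranking signal between them. Once the current rollouts are mixed in, the better candidate receives a strictly higher quantile, which is the intuition behind a target that evolves with the policy.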
