Large language models (LLMs) are increasingly deployed in high-value agentic workflows such as Deep Research, e-commerce recommendation, and job recruitment. In these applications, an LLM must select the best options from a large pool of candidates, a setting we term the \textit{LLM-as-a-Recommender} paradigm. However, the reliability of LLM agents as recommenders remains underexplored. In this work, we introduce the \textbf{Bias} \textbf{Rec}ommendation \textbf{Bench}mark (\textbf{BiasRecBench}) to expose the vulnerability of such agents to biases in high-value real-world tasks. The benchmark covers three practical domains: paper review, e-commerce, and job recruitment. We construct a \textsc{Bias Synthesis Pipeline with Calibrated Quality Margins} that 1) synthesizes evaluation data by controlling the quality gap between optimal and sub-optimal options, providing a calibrated testbed for eliciting vulnerability to biases; and 2) injects contextual biases that are logically coherent and appropriate to each option's context. Extensive experiments on both state-of-the-art (Gemini-{2.5,3}-pro, GPT-4o, DeepSeek-R1) and small-scale LLMs reveal that agents frequently succumb to injected biases even though they have sufficient reasoning capability to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows and call for specialized alignment strategies for LLM-as-a-Recommender. The complete code and evaluation datasets will be made publicly available shortly.