Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in isolation, ignoring the inference-time interventions available to end users. It therefore remains unclear whether failures (e.g., refusals, hallucinations, uneven coverage) stem from model choice or from deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that each intervention entails distinct trade-offs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, whereas RAG primarily improves technical quality but reduces diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing uniform gains. LLMScholarBench makes these dynamics auditable across models and interventions in LLM-based scholar recommendation.