Large language models (LLMs) are increasingly used for academic expert recommendation. Existing audits typically evaluate model outputs in isolation, largely ignoring end-user inference-time interventions. As a result, it remains unclear whether failures such as refusals, hallucinations, and uneven coverage stem from model choice or from deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures both technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that end-user interventions do not yield uniform improvements but instead redistribute error across dimensions. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, whereas RAG primarily improves technical quality at the cost of diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing a general fix. We release code and data that can be adapted to other disciplines by replacing the domain-specific ground truth and metrics.