Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single-discipline, and persona-agnostic, leaving the sources of output variability poorly understood. To address this gap, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar across six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, whereas factuality and parity are driven by context and diversity by location. Prompts located in South Africa yield less factual lists, whereas prompts located in Japan yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.
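The audit design described above (crossing persona factors with context factors, then scoring each returned list against a reference database) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the factor levels, the function names `prompt_grid` and `factuality`, and the simplification of factuality to the share of recommended names found in a reference set by case-insensitive exact match. The benchmark's actual prompts, matching procedure, and metric definitions are not specified in this abstract.

```python
from itertools import product

# Hypothetical factor levels. The abstract names the factors
# (language, location, role-and-task, field, seniority, k) but not their values.
PERSONA = {
    "language": ["en", "ja"],
    "location": ["South Africa", "Japan"],
    "role_task": ["student seeking a mentor"],
}
CONTEXT = {
    "field": ["physics"],
    "seniority": ["senior"],
    "k": [5, 10],  # assumed here to be the requested list length
}


def prompt_grid(persona, context):
    """Return every combination of persona and context levels as a dict."""
    factors = {**persona, **context}
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]


def factuality(recommended, reference):
    """Share of recommended names present in the reference set.

    Simplified stand-in: case-insensitive exact match on the name string.
    """
    if not recommended:
        return 0.0
    known = {name.casefold() for name in reference}
    hits = sum(1 for name in recommended if name.casefold() in known)
    return hits / len(recommended)


grid = prompt_grid(PERSONA, CONTEXT)
```

Each entry of `grid` would parameterize one prompt per audited model, so that differences in scores can be attributed either to the model or to a specific prompt factor.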