Large Language Models (LLMs) are transforming scholarly tasks such as search and summarization, but their reliability remains uncertain. Current approaches to evaluating LLM reliability rely primarily on automated metrics that prioritize efficiency and scalability but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which revealed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.