As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. However, most automatic evaluations of explanations prioritize plausibility or faithfulness rather than testing whether an LLM reasons like an expert. Existing approaches to evaluating professional reasoning rely heavily on per-example expert annotation, making such evaluations costly and difficult to scale. To address this gap, we introduce the T-FIX benchmark, spanning seven scientific tasks across three domains, to operationalize expert alignment as a desired attribute of LLM-generated explanations. Our framework enables automatic evaluation of expert alignment, generalizing to unseen explanations and eliminating the need for ongoing expert involvement.