Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.