Climate decision-making is constrained by the complexity and inaccessibility of key information locked within lengthy, technical, and multilingual documents. Generative AI technologies offer a promising route to improving the accessibility of the information these documents contain, but current systems suffer from limitations: (1) a tendency to hallucinate or misrepresent information, (2) difficulty in steering or guaranteeing properties of generated output, and (3) reduced performance in specialised technical domains. To address these challenges, we introduce a novel evaluation framework with domain-specific dimensions tailored to climate-related documents. We then apply this framework to evaluate Retrieval-Augmented Generation (RAG) approaches, assessing retrieval and generation quality within a prototype tool that answers questions about individual climate law and policy documents. In addition, we publish a human-annotated dataset and scalable automated evaluation tools, with the aim of facilitating broader adoption and robust assessment of such systems in the climate domain. Our findings highlight the key components of responsible RAG deployment for enhancing decision-making, and offer insights into user experience (UX) considerations for safely deploying such systems and building user trust in high-risk domains.
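To make the RAG setup described above concrete, the following is a minimal sketch of the retrieval step over a single policy document. It uses simple bag-of-words overlap as the relevance score; the function names (`chunk`, `score`, `retrieve`) and the scoring method are illustrative assumptions, not the paper's implementation, which would typically use dense embeddings and an LLM for answer generation.

```python
def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size word chunks (passages)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, passage: str) -> int:
    """Count lowercase terms shared between the query and a passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the top-k passages ranked by term overlap with the query."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]
```

In a full pipeline, the retrieved passages would be inserted into an LLM prompt as grounding context, and the evaluation framework would then judge both the retrieved passages (retrieval quality) and the generated answer (generation quality) along the domain-specific dimensions.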