Large Language Models (LLMs) have been applied to many research problems across various domains. One application of LLMs is building question-answering systems for users in different fields. The effectiveness of LLM-based question-answering systems has already been established at an acceptable level in popular, public domains such as trivia and literature, but it has rarely been established in niche domains that traditionally require specialized expertise. To this end, we construct the NEPAQuAD1.0 benchmark to evaluate the performance of three frontier LLMs -- Claude Sonnet, Gemini, and GPT-4 -- when answering questions drawn from Environmental Impact Statements prepared by U.S. federal government agencies in accordance with the National Environmental Policy Act (NEPA). We specifically measure the ability of LLMs to understand the nuances of the legal, technical, and compliance-related information in NEPA documents under different contextual scenarios. For example, we test the LLMs' internal prior NEPA knowledge by posing questions without any context, and we assess how LLMs synthesize the contextual information in long NEPA documents to support the question-answering task. We compare the performance of long-context LLMs and RAG-powered models on different types of questions (e.g., problem-solving, divergent). Our results suggest that RAG-powered models significantly outperform long-context models in answer accuracy, regardless of the choice of frontier LLM. Further analysis reveals that many models answer closed questions more accurately than divergent and problem-solving questions.
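To make the three contextual scenarios concrete, the sketch below contrasts them in Python. It is an illustration only: `ask_llm`, the fixed-size chunker, and the toy lexical retriever are hypothetical stand-ins, not the prompting or retrieval setup actually used in the benchmark.

```python
# Minimal sketch of the three contextual scenarios: no context (prior
# knowledge), full long-document context, and RAG-style retrieved context.
# All helpers here are hypothetical placeholders, not the paper's method.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a frontier LLM (e.g., GPT-4, Gemini, Claude)."""
    raise NotImplementedError

def chunk(document: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking of an EIS document."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def retrieve(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Toy lexical retriever: rank chunks by question-word overlap."""
    words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def answer(question: str, document: str, mode: str) -> str:
    if mode == "no_context":      # probe the model's prior NEPA knowledge
        prompt = question
    elif mode == "long_context":  # supply the entire document as context
        prompt = f"{document}\n\nQuestion: {question}"
    elif mode == "rag":           # supply only the top-k retrieved passages
        passages = "\n".join(retrieve(chunk(document), question))
        prompt = f"{passages}\n\nQuestion: {question}"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ask_llm(prompt)
```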