Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset

Zafaryab Rasool, Stefanus Kurniawan, Sherwin Balugo, Scott Barnett, Rajesh Vasa, Courtney Chesser, Benjamin M. Hampstead, Sylvie Belleville, Kon Mouzakis, Alex Bahar-Fuchs

from arxiv, 10 pages, 1 figure

Document-based Question-Answering (QA) tasks are crucial for precise information retrieval. While some existing work focuses on evaluating the performance of large language models (LLMs) on retrieving and answering questions from documents, their performance on QA types that require exact answer selection from predefined options and numerical extraction has yet to be fully assessed. In this paper, we focus on this underexplored context and conduct an empirical analysis of LLMs (GPT-4 and GPT-3.5) on question types including single-choice, yes-no, multiple-choice, and number extraction questions from documents in a zero-shot setting. We use the CogTale dataset for evaluation, which provides human expert-tagged responses, offering a robust benchmark for precision and factual grounding. We found that LLMs, particularly GPT-4, can precisely answer many single-choice and yes-no questions given relevant context, demonstrating their efficacy in information retrieval tasks. However, their performance diminishes on multiple-choice and number extraction formats, lowering overall performance on the task and indicating that these models may not yet be sufficiently reliable for it. This limits the use of LLMs in applications demanding precise information extraction from documents, such as meta-analysis tasks. These findings hinge on the assumption that the retrievers furnish the pertinent context necessary for accurate responses, emphasizing the need for further research. Our work offers a framework for ongoing dataset evaluation, ensuring that LLM applications for information retrieval and document analysis continue to meet evolving standards.
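
To make the four answer formats concrete, the following is a minimal sketch of how exact-match accuracy could be computed per question type. It is an illustration only, not the authors' evaluation code: the function names `normalize` and `accuracy_by_type`, the normalization rules (case-insensitive labels, order-independent option sets, numeric comparison), and the sample records are all assumptions made for this example and are not taken from the CogTale dataset.

```python
from collections import defaultdict


def normalize(answer, qtype):
    """Map a raw answer to a comparable form for its question type (assumed rules)."""
    if qtype == "multiple-choice":
        # The order of selected options should not matter, so compare as sets.
        return frozenset(opt.strip().lower() for opt in answer)
    if qtype == "number":
        # Compare numerically so "12" and "12.0" count as the same value.
        return float(answer)
    # single-choice and yes-no: one case-insensitive label.
    return answer.strip().lower()


def accuracy_by_type(records):
    """records: iterable of (qtype, predicted, expected) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for qtype, predicted, expected in records:
        total[qtype] += 1
        try:
            hit = normalize(predicted, qtype) == normalize(expected, qtype)
        except (AttributeError, TypeError, ValueError):
            hit = False  # unparseable model output counts as wrong
        correct[qtype] += hit
    return {q: correct[q] / total[q] for q in total}


# Made-up records for illustration; none of these come from the paper.
records = [
    ("single-choice", "Randomised controlled trial", "randomised controlled trial"),
    ("yes-no", "Yes", "no"),
    ("multiple-choice", ["memory", "attention"], ["attention", "memory"]),
    ("multiple-choice", ["memory"], ["attention", "memory"]),
    ("number", "12.0", "12"),
    ("number", "about twelve", "12"),
]
print(accuracy_by_type(records))
```

Under these assumed rules, a multiple-choice answer scores only when the full set of options matches and a number scores only when it parses to the same value, which is one plausible reason such formats are harder to get exactly right than a single label.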
