Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset

Document-based question answering (QA) is crucial for precise information retrieval. While some existing work evaluates how well large language models (LLMs) retrieve information from documents and answer questions about them, their performance on question types that require exact answer selection from predefined options or numerical extraction has yet to be fully assessed. In this paper, we focus on this underexplored setting and conduct an empirical analysis of LLMs (GPT-4 and GPT-3.5) on question types including single-choice, yes-no, multiple-choice, and number extraction questions over documents. We use the Cogtale dataset for evaluation, which provides human expert-tagged responses and thus offers a robust benchmark for precision and factual grounding. We find that LLMs, particularly GPT-4, can precisely answer many single-choice and yes-no questions given relevant context, demonstrating their efficacy in information retrieval tasks. However, their performance diminishes on multiple-choice and number extraction questions, which lowers overall performance on the task and indicates that these models may not be reliable for it. This limits the use of LLMs in applications that demand precise information extraction from documents, such as meta-analysis. These findings also hinge on the assumption that the retriever supplies the context needed for an accurate response, which underscores the need for further research on how retriever mechanisms affect question-answering performance. Our work offers a framework for ongoing dataset evaluation, helping ensure that LLM applications for information retrieval and document analysis continue to meet evolving standards.
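To make the four question formats concrete, the sketch below shows one way model answers could be scored against expert-tagged answers with exact matching. It is an illustration only: the abstract does not specify the scoring procedure, so the function names (`is_correct`, `accuracy_by_type`), the question-type labels, the normalization rules, and the toy records are all assumptions, not the authors' implementation or data.

```python
import re
from collections import defaultdict


def _norm(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(text).lower().split())


def _first_number(text):
    """Return the first number found in the text as a float, or None if there is none.

    Assumption: commas are thousands separators and are dropped before matching.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", str(text).replace(",", ""))
    return float(match.group()) if match else None


def is_correct(qtype, predicted, gold):
    """Exact-match check for one question (hypothetical scoring rule)."""
    if qtype in ("single_choice", "yes_no"):
        # One option must match the expert-tagged option.
        return _norm(predicted) == _norm(gold)
    if qtype == "multiple_choice":
        # The full set of selected options must match; a partial overlap counts as wrong.
        return {_norm(p) for p in predicted} == {_norm(g) for g in gold}
    if qtype == "number":
        # The first number in the model output must equal the expert-tagged value.
        value = _first_number(predicted)
        return value is not None and value == float(gold)
    raise ValueError(f"unknown question type: {qtype}")


def accuracy_by_type(records):
    """Compute accuracy per question type from (qtype, predicted, gold) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for qtype, predicted, gold in records:
        totals[qtype] += 1
        hits[qtype] += is_correct(qtype, predicted, gold)
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}


if __name__ == "__main__":
    # Toy records invented for illustration; they are not taken from the Cogtale dataset.
    toy = [
        ("yes_no", "Yes", "yes"),
        ("single_choice", "Randomized controlled trial", "randomized controlled trial"),
        ("multiple_choice", ["memory", "attention"], ["memory", "attention", "language"]),
        ("number", "The study enrolled 1,200 participants.", 1200),
    ]
    print(accuracy_by_type(toy))
```

Requiring the full option set for multiple-choice questions and an exact value for number extraction is a deliberately strict choice: a partially correct answer scores zero, which is one plausible reason these formats are harder to get right than single-choice or yes-no questions.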
