Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset

Document-based question-answering (QA) tasks are crucial for precise information retrieval. While some existing work evaluates how well large language models (LLMs) retrieve and answer questions from documents, their performance on QA types that require exact answer selection from predefined options and numerical extraction has yet to be fully assessed. In this paper, we focus on this underexplored setting and conduct an empirical analysis of LLMs (GPT-4 and GPT-3.5) on question types including single-choice, yes-no, multiple-choice, and number extraction questions from documents. We use the Cogtale dataset for evaluation, which provides human expert-tagged responses, offering a robust benchmark for precision and factual grounding. We found that LLMs, particularly GPT-4, can precisely answer many single-choice and yes-no questions given relevant context, demonstrating their efficacy in information retrieval tasks. However, their performance diminishes on multiple-choice and number extraction formats, which lowers overall performance on the task and indicates that these models may not be reliable for it. This limits the use of LLMs in applications that demand precise information extraction from documents, such as meta-analysis tasks. These findings also hinge on the assumption that the retriever supplies the context necessary for an accurate response, which underscores the need for further research on how effectively retrieval mechanisms support question-answering performance. Our work offers a framework for ongoing dataset evaluation, ensuring that LLM applications for information retrieval and document analysis continue to meet evolving standards.
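
To make the distinction between the question formats concrete, the following is a minimal sketch of how exact-match scoring against expert-tagged answers might look. It is an illustration under stated assumptions, not the scoring procedure used in this work: the function names, the normalization rule, and the choice to take the first number in a model response are all hypothetical. Yes-no questions are treated here as single-choice questions with two options.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def score_single_choice(predicted: str, gold: str) -> bool:
    """Single-choice and yes-no: the one selected option must match the gold label."""
    return normalize(predicted) == normalize(gold)


def score_multiple_choice(predicted: list[str], gold: list[str]) -> bool:
    """Multiple-choice: the full set of selected options must match, in any order."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in gold}


def score_number(predicted: str, gold: float, tol: float = 1e-9) -> bool:
    """Number extraction: compare the first number found in the response to the gold value.

    Assumption: thousands separators are commas and are stripped before matching.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", predicted.replace(",", ""))
    return match is not None and abs(float(match.group()) - gold) <= tol
```

Under this kind of scoring, a multiple-choice response is counted as correct only when every option matches, and a numerical response only when the extracted value matches, which is one plausible reason these formats are harder to satisfy than single-choice or yes-no questions.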
