Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset

Document-based question-answering (QA) tasks are crucial for precise information retrieval. While some existing work evaluates the performance of large language models (LLMs) on retrieving information and answering questions from documents, their performance on QA types that require exact answer selection from predefined options and numerical extraction has yet to be fully assessed. In this paper, we focus on this underexplored setting and conduct an empirical analysis of LLMs (GPT-4 and GPT-3.5) on question types including single-choice, yes-no, multiple-choice, and number extraction questions over documents. We use the Cogtale dataset for evaluation, which provides human expert-tagged responses, offering a robust benchmark for precision and factual grounding. We found that LLMs, particularly GPT-4, can precisely answer many single-choice and yes-no questions given relevant context, demonstrating their efficacy in information retrieval tasks. However, their performance diminishes on multiple-choice and number extraction formats, which lowers overall performance on the task and indicates that these models may not be reliable for it. This limits the use of LLMs in applications demanding precise information extraction from documents, such as meta-analysis tasks. These findings also hinge on the assumption that the retriever furnishes the context necessary for accurate responses, emphasizing the need for further research on how effectively retriever mechanisms enhance question-answering performance. Our work offers a framework for ongoing dataset evaluation, helping ensure that LLM applications for information retrieval and document analysis continue to meet evolving standards.
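
To make the four question types concrete, the sketch below shows one plausible way to score a model's answer against an expert-tagged answer by exact match. The abstract does not specify the scoring procedure, so everything here is an assumption for illustration: the function names, the normalization step, and the choice of strict set equality for multiple-choice questions are hypothetical and not taken from the paper or the Cogtale dataset.

```python
import re


def normalize(text):
    """Lowercase and trim so that ' Yes' and 'yes' compare equal (assumed rule)."""
    return text.strip().lower()


def score_single_choice(predicted, expected):
    """Exact match on a single option; yes-no questions can be scored the same way."""
    return normalize(predicted) == normalize(expected)


def score_multiple_choice(predicted, expected):
    """Strict set match: every expected option is selected and nothing extra."""
    return {normalize(p) for p in predicted} == {normalize(e) for e in expected}


def score_number(predicted, expected, tolerance=0.0):
    """Compare the first number in the model output with the expected value.

    Commas are removed first, which assumes they are thousands separators.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", predicted.replace(",", ""))
    if match is None:
        return False
    return abs(float(match.group()) - expected) <= tolerance
```

Under this strict scheme a multiple-choice answer that misses one option, or a numeric answer that is off by one, counts as wrong. That makes these two formats harder to score well on than single-choice or yes-no questions.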
