Recently, significant effort has been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly long-context reasoning. To facilitate this research, we propose \textbf{DetectiveQA}, a dataset specifically designed for narrative reasoning over long contexts. Drawing on detective novels averaging over 100k tokens, we construct a dataset of 1,200 human-annotated questions in both Chinese and English, each paired with reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric that enables a finer-grained evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent challenges in long-context reasoning and, in particular, in retrieving the evidence needed to answer a question. Our findings offer valuable insights into long-context reasoning and lay the groundwork for more rigorous evaluations.