This paper introduces the Life Scapes Reasoning Benchmark (LSR-Benchmark), a novel dataset for real-life scenario reasoning that aims to close the gap in artificial neural networks' ability to reason in everyday contexts. In contrast to domain-knowledge reasoning datasets, LSR-Benchmark comprises free-text questions rich in information about real-life scenarios, human behaviors, and character roles. The dataset consists of 2,162 questions collected from open-source online sources and manually annotated to improve quality. Experiments with state-of-the-art language models, such as gpt3.5-turbo and instruction fine-tuned llama models, evaluate performance on LSR-Benchmark. The results show that humans significantly outperform these models, indicating a persistent challenge for machine learning models in comprehending daily human life.