We carry out a series of experiments to test the multi-hop reasoning ability of large language models along three dimensions: selecting and combining external knowledge, handling non-sequential reasoning tasks, and generalising to data samples with larger numbers of hops. We evaluate GPT-3.5 on four reasoning benchmarks with Chain-of-Thought prompting and its variants. Our results reveal that, despite the impressive performance achieved by large language models on various reasoning tasks, they still suffer from severe drawbacks and show a large gap with human performance.
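Chain-of-Thought prompting prepends few-shot exemplars whose answers spell out the intermediate reasoning hops before the final answer, then appends the target question with a reasoning trigger. A minimal sketch of the prompt construction (no model call; the exemplar, question, and trigger phrase here are illustrative, not taken from the paper's benchmarks):

```python
# Minimal sketch of Chain-of-Thought (CoT) prompt construction for a
# multi-hop question. Builds the prompt string only; sending it to a
# model (e.g. GPT-3.5) is left out.

def build_cot_prompt(exemplars, question):
    """Concatenate few-shot exemplars, each with an explicit multi-hop
    rationale, before the target question and a CoT trigger phrase."""
    parts = []
    for q, rationale, answer in exemplars:
        parts.append(f"Q: {q}\nA: {rationale} So the answer is {answer}.")
    # The trailing trigger elicits step-by-step reasoning from the model.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

# A 2-hop exemplar: the rationale states each hop explicitly.
exemplars = [
    (
        "Who directed the film in which Leonardo DiCaprio "
        "played Jack Dawson?",
        "Leonardo DiCaprio played Jack Dawson in Titanic. "
        "Titanic was directed by James Cameron.",
        "James Cameron",
    )
]

prompt = build_cot_prompt(
    exemplars, "In which country was the author of Hamlet born?"
)
print(prompt)
```

Variants of CoT prompting change the exemplars or the trigger (e.g. zero-shot CoT drops the exemplars and keeps only the trigger phrase), which is why the prompt builder keeps the two parts separate.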