Can Large Language Models do Analytical Reasoning?

This paper examines how well current Large Language Models perform analytical reasoning on sports data. The task is to have a model count how many points each team scores in a quarter of an NBA or NFL game. Our findings are twofold. First, among the models we evaluated, GPT-4 is the most effective, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. We compare three prompting techniques against a divide-and-conquer approach and find the latter to be the most effective. The divide-and-conquer approach breaks play-by-play data into smaller, more manageable segments, solves each segment individually, and then aggregates the results. We also explore the Chain of Thought (CoT) strategy, which significantly improves accuracy for certain models, notably GPT-4 and Claude-2.1, but has negligible or even detrimental effects on others such as GPT-3.5 and Gemini-Pro. Second, to our surprise, most models, including GPT-4, struggle to count the total scores for NBA quarters accurately despite performing strongly on NFL quarter scores. This led us to investigate, through extensive experiments, the factors that determine the complexity of analytical reasoning tasks. We conclude that task complexity depends on the length of the context, the information density, and the presence of related information. Our research offers insights into the complexity of analytical reasoning tasks and suggests potential directions for developing future large language models.
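The divide-and-conquer idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Play` format, the function names, and the segment size are all assumptions made for the example, and `count_segment` is a deterministic stand-in for what would be one model call per segment in the actual experiments.

```python
from collections import Counter
from typing import Callable, Iterator

# Hypothetical play format: (team, points scored on this play).
Play = tuple[str, int]


def chunk(plays: list[Play], size: int) -> Iterator[list[Play]]:
    """Divide: split the play-by-play list into consecutive segments."""
    for start in range(0, len(plays), size):
        yield plays[start:start + size]


def count_segment(segment: list[Play]) -> Counter:
    """Solve one segment. Stand-in for a single model call (assumption)."""
    totals: Counter = Counter()
    for team, points in segment:
        totals[team] += points
    return totals


def quarter_totals(
    plays: list[Play],
    size: int = 10,
    solve: Callable[[list[Play]], Counter] = count_segment,
) -> dict[str, int]:
    """Aggregate: sum the per-segment tallies into per-team quarter totals."""
    totals: Counter = Counter()
    for segment in chunk(plays, size):
        totals.update(solve(segment))  # Counter.update adds counts
    return dict(totals)


# Made-up plays for illustration only.
example_plays = [("LAL", 2), ("BOS", 3), ("LAL", 0), ("BOS", 2), ("LAL", 3)]
example_totals = quarter_totals(example_plays, size=2)
```

Passing a different `solve` callable is where a model query would plug in; keeping each segment short is what limits the context length that a single call has to handle.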

