This paper examines how well cutting-edge large language models (LLMs) perform analytical reasoning on sports data. The task asks each model to count how many points each team scores in a quarter of an NBA or NFL game. Our major findings are twofold. First, among the models we evaluated, GPT-4 is the most effective, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. Specifically, we compare three prompting techniques with a divide-and-conquer approach and find the latter to be the most effective. The divide-and-conquer approach breaks play-by-play data into smaller, more manageable segments, solves each segment individually, and then aggregates the results. We also explore the Chain of Thought (CoT) strategy, which markedly improves accuracy for certain models, notably GPT-4 and Claude-2.1. However, CoT has negligible or even detrimental effects on other models such as GPT-3.5 and Gemini-Pro. Second, to our surprise, most models, including GPT-4, struggle to accurately count total scores for NBA quarters despite performing strongly on NFL quarters. This leads us to investigate, through extensive experiments, the factors that affect the complexity of analytical reasoning tasks; we conclude that task complexity depends on the length of the context, the information density, and the presence of related information. Our research provides insights into the complexity of analytical reasoning tasks and suggests potential directions for developing future large language models.
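The divide-and-conquer idea described in the abstract (split the play-by-play log into segments, score each segment, then sum the per-segment totals) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the play-line format (`<TEAM> ... (+<points>)`), the sample plays, the chunk size, and the names `chunk_plays`, `rule_based_scorer`, and `divide_and_conquer_score`. In particular, `rule_based_scorer` is a deterministic stand-in for the per-segment model call, so that the example runs without any LLM.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical play format (an assumption): "<TEAM> <description> (+<points>)".
# Lines without a trailing "(+N)" are treated as non-scoring plays.
_SCORE = re.compile(r"^(\w+)\b.*\(\+(\d+)\)\s*$")


def chunk_plays(plays: list[str], chunk_size: int) -> list[list[str]]:
    """Split the play-by-play log into consecutive segments."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [plays[i:i + chunk_size] for i in range(0, len(plays), chunk_size)]


def rule_based_scorer(segment: list[str]) -> Counter:
    """Stand-in for a per-segment model call: sum points per team."""
    totals: Counter = Counter()
    for play in segment:
        match = _SCORE.match(play)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


def divide_and_conquer_score(
    plays: list[str],
    score_segment: Callable[[list[str]], Counter],
    chunk_size: int = 3,
) -> dict[str, int]:
    """Score each segment independently, then aggregate the partial totals."""
    totals: Counter = Counter()
    for segment in chunk_plays(plays, chunk_size):
        totals.update(score_segment(segment))  # Counter.update adds counts
    return dict(totals)


# Invented sample data for illustration only.
QUARTER = [
    "HOME makes 3-pt jump shot (+3)",
    "AWAY misses layup",
    "AWAY makes 2-pt layup (+2)",
    "HOME makes free throw (+1)",
    "AWAY makes 3-pt jump shot (+3)",
    "HOME turnover",
    "HOME makes 2-pt dunk (+2)",
]

if __name__ == "__main__":
    print(divide_and_conquer_score(QUARTER, rule_based_scorer, chunk_size=3))
```

In an actual experiment, `score_segment` would presumably wrap a prompt to the model under test and parse its answer; keeping it as a parameter keeps the splitting and aggregation logic independent of which model is called.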