Chain-of-Thought (CoT) plays a crucial role in reasoning for math problem solving. We conduct a comprehensive examination of methods for designing CoT, comparing conventional natural language CoT with three styles of program CoT: the self-describing program, the comment-describing program, and the non-describing program. Furthermore, we investigate the impact of the programming language on program CoTs, comparing Python and Wolfram Language. Through extensive experiments on GSM8K, MATHQA, and SVAMP, we find that program CoTs are often more effective than natural language CoT in math problem solving. Notably, the best-performing combination with 30B parameters outperforms GPT-3.5-turbo by a significant margin. The results show that the self-describing program offers greater diversity and thus generally achieves higher performance. We also find that Python is a better choice of language than Wolfram Language for program CoTs. The experimental results provide a valuable guideline for future CoT designs that take both programming language and coding style into account. Our datasets and code are publicly available.
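To make the three program CoT styles concrete, the following minimal Python sketch solves one toy problem in each style. The reading of the style names used here is an assumption based on the names alone: self-describing is taken to mean variable names drawn from the question, comment-describing to mean abstract variable names explained by comments, and non-describing to mean abstract variable names with no explanation. The toy problem and all function and variable names are hypothetical and do not come from the datasets or the released code.

```python
# Hypothetical toy problem (not taken from GSM8K, MATHQA, or SVAMP):
# "Tom has 5 apples and buys 3 bags with 4 apples each.
#  How many apples does he have now?"


def self_describing():
    # Assumed style: variable names themselves describe the quantities.
    apples_initial = 5
    bags_bought = 3
    apples_per_bag = 4
    apples_bought = bags_bought * apples_per_bag
    apples_total = apples_initial + apples_bought
    return apples_total


def comment_describing():
    # Assumed style: abstract variable names, with comments carrying the meaning.
    v1 = 5        # apples Tom starts with
    v2 = 3        # number of bags bought
    v3 = 4        # apples in each bag
    v4 = v2 * v3  # apples bought in total
    v5 = v1 + v4  # apples Tom has now
    return v5


def non_describing():
    # Assumed style: abstract variable names and no explanatory comments.
    v1 = 5
    v2 = 3
    v3 = 4
    v4 = v2 * v3
    v5 = v1 + v4
    return v5


if __name__ == "__main__":
    for solve in (self_describing, comment_describing, non_describing):
        print(solve.__name__, solve())
```

All three functions perform the same computation; only the amount of natural language carried by identifiers and comments differs, which is the design axis the comparison above varies.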