Chain-of-Thought (CoT) plays a crucial role in reasoning for math problem solving. We conduct a comprehensive examination of methods for designing CoT, comparing conventional natural language CoT with various program CoTs, including the self-describing program, the comment-describing program, and the non-describing program. We also investigate the impact of the programming language on program CoTs by comparing Python and Wolfram Language. Through extensive experiments on GSM8K, MATHQA, and SVAMP, we find that program CoTs are often more effective than natural language CoT in math problem solving. Notably, the best-performing combination with 30B parameters beats GPT-3.5-turbo by a significant margin. The results show that the self-describing program offers greater diversity and thus can generally achieve higher performance. We also find that Python is a better choice of language than Wolfram Language for program CoTs. The experimental results provide a valuable guideline for future CoT designs that take into account both programming language and coding style. Our datasets and code are publicly available.
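The three program CoT styles are named above but not defined. As a minimal sketch, assuming the names refer to how a program documents its own reasoning (meaningful variable names, abstract names with comments, or abstract names alone), the difference might look as follows. The word problem, function names, and variable names are all hypothetical and are not taken from the datasets or the released code.

```python
# Hypothetical problem: "Alice has 3 boxes of 12 pencils each and gives
# away 5 pencils. How many pencils does she have left?"

def self_describing():
    # Assumed style: variable names themselves describe each quantity.
    boxes = 3
    pencils_per_box = 12
    pencils_given_away = 5
    total_pencils = boxes * pencils_per_box
    pencils_left = total_pencils - pencils_given_away
    return pencils_left

def comment_describing():
    # Assumed style: abstract variable names, explained by comments.
    v1 = 3          # number of boxes
    v2 = 12         # pencils in each box
    v3 = 5          # pencils given away
    v4 = v1 * v2    # total pencils
    v5 = v4 - v3    # pencils left
    return v5

def non_describing():
    # Assumed style: abstract variable names with no explanation.
    v1 = 3
    v2 = 12
    v3 = 5
    v4 = v1 * v2
    v5 = v4 - v3
    return v5

if __name__ == "__main__":
    for style in (self_describing, comment_describing, non_describing):
        print(style.__name__, style())
```

All three functions compute the same answer; under this reading, only the amount of natural language carried by the program differs, which is the design axis the comparison targets.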