Program of Thoughts (PoT) is an approach characterized by executable intermediate steps, which ensure the accuracy of the numerical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying on a single language may yield suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models; the effectiveness of each language depends on the specific scenario. Inspired by this, we propose a task- and model-agnostic approach called MultiPoT, which harnesses the strengths and diversity of multiple languages. Experimental results show that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves an average improvement of more than 4.6% on both Starcoder and ChatGPT (gpt-3.5-turbo).
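As a rough illustration of how answers from several languages might be combined, the sketch below assumes that a program has already been generated and executed in each language and that the final answer is picked by simple majority vote, in the spirit of Self-Consistency. The abstract does not specify the aggregation rule, so the voting scheme, the function names (`normalize`, `vote_across_languages`), the rounding tolerance, and the language names and outputs in the example are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def normalize(answer: Optional[str]) -> Optional[float]:
    """Parse one program's printed output as a number.

    Returns None when the program failed to run (answer is None)
    or printed something that is not a number.
    """
    if answer is None:
        return None
    try:
        # Rounding lets "42" and "42.0" count as the same vote
        # (the 6-digit tolerance is an assumption).
        return round(float(answer.strip()), 6)
    except ValueError:
        return None


def vote_across_languages(outputs: Dict[str, Optional[str]]) -> Optional[float]:
    """Majority vote over the numeric answers produced in each language.

    This is an assumed aggregation rule, not the paper's stated method.
    Ties go to the answer that was seen first.
    """
    votes = Counter(
        value for value in map(normalize, outputs.values()) if value is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical execution results for one question, keyed by language.
outputs = {
    "Python": "42.0",
    "R": "42",
    "C++": "41.999",
    "Java": None,  # program failed to run
    "JavaScript": "42",
}
result = vote_across_languages(outputs)
print(result)
```

In this toy input, three languages agree on the same value, so a single divergent answer and one failed execution do not change the result. This is the kind of robustness that drawing on several languages is meant to provide.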