Pragmatic reasoning is pervasive in human-human communication: it allows us to leverage shared knowledge and counterfactual reasoning to infer the intention of a conversational partner from an ambiguous or underspecified message. In human-computer communication, underspecified messages often pose a major challenge: for instance, translating natural language instructions into code is difficult when the instructions contain inherent ambiguities. In this paper, we scale up the pragmatic "Rational Speech Act" (RSA) framework to naturalistic language-to-code problems and propose a way of handling multiple meaning-equivalent instruction alternatives, an issue that does not arise in previous toy-scale problems. We evaluate our method, CodeRSA, with two recent LLMs (Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct) on two widely used code generation benchmarks (HumanEval and MBPP). Our experimental results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. Qualitative analyses indicate that it exhibits the desired behavior for the right reasons. These findings underscore the effectiveness of integrating pragmatic reasoning into a naturalistic, complex communication task, namely language-to-code generation. They offer a promising direction for enhancing code generation quality in LLMs and emphasize the importance of pragmatic reasoning in complex communication settings.
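For readers unfamiliar with the Rational Speech Act framework named in the abstract, the following is a minimal sketch of the textbook toy-scale RSA recursion (literal listener, pragmatic speaker, pragmatic listener). It is not the CodeRSA method itself, whose details the abstract does not give. The function names, the binary compatibility matrix, the uniform prior over programs, and the two-instruction example are all illustrative assumptions.

```python
def normalize(values):
    """Scale non-negative numbers so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def pragmatic_listener(truth, alpha=1.0):
    """Textbook RSA on a toy lexicon (hypothetical helper, not CodeRSA).

    truth[u][m] is 1 if instruction u is compatible with program m, else 0.
    Assumes every instruction fits at least one program and every program
    fits at least one instruction, and a uniform prior over programs.
    Returns L1, where L1[u][m] is the pragmatic probability of program m
    given instruction u.
    """
    n_u, n_m = len(truth), len(truth[0])
    # Literal listener L0(m | u): spread mass evenly over compatible programs.
    literal = [normalize(row) for row in truth]
    # Pragmatic speaker S1(u | m): prefers instructions that lead the
    # literal listener to m; alpha controls how sharply.
    speaker = [
        normalize([literal[u][m] ** alpha for u in range(n_u)])
        for m in range(n_m)
    ]
    # Pragmatic listener L1(m | u): renormalize the speaker scores over programs.
    return [
        normalize([speaker[m][u] for m in range(n_m)])
        for u in range(n_u)
    ]


# Toy example (assumed): two instructions, two candidate programs.
# Rows: "sort the list", "sort the list in descending order".
# Columns: sort_ascending, sort_descending.
truth = [
    [1, 1],  # the vague instruction fits both programs
    [0, 1],  # the specific instruction fits only the descending sort
]
listener = pragmatic_listener(truth)
print(listener[0])  # distribution over programs for "sort the list"
```

In this toy case the literal listener is indifferent between the two programs for "sort the list", while the pragmatic listener favors the ascending sort: a speaker who wanted the descending sort would more likely have used the more specific instruction.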