LLMs are widely used for code generation and mathematical reasoning, tasks that require structured output. They must reason about code, generate code from a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and draw samples until an appropriate program is obtained. Sampling $n$ programs from the language model this way requires $n$ GPU-intensive generations, which becomes prohibitively expensive as $n$ grows. In this work, we address this limitation by exposing the LLM's distribution within the generated programs themselves. We propose a novel test-time framework, which we call probabilistic programs of thought, that obtains more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Because probabilistic reasoning in this probabilistic program is much cheaper than LLM generation, our approach can sample new programs with no additional GPU compute and little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding, and mathematical reasoning, and report improved performance with fewer generations from the LLM.
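The core idea, one generated program plus next-token probabilities standing in for many deterministic programs, can be illustrated with a minimal Python sketch. This is not the authors' construction: it assumes, purely for illustration, that each uncertain token position can be treated as an independent categorical choice, which ignores the fact that real next-token probabilities are conditioned on the prefix. The names `POSITIONS`, `count_programs`, and `sample_program` are hypothetical, and the probabilities are made up.

```python
import math
import random
from typing import List, Tuple

# One token position: alternative tokens with (made-up) model probabilities.
Position = List[Tuple[str, float]]

# Hypothetical generated program with two uncertain positions.
# ASSUMPTION: positions are treated as independent choices.
POSITIONS: List[Position] = [
    [("def f(x):\n    return x ", 1.0)],
    [("+", 0.6), ("-", 0.3), ("*", 0.1)],
    [(" ", 1.0)],
    [("1", 0.7), ("2", 0.3)],
    [("\n", 1.0)],
]


def count_programs(positions: List[Position]) -> int:
    """Number of deterministic programs the compact form represents.

    The count is a product over positions, so it grows exponentially
    with the number of uncertain positions.
    """
    return math.prod(len(alts) for alts in positions)


def sample_program(positions: List[Position], rng: random.Random) -> Tuple[str, float]:
    """Draw one program on the CPU, with no further model calls.

    Returns the program text and its log-probability under the
    independent-positions assumption.
    """
    tokens = []
    log_prob = 0.0
    for alts in positions:
        weights = [p for _, p in alts]
        idx = rng.choices(range(len(alts)), weights=weights, k=1)[0]
        tokens.append(alts[idx][0])
        # Normalize in case the listed alternatives do not sum to 1.
        log_prob += math.log(weights[idx] / sum(weights))
    return "".join(tokens), log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    print("programs represented:", count_programs(POSITIONS))
    for _ in range(3):
        source, log_prob = sample_program(POSITIONS, rng)
        print(repr(source), round(log_prob, 3))
```

Each call to `sample_program` costs a handful of CPU operations, which is the contrast the abstract draws with running a further GPU generation per sample.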