Language models (LMs) can solve tasks such as answering questions about tables or images by writing programs. However, using only primitive functions often leads to verbose and error-prone programs, while higher-level functions require expert design. To enable better solutions without human labor, we ask code LMs to curate reusable high-level functions and use them to write solutions. We present TROVE, a training-free method of inducing a verifiable and efficient toolbox of functions by generating solutions while using, growing, and periodically trimming the toolbox. On 11 datasets spanning math, table question answering, and image reasoning tasks, TROVE consistently yields simpler solutions with higher accuracy than baselines using CodeLlama and previous methods using GPT, while using 79-98% smaller toolboxes. TROVE further enables 31% faster and 13% more accurate human verification than baselines. With the same pipeline, it creates diverse functions for varied tasks and datasets, providing insights into their individual characteristics.
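The use, grow, and trim cycle named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Toolbox` class, the `run_stream` helper, the `trim_every` and `min_uses` parameters, and the choice of usage count as the trimming criterion are all assumptions made for illustration, and hand-written functions stand in for functions that a code LM would generate.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class Toolbox:
    """Hypothetical toolbox: stores functions and counts how often each is used."""

    def __init__(self) -> None:
        self.functions: Dict[str, Callable[..., Any]] = {}
        self.usage: Dict[str, int] = {}

    def add(self, name: str, fn: Callable[..., Any]) -> None:
        # Grow: register a new function only if the name is not already present.
        if name not in self.functions:
            self.functions[name] = fn
            self.usage[name] = 0

    def call(self, name: str, *args: Any) -> Any:
        # Use: run a stored function and record the usage.
        self.usage[name] += 1
        return self.functions[name](*args)

    def trim(self, min_uses: int) -> None:
        # Trim: drop functions used fewer than min_uses times (assumed criterion).
        for name in [n for n, c in self.usage.items() if c < min_uses]:
            del self.functions[name]
            del self.usage[name]


# A task names the function to call, optionally supplies a new function, and gives arguments.
Task = Tuple[str, Optional[Callable[..., Any]], Tuple[Any, ...]]


def run_stream(tasks: List[Task], trim_every: int, min_uses: int) -> Tuple[List[Any], Toolbox]:
    """Process tasks in order, growing the toolbox and trimming it periodically."""
    box = Toolbox()
    results: List[Any] = []
    for i, (name, fn, args) in enumerate(tasks, start=1):
        if fn is not None:
            box.add(name, fn)
        results.append(box.call(name, *args))
        if i % trim_every == 0:
            box.trim(min_uses)
    return results, box


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def first(xs: List[Any]) -> Any:
    return xs[0]


tasks: List[Task] = [
    ("mean", mean, ([1, 2, 3],)),   # new function, then used
    ("mean", None, ([4, 6],)),      # reuse of an existing function
    ("first", first, ([7, 8],)),    # new function used only once
    ("mean", None, ([10],)),        # reuse again
]
results, box = run_stream(tasks, trim_every=4, min_uses=2)
print(results)
print(sorted(box.functions))
```

In this toy run, the rarely used `first` function is dropped at the trimming step while the reused `mean` function survives, which is the kind of pruning that would keep a toolbox small.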