Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
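The relationship between a question template, its neighbourhood, and its test oracle can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not taken from the Turbulence benchmark itself: the template wording, the names `instantiate`, `oracle`, `neighbourhood`, and `evaluate`, and the `stand_in_model` function (a deterministic substitute for an LLM that deliberately fails on one parameter instantiation) are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical question template; {lo} and {hi} are its parameters.
TEMPLATE = (
    "Write a function called 'sum_slice' that takes a list of integers and "
    "returns the sum of the elements from index {lo} to index {hi}, inclusive."
)

Solution = Callable[[List[int]], int]


def instantiate(lo: int, hi: int) -> str:
    """Produce one concrete question from the template."""
    return TEMPLATE.format(lo=lo, hi=hi)


def neighbourhood(max_index: int) -> List[Tuple[int, int]]:
    """All parameter instantiations with 0 <= lo <= hi < max_index."""
    return [(lo, hi) for lo in range(max_index) for hi in range(lo, max_index)]


def oracle(candidate: Solution, lo: int, hi: int) -> bool:
    """Test oracle: compare the candidate with a reference on fixed inputs."""
    inputs = [list(range(10)), [5] * 10, [-3, 7, 0, 2, 9, -1, 4, 8, 6, 1]]
    try:
        return all(candidate(xs) == sum(xs[lo:hi + 1]) for xs in inputs)
    except Exception:
        # A crashing solution counts as incorrect.
        return False


def stand_in_model(prompt: str) -> Solution:
    """Assumed stand-in for an LLM: correct except for one instantiation."""
    lo, hi = (int(n) for n in re.findall(r"index (\d+)", prompt))
    if lo == 0 and hi == 0:
        return lambda xs: 0  # deliberately wrong for this single question
    return lambda xs: sum(xs[lo:hi + 1])


def evaluate(model: Callable[[str], Solution],
             max_index: int) -> Dict[Tuple[int, int], bool]:
    """Ask every question in the neighbourhood and record the oracle verdict."""
    return {
        (lo, hi): oracle(model(instantiate(lo, hi)), lo, hi)
        for lo, hi in neighbourhood(max_index)
    }


results = evaluate(stand_in_model, max_index=4)
failures = [params for params, ok in results.items() if not ok]
print(f"passed {sum(results.values())} of {len(results)}; failures: {failures}")
```

In this sketch the stand-in model solves nine of the ten questions in the neighbourhood and fails only for the instantiation `(0, 0)`, which is the kind of result the abstract calls an anomaly. With a real LLM, `stand_in_model` would be replaced by a call that sends the prompt to the model and extracts the returned function.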

