As LLMs become increasingly prevalent, it is interesting to consider how ``creative'' these models can be. In cognitive science, creativity comprises at least two key characteristics: \emph{convergent} thinking (purposefulness to achieve a given goal) and \emph{divergent} thinking (adaptability to new environments or constraints) \citep{runco2003critical}. In this work, we introduce a framework for quantifying LLM creativity that incorporates both characteristics. This is achieved with (1) Denial Prompting, which pushes LLMs toward more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling them to adopt new strategies, and (2) the NeoGauge metric, which we define and compute to examine both convergent and divergent thinking in the creative responses generated by LLMs. We apply the proposed framework to Codeforces problems, a natural data source for collecting human coding solutions. We quantify NeoGauge for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NeoCoder dataset for reproducing our results on future models.
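The Denial Prompting procedure described above can be sketched as a simple loop: each round forbids the technique observed in the previous solution, so the model must find a new strategy. The sketch below is a minimal illustration under stated assumptions; \texttt{query\_llm} and \texttt{extract\_technique} are hypothetical stand-ins (the paper's actual prompts and technique detector are not shown here).

```python
def query_llm(problem, constraints):
    """Hypothetical stand-in for an LLM call: returns a solution
    string that is assumed to respect the accumulated constraints."""
    return f"solution to {problem!r} avoiding {sorted(constraints)}"

def extract_technique(solution):
    """Hypothetical analyzer naming the dominant technique in a
    solution; a real implementation would inspect the generated code."""
    return f"technique#{len(solution) % 5}"

def denial_prompting(problem, rounds=3):
    """Denial Prompting loop: each round denies the technique used in
    the previous solution, compelling a new strategy next round."""
    constraints, solutions = set(), []
    for _ in range(rounds):
        solution = query_llm(problem, constraints)
        solutions.append(solution)
        constraints.add(extract_technique(solution))  # forbid it next round
    return solutions

solutions = denial_prompting("sort an array", rounds=3)
```

Each element of \texttt{solutions} is produced under a strictly larger (or equal) constraint set than the one before it, which is the property NeoGauge evaluates for convergent and divergent thinking.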