AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Ori Press,Brandon Amos,Haoyu Zhao,Yikai Wu,Samuel K. Ainsworth,Dominik Krupke,Patrick Kidger,Touqir Sajed,Bartolomeo Stellato,Jisun Park,Nathanael Bosch,Eli Meril,Albert Steppi,Arman Zharmagambetov,Fangzhao Zhang,David Perez-Pineiro,Alberto Mercurio,Ni Zhan,Talor Abramovich,Kilian Lieret,Hanlin Zhang,Shirley Huang,Matthias Bethge,Ofir Press

Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it, profiles performance, verifies correctness on tests, and selects the fastest valid version. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, sk-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.

翻译：尽管语言模型（LM）能力已取得进展，但现有评估主要关注模型在人类已解决任务上的表现，包括编程（Jimenez等人，2024）和数学（Glazer等人，2024）领域。为此，我们提出在开放式基准测试中检验模型设计与实现算法的能力：要求语言模型编写能高效解决计算机科学、物理学和数学领域计算挑战性问题的代码。AlgoTune基准包含154个从领域专家收集的编程任务，以及用于验证和计时LM合成解决方案代码的框架，这些代码将与主流开源软件包的参考实现进行对比。此外，我们开发了基线LM智能体AlgoTuner，并在前沿模型套件中评估其性能。AlgoTuner采用简单的预算循环机制，通过编辑代码、编译运行、性能剖析、测试验证等步骤，最终选择最快的有效版本。相较于使用SciPy、sk-learn和CVXPY等库的参考求解器，AlgoTuner实现了平均1.72倍的加速。然而，我们发现当前模型未能实现算法创新，更倾向于表面层级的优化。我们期待AlgoTune能推动开发出具备创造性问题解决能力、超越人类最先进水平的语言模型智能体。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日