AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation

Besides humans and machines, Artificial Intelligence (AI) models have emerged as another important audience for programming languages in the era of large language models (LLMs). LLMs can now excel at coding competitions and even program like developers to address various tasks, such as mathematical calculation. Yet the grammar and layout of existing programs are designed for humans. In particular, abundant grammar tokens and formatting tokens are included to make the code more readable to humans. While beneficial to people, this human-centric design imposes an unnecessary computational burden on LLMs, because every token, whether consumed or generated, costs computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar, which aims to represent code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formatting and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named Simple Python (SimPy). SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain Abstract Syntax Tree (AST) structures identical to those of standard Python, allowing execution via a modified AST parser. In addition, we explore methods that enable existing LLMs to proficiently understand and use SimPy, and that ensure the changes remain imperceptible to human developers. Compared with the original Python, SimPy not only reduces token usage by 13.5% and 10.4% for CodeLlama and GPT-4, respectively, but also enables performance equivalent to, or even better than, that of models trained on Python code.
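The premise that formatting carries no semantics can be checked with Python's standard ast module. The sketch below is not SimPy and does not use an LLM tokenizer; the sample function, the helper name same_ast, and the use of character counts as a rough stand-in for token counts are all illustrative assumptions. It only shows that a human-friendly layout and a compact layout of the same code parse to the same AST, while the compact one is shorter.

```python
import ast

# Human-oriented layout: 4-space indents, a comment, a blank line.
verbose = '''
def add(a, b):
    # Sum two numbers.
    result = a + b

    return result
'''

# Same program with the formatting stripped to the minimum that
# standard Python still accepts (illustrative only, not SimPy).
compact = "def add(a,b):\n result=a+b\n return result\n"


def same_ast(src_a: str, src_b: str) -> bool:
    """Return True if both sources parse to structurally equal ASTs.

    ast.dump omits line and column attributes by default, so only the
    tree structure is compared, not the layout.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


if __name__ == "__main__":
    print("identical AST:", same_ast(verbose, compact))
    # Character count is only a crude proxy for LLM token count.
    print("characters:", len(verbose), "vs", len(compact))
```

The actual savings reported in the abstract depend on the model's tokenizer and on SimPy's grammar changes, neither of which this sketch reproduces.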
