Large language models (LLMs) such as ChatGPT are increasingly proficient at understanding and generating a mixture of code and text. Evaluation based on such a $\textit{mixture}$ can give a more comprehensive picture of the models' abilities in solving coding problems. However, current evaluation methods in this setting either cover a limited range of tasks or lack standardization. To address this issue, we propose using category theory as a framework for evaluation. Specifically, morphisms within a code category can represent code debugging and transformation, functors between two code categories can represent code translation, and functors between a code category and a natural language category can represent code generation, explanation, and reproduction. We present an automatic evaluation framework called $\textbf{CatCode}$ ($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.
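To make the categorical vocabulary concrete, the following is a minimal sketch and not the CatCode implementation. It assumes, purely for illustration, that a "program" is a Python function from `int` to `int`, that functional equivalence can be approximated by comparing outputs on sample inputs, and that a debugging morphism maps a faulty program to a correct one. The names `Program`, `equivalent`, `debug`, and the three `sum_*` functions are all hypothetical.

```python
from typing import Callable, Iterable

# Assumption: a "program" is modeled as an int -> int function.
Program = Callable[[int], int]


def equivalent(p: Program, q: Program, inputs: Iterable[int]) -> bool:
    """Approximate functional equivalence by comparing outputs on sample inputs."""
    return all(p(x) == q(x) for x in inputs)


# Two objects in a toy "code category": same behavior, different implementation.
def sum_loop(n: int) -> int:
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2


# A faulty variant: it omits n itself, so it disagrees for n >= 1.
def sum_buggy(n: int) -> int:
    return sum(range(n))


def debug(p: Program) -> Program:
    """Hypothetical debugging morphism; a stand-in for a model's repair step."""
    return sum_formula if p is sum_buggy else p


INPUTS = range(0, 20)

# A transformation morphism should preserve behavior.
transformation_ok = equivalent(sum_loop, sum_formula, INPUTS)
# A debugging morphism should land on a program equivalent to the reference.
debugging_ok = equivalent(debug(sum_buggy), sum_loop, INPUTS)
```

In this toy setting, an evaluator would replace `debug` with a call to the model under test and keep the equivalence check as the scoring criterion. Checking a translation functor would follow the same pattern, with the equivalence test run on the translated programs in the target language.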