In scenario-based evaluation of machine learning models, a key problem is how to construct test datasets that represent different scenarios. The methodology proposed in this paper is to construct a benchmark and attach metadata to each test case. A test system can then be built with test morphisms that filter the test cases by their metadata to form a dataset for a given scenario. The paper demonstrates this methodology with large language models for code generation. A benchmark called ScenEval is constructed from problems in textbooks, an online tutorial website and Stack Overflow. Filtering by scenario is demonstrated, and the resulting test sets are used to evaluate ChatGPT on Java code generation. Our experiments found that the performance of ChatGPT decreases as the complexity of the coding task increases. It is weakest on advanced topics such as multi-threading, data structure algorithms and recursive methods. The Java code generated by ChatGPT tends to be much shorter than the reference solution in number of lines. When the generated code is correct, it is more likely to be more complex than the reference solution in both cyclomatic and cognitive complexity; when it is incorrect, it is more likely to be less complex than the reference solution.
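The idea of filtering a metadata-annotated benchmark into scenario-specific test sets can be illustrated with a minimal sketch. It is not ScenEval's actual schema or the paper's test system: the `BenchmarkCase` fields, the numeric `complexity` score, the sample tasks, and the helpers `filter_by` and `compose` are all hypothetical, and a "test morphism" is modeled here simply as a function from a dataset to a dataset.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark entry: a task plus metadata (all fields are assumptions)."""
    task_id: str
    description: str
    source: str      # e.g. "textbook", "tutorial", "stackoverflow"
    topic: str       # e.g. "recursion", "multi-threading"
    complexity: int  # hypothetical ordinal difficulty score


# A filter-style morphism maps a dataset to a sub-dataset.
Morphism = Callable[[List[BenchmarkCase]], List[BenchmarkCase]]


def filter_by(predicate: Callable[[BenchmarkCase], bool]) -> Morphism:
    """Build a morphism that keeps only the cases whose metadata satisfy predicate."""
    def apply(dataset: List[BenchmarkCase]) -> List[BenchmarkCase]:
        return [case for case in dataset if predicate(case)]
    return apply


def compose(*morphisms: Morphism) -> Morphism:
    """Chain morphisms left to right; with no arguments, the dataset is returned as is."""
    def apply(dataset: List[BenchmarkCase]) -> List[BenchmarkCase]:
        for morphism in morphisms:
            dataset = morphism(dataset)
        return dataset
    return apply


# Illustrative entries only; these are not taken from the real benchmark.
BENCHMARK = [
    BenchmarkCase("t1", "Reverse a string", "textbook", "strings", 1),
    BenchmarkCase("t2", "Compute a factorial recursively", "textbook", "recursion", 2),
    BenchmarkCase("t3", "Producer-consumer with threads", "stackoverflow", "multi-threading", 4),
    BenchmarkCase("t4", "Solve the Tower of Hanoi", "tutorial", "recursion", 3),
]

# Scenario: harder recursion tasks.
hard_recursion = compose(
    filter_by(lambda case: case.topic == "recursion"),
    filter_by(lambda case: case.complexity >= 3),
)

scenario_dataset = hard_recursion(BENCHMARK)
print([case.task_id for case in scenario_dataset])  # prints ['t4']
```

Keeping each filter as a small composable function means a new scenario is defined by combining predicates over the metadata, without touching the benchmark itself.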