Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis

ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform of algorithmic coding challenges used for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages such as Python, Java, and C++ than in less common ones such as Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we run automated experiments in which Python scripts generate prompts instructing ChatGPT to write Python solutions. The solutions are stored and then submitted manually to LeetCode to check their correctness. For Hypothesis 1, the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, we observe improvements of 14-29% with chain-of-thought prompting, 38-60% when failed test cases are provided in a second feedback prompt, and 33-58% when switching to GPT-4. For Hypothesis 3, on a random subset of problems that ChatGPT solved in Python, it also solved 78% in Java and 50% in C++, but none in Elixir, Erlang, or Racket. These findings generally support all three hypotheses.
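The prompt-generation step described above can be illustrated with a minimal sketch. It only builds prompt strings for the three conditions named in the abstract (a plain request, a chain-of-thought variant, and a second feedback prompt listing failed test cases) and makes no API call. The `Problem` class, the function names, and the prompt wording are all assumptions made for illustration; they are not the prompts or scripts used in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    """Hypothetical container for one LeetCode problem (field names are assumptions)."""
    title: str
    difficulty: str  # "easy", "medium", or "hard"
    statement: str
    starter_code: str


def build_baseline_prompt(problem: Problem, language: str = "Python") -> str:
    """Plain request for a solution in the given language (wording is illustrative)."""
    return (
        f"Solve the following LeetCode problem in {language}.\n\n"
        f"Title: {problem.title}\n"
        f"Problem:\n{problem.statement}\n\n"
        f"Complete this starter code:\n{problem.starter_code}\n"
    )


def build_cot_prompt(problem: Problem, language: str = "Python") -> str:
    """Chain-of-thought variant: the baseline prompt plus a request to reason first."""
    return (
        build_baseline_prompt(problem, language)
        + "\nFirst explain your reasoning step by step, then give the final code.\n"
    )


def build_feedback_prompt(previous_solution: str, failed_cases: List[str]) -> str:
    """Second-round prompt that reports the test cases the first attempt failed."""
    cases = "\n".join(f"- {case}" for case in failed_cases)
    return (
        "Your previous solution failed these test cases:\n"
        f"{cases}\n\n"
        f"Previous solution:\n{previous_solution}\n\n"
        "Please return a corrected solution.\n"
    )
```

Keeping the builders as pure functions means the same problem record can be rendered for each condition or target language, and the resulting strings can be logged alongside the stored solutions before they are submitted.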
