This paper presents a performance evaluation of Large Language Models (LLMs) on programming challenges from LeetCode, a widely used platform for algorithm practice and technical interview preparation. We first crawled the LeetCode website to collect a set of problems spanning multiple difficulty levels and topics. Using this dataset, we generated solutions with several LLMs, including GPT-4 and GPT-3.5-turbo (ChatGPT-turbo), and systematically evaluated them for correctness and efficiency. We used the pass@k metric to measure the success rate within a given number of attempts and analyzed the runtime performance of the generated solutions. Our results highlight the strengths and limitations of current LLMs [10] in code generation and problem solving, and they offer insight into potential applications and areas for improvement in automated programming assistance.
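As an illustration of the pass@k metric mentioned above, the following is a minimal sketch of one common way to compute it: the unbiased estimator widely used in code-generation benchmarks, pass@k = 1 - C(n-c, k) / C(n, k), where n solutions are sampled per problem and c of them pass all tests. The abstract does not state which estimator was used, so the function names `pass_at_k` and `mean_pass_at_k` and the choice of estimator are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that passed all tests
    k: number of attempts allowed

    Assumption: this is the unbiased estimator 1 - C(n-c, k) / C(n, k),
    not necessarily the computation used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with 10 samples of which 3 are correct, `pass_at_k(10, 3, 1)` reduces to the plain fraction of correct samples, 3/10. Averaging per-problem estimates, as `mean_pass_at_k` does, gives a single score for a whole problem set.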