Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference for Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality, underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.