Functional programming provides strong foundations for developing reliable and secure software systems, yet its adoption remains limited due to a steep learning curve. Recent advances in Large Language Models (LLMs) for code generation present new opportunities to lower these barriers. However, existing evaluations of LLMs largely focus on imperative programming languages, and their capabilities in functional programming (FP) languages remain underexplored. To address this gap, we introduce FPEval, a holistic evaluation framework built on FPBench, a new benchmark of 721 programming tasks across three difficulty levels in three mainstream FP languages: Haskell, OCaml, and Scala. FPEval provides comprehensive evaluation infrastructure, combining validation against extensive test suites with static analysis tools, to assess both functional correctness and code style and maintainability. Using this framework, we evaluate state-of-the-art LLMs, including GPT-3.5, GPT-4o, and GPT-5, on code generation in functional programming languages, with Java as an imperative baseline. Our results show that LLM performance in functional programming improves substantially with model advancement; however, error rates remain significantly higher in purely functional languages (Haskell and OCaml) than in hybrid (Scala) or imperative (Java) languages. Moreover, LLMs frequently generate non-idiomatic functional code that follows imperative patterns, raising concerns about code style and long-term maintainability. Finally, we show that LLMs can partially self-repair both correctness and quality issues when provided with static analysis feedback and hand-crafted instructions for common issue types.