RAL-Bench: Benchmarking for Application-Level Functional Correctness and Non-Functional Quality Attributes

Code generation has advanced rapidly with code-focused large language models (LLMs), especially on snippet-level tasks. However, application-level generation requires producing a runnable multi-file repository with correct structure, dependencies, and end-to-end executability, and real-world software must satisfy both functional correctness and non-functional quality (e.g., maintainability, security). Existing benchmarks offer only limited execution-based assessment of these requirements at the application level. We ask: can current LLMs generate application-level repositories that meet both functional and non-functional criteria? We propose RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, we distill a concise natural-language requirement from a high-quality reference project, build black-box system tests covering functional and non-functional attributes, and keep only tests that pass on the reference repository to ensure a sound oracle and an end-to-end executable suite. Functional correctness is measured by system-test pass rate. Non-functional quality is measured along five ISO/IEC 25010-inspired dimensions and aggregated with an Analytic Hierarchy Process (AHP)-derived weight vector, with per-dimension diagnostics and baseline-normalized scoring using reference measurements. Across 16 LLMs evaluated zero-shot with greedy decoding, functional correctness is the dominant bottleneck: no model exceeds a 45% functional pass rate under our requirement-driven, reference-validated tests. We release RAL-Bench at https://github.com/Wwstarry/RAL-Bench.
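
The scoring scheme described above can be illustrated with a minimal sketch. It is not the RAL-Bench implementation: the function names, the row geometric-mean approximation for AHP weights, the ratio-based normalization, and the cap at 1.0 are all assumptions made for illustration, and the abstract does not specify the pairwise comparison matrix or the exact normalization formula.

```python
import math


def functional_pass_rate(results):
    """Fraction of system tests that passed (results is a list of booleans)."""
    return sum(1 for r in results if r) / len(results) if results else 0.0


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    `pairwise` is a square reciprocal comparison matrix (hypothetical values;
    the abstract does not give the matrix used by the authors).
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def baseline_normalized(measured, reference, higher_is_better=True):
    """Score a measurement relative to the reference repository.

    Assumed convention: 1.0 means "at least as good as the reference";
    values are capped at 1.0 and non-positive inputs score 0.0.
    """
    if measured <= 0 or reference <= 0:
        return 0.0
    ratio = measured / reference if higher_is_better else reference / measured
    return min(1.0, ratio)


def non_functional_score(dimension_scores, weights):
    """Weighted sum of per-dimension scores using the AHP-derived weights."""
    return sum(w * s for w, s in zip(weights, dimension_scores))
```

For example, a generated repository whose runtime is twice that of the reference would receive `baseline_normalized(8.0, 4.0, higher_is_better=False)`, i.e. 0.5 on that (hypothetical) dimension, before being combined with the other dimensions through the weight vector.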
