Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods that enable LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods: they focus on assessing what knowledge LLMs already possess rather than how they acquire and apply new knowledge, and they lack explicit knowledge corpora for developing such methods. To this end, we present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-BENCH covers 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks, including domain code generation (from function level to project level, with rigorous test suites) and domain knowledge understanding (via multiple-choice Q&A). Unlike previous benchmarks, which only provide test sets for direct evaluation, KOCO-BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from the knowledge corpora to solve the evaluation tasks. Our evaluations reveal that KOCO-BENCH poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. The best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO-bench.