Large language models (LLMs) excel at general programming but struggle with domain-specific software development, which calls for domain specialization methods that let LLMs learn and use domain knowledge and data. However, existing domain-specific code benchmarks cannot evaluate the effectiveness of such methods: they assess what knowledge LLMs already possess rather than how they acquire and apply new knowledge, and they lack the explicit knowledge corpora needed to develop domain specialization methods. To this end, we present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, and pairs curated knowledge corpora with multi-granularity evaluation tasks: domain code generation (from function level to project level, with rigorous test suites) and domain knowledge understanding (via multiple-choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from the knowledge corpora to solve the evaluation tasks. Our evaluations reveal that KOCO-BENCH poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. The best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO-bench.