Large language models (LLMs) have substantially improved software engineering, yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. We propose an agentic judging pipeline that uses a strong LLM as a scalable proxy for expert architectural evaluation. It comprises two judges: the Architecture Complexity Judge (ACJ), which estimates how much codebase-specific architectural understanding a task demands, and the Architecture Quality Judge (AQJ), which evaluates how well a patch conforms to repository-specific architectural conventions using source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B on 3,360 curated instances achieves resolved rates of up to 27.2% on SWE-bench Verified, an improvement of up to 540% over the base model and up to 256% over unfiltered fine-tuning. The trained models also show strong cross-language generalization and consistent improvements in architectural patch quality.