Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production-deployed LaaJs can miss domain-critical errors, revealing consistent blind spots in their evaluation capabilities. To characterize these blind spots, we analyze generated COBOL programs and the associated LaaJ judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain-specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judge's prompt to encourage the LaaJ to revisit aspects it may have overlooked. Experiments on a test set of 100 programs with four production-level LaaJs show that a LaaJ alone detects only about 45-63% of the errors present in the code (across all judges we tested), while the analytic checker alone lacks explanatory depth. When combined, the LaaJ+Hints configuration achieves up to 74% coverage (for the best-performing judge and injection prompt) and produces qualitatively richer, more accurate explanations, demonstrating that analytic-LLM hybrids can substantially enhance evaluation reliability in deployed pipelines. We release the dataset and all prompts used.