MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization

Large language models, though successful at code generation, struggle with code quality analysis because they are limited by static training data and cannot easily adapt to evolving best practices. We introduce MetaLint, an instruction-following framework that formulates code quality analysis as the task of detecting and fixing problematic semantic code fragments, or code idioms, based on high-level specifications. Unlike conventional approaches that train models on static code quality conventions, MetaLint applies instruction tuning to synthetic linter-generated data with dynamic conventions. This supports easy-to-hard generalization, enabling models to adapt to novel or complex code patterns without retraining. To evaluate this, we construct a benchmark of challenging idioms inspired by real-world coding standards such as Python Enhancement Proposals (PEPs) and assess whether MetaLint-trained models reason adaptively or simply memorize. Our results show that MetaLint training improves generalization to unseen idioms. Qwen3-4B attains a 70.37% F-score on a manually curated and challenging PEP idiom detection benchmark, achieving the highest recall (70.43%) among all evaluated models. For localization, it reaches 26.73%, a strong outcome for a 4B-parameter model and comparable to larger state-of-the-art models such as o3-mini, highlighting its potential for future-proof code quality analysis. Furthermore, MetaLint training enables generalization in idiom detection across model families, model scales, synthetic data from diverse linters, and Java idioms, demonstrating the general applicability of our approach. We plan to release our code and data to enable reproducibility and further work.
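To make the task framing concrete, the following is a minimal sketch of how a linter finding might be turned into an instruction-tuning example, together with a set-based F-score over detected idioms. It is not the MetaLint data format or evaluation code: the `Violation` fields, the prompt wording, the `NO VIOLATIONS FOUND` target, and the function names are all assumptions made for illustration, and the rule identifier `E711` is used only as an example label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    """One linter finding (hypothetical schema, not the paper's)."""
    idiom_id: str  # illustrative rule identifier, e.g. "E711"
    line: int      # 1-based line number in the source
    message: str


def build_example(idiom_spec: str, source: str,
                  violations: list[Violation]) -> dict[str, str]:
    """Pair a high-level idiom specification and code with the expected answer."""
    # Number the lines so the answer can point at a location.
    numbered = "\n".join(
        f"{i:>3} {text}"
        for i, text in enumerate(source.splitlines(), start=1)
    )
    prompt = (
        "Check the code below against this idiom specification.\n\n"
        f"Specification:\n{idiom_spec}\n\n"
        f"Code:\n{numbered}\n"
    )
    if violations:
        ordered = sorted(violations, key=lambda v: v.line)
        target = "\n".join(
            f"{v.idiom_id} line {v.line}: {v.message}" for v in ordered
        )
    else:
        # Assumed placeholder answer for code that satisfies the specification.
        target = "NO VIOLATIONS FOUND"
    return {"prompt": prompt, "target": target}


def detection_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over sets of detected idiom identifiers (one possible definition)."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, changing only the specification text changes the task, which is one way to read "dynamic conventions": the model is asked to follow the stated rule rather than recall a fixed rule set.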
