Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: they depend on specific pylint or Python versions, are poorly packaged, and require manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need grows for automated methodology checks such as detecting data leakage, incorrect cross-validation, and missing random seeds. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (a small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged), with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.
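To make the notion of preprocessing leakage concrete, the following is a minimal sketch of the kind of bug in question, not scicode-lint's own detection pattern. The toy data and the helper names `fit_scaler` and `transform` are hypothetical and use only the Python standard library. Both variants run without error, which is why such bugs escape conventional linters: the leaky version merely estimates its scaling statistics from data that includes the test split.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate standardization parameters (mean, population std) from values."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values with previously estimated parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: the last two points play the role of a held-out test set.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = data[:4], data[4:]

# Leaky: statistics are estimated on train + test, so test information
# influences how the training data is scaled.
leaky_params = fit_scaler(data)

# Correct: statistics are estimated on the training split only and then
# reused unchanged for the test split.
clean_params = fit_scaler(train)

leaky_train = transform(train, leaky_params)
clean_train = transform(train, clean_params)
clean_test = transform(test, clean_params)

print("leaky mean used for scaling:", leaky_params[0])
print("clean mean used for scaling:", clean_params[0])
```

The two means differ substantially on this toy data, yet nothing in the leaky code path is syntactically or type-wise wrong; only the order of fitting and splitting distinguishes the two.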