Methodology bugs in scientific Python code produce plausible but incorrect results, and traditional linters and static analysis tools cannot detect them. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: they depend on specific pylint or Python versions, offer limited packaging, and require manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (a small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall. On 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged), with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.
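To make the notion of preprocessing leakage concrete, the following is a minimal illustrative sketch using only the Python standard library. It is not taken from scicode-lint and does not show how the tool detects anything; the synthetic data, the variable names, and the `standardize` helper are all assumptions introduced here. It contrasts standardization statistics computed on the full dataset (leaky) with statistics computed on the training split only.

```python
import random
import statistics


def standardize(values, mean, std):
    """Scale values using externally supplied statistics (hypothetical helper)."""
    return [(v - mean) / std for v in values]


# Fixed seed so the run is reproducible; a missing seed is itself one of the
# methodology issues named in the abstract.
rng = random.Random(0)

# Synthetic data (an assumption for illustration): the held-out portion is
# drawn from a shifted distribution so the effect of leakage is easy to see.
data = [rng.gauss(0.0, 1.0) for _ in range(80)] + [rng.gauss(5.0, 1.0) for _ in range(20)]
train, test = data[:80], data[80:]

# Leaky: the statistics are computed over train AND test, so information
# about the held-out data flows into the preprocessing step.
leaky_mean = statistics.fmean(data)
leaky_std = statistics.pstdev(data)

# Correct: the statistics come from the training split only and are then
# reused unchanged on the test split.
clean_mean = statistics.fmean(train)
clean_std = statistics.pstdev(train)

leaky_test = standardize(test, leaky_mean, leaky_std)
clean_test = standardize(test, clean_mean, clean_std)

print(f"mean used for scaling: leaky={leaky_mean:.2f}, train-only={clean_mean:.2f}")
```

Both variants run without error and return well-formed numbers, which is why this class of bug produces plausible output and is invisible to syntax-level checks.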