Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness, and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluate code errors, security vulnerabilities, maintainability, and development practices. We find that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated: the top five issues account for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment (CI/CD) pipelines correlates with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
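The production-readiness criterion and the concentration figure described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that "critical errors" correspond to Pylint messages of type `error` or `fatal`, that "high-severity" corresponds to Bandit results with `issue_severity` equal to `HIGH`, and that issues are grouped by Bandit `test_id`. The function names `is_production_ready` and `top_k_share` are hypothetical.

```python
from collections import Counter

# Assumption: "critical" maps to Pylint's "error" and "fatal" message types.
CRITICAL_PYLINT_TYPES = {"error", "fatal"}


def is_production_ready(pylint_messages, bandit_report):
    """Return True if a repository has no critical errors and no
    high-severity security findings.

    pylint_messages: list of dicts, as parsed from Pylint's JSON output.
    bandit_report:   dict with a "results" list, as parsed from Bandit's
                     JSON output.
    """
    critical_errors = sum(
        1 for msg in pylint_messages
        if msg.get("type") in CRITICAL_PYLINT_TYPES
    )
    high_severity = sum(
        1 for res in bandit_report.get("results", [])
        if res.get("issue_severity") == "HIGH"
    )
    return critical_errors == 0 and high_severity == 0


def top_k_share(bandit_reports, k=5):
    """Fraction of all security findings covered by the k most frequent
    issue types (grouped by Bandit test_id, an assumption of this sketch)."""
    counts = Counter(
        res["test_id"]
        for report in bandit_reports
        for res in report.get("results", [])
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total
```

The inputs could be produced by running `pylint --output-format=json` and `bandit -r <path> -f json` on each repository and parsing the results with `json.loads`. The exact thresholds and groupings used in the study may differ from the assumptions made here.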