We report on Just-in-Time catching test generation at Meta, designed to prevent bugs in large-scale backend systems comprising hundreds of millions of lines of code. Unlike traditional hardening tests, which pass at generation time, catching tests are meant to fail, surfacing bugs before code lands. The primary challenge is to reduce development drag from false-positive test failures. Analyzing 22,126 generated tests, we show that code-change-aware methods improve candidate catch generation 4x over hardening tests and 20x over coincidentally failing tests. To address false positives, we use rule-based and LLM-based assessors, which reduce the human review load by 70%. Inferential statistical analysis showed that human-accepted code changes are assessed to have significantly more false positives, while human-rejected changes have significantly more true positives. We reported 41 candidate catches to engineers; 8 were confirmed to be true positives, 4 of which would have led to serious failures had they remained uncaught. Overall, our results show that Just-in-Time catching is scalable, industrially applicable, and prevents serious failures from reaching production.