Chaos Engineering (CE) has emerged as a proactive method for improving the resilience of modern distributed systems, particularly in DevOps environments. Originally pioneered by Netflix, CE simulates real-world failures to expose weaknesses before they impact production. In this paper, we present a systematic gray literature review that investigates how industry practitioners have adopted and adapted CE principles in recent years. Analyzing 50 sources published between 2019 and early 2024, we develop a comprehensive classification framework that extends the foundational CE principles into ten distinct concepts. Our study reveals that, while the core tenets of CE remain influential, practitioners increasingly emphasize controlled experimentation, automation, and risk mitigation strategies to meet the demands of agile, continuously evolving DevOps pipelines. Our results improve the understanding of how CE is conceived and implemented in practice, and offer guidance for future research and industrial applications aimed at improving system robustness in dynamic production environments.