As modern microservice systems grow in popularity and complexity, often comprising hundreds or even thousands of fine-grained, interdependent components, they become more susceptible to frequent and subtle failures. Ensuring system reliability therefore hinges on accurate and efficient failure localization. Traditional failure localization approaches based on small models lack the flexibility to adapt to diverse failure scenarios, while recent LLM-based methods suffer from two major limitations: they often rely on rigid invocation workflows that constrain the model's ability to dynamically explore optimal localization paths, and they require resource-intensive inference, which makes them cost-prohibitive for real-world deployment. To address these challenges, we explore reinforcement fine-tuning to equip lightweight LLMs with reasoning and self-refinement capabilities, significantly improving the cost-effectiveness and adaptability of LLM-based failure localization. We begin with an empirical study that identifies three key capabilities essential for accurate localization. Building on these insights, we propose a progressive multi-stage Group Relative Policy Optimization (GRPO) fine-tuning framework that integrates a multi-factor failure localization grader and a recursion-of-thought actor module. The resulting model, ThinkFL, not only outperforms existing state-of-the-art LLMs and baseline methods in localization accuracy but also reduces end-to-end localization latency from minutes to seconds, demonstrating strong potential for real-world applications.