Serverless computing has become a popular paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications make it challenging to ensure system availability and performance. Many root cause analysis (RCA) methods exist for microservice systems, but they are not suited to modeling serverless applications precisely, for two reasons: (1) Compared with microservices, serverless applications are highly dynamic. They have short lifecycles and generate only instantaneous, pulse-like data, lacking long-term continuous information. (2) Existing methods focus solely on the running stage and overlook the other stages, so they fail to cover the entire lifecycle of a serverless application. To address these limitations, we propose FaaSRCA, a full-lifecycle root cause analysis method for serverless applications. It integrates multi-modal observability data generated on both the platform side and the application side by means of a Global Call Graph. We train a Graph Attention Network (GAT) based graph auto-encoder to compute reconstruction scores for the nodes in the Global Call Graph. Based on these scores, we determine the root cause at the granularity of the lifecycle stage of a serverless function. We evaluate FaaSRCA on two serverless benchmarks; the results show that it outperforms the baseline methods, with top-k precision improvements ranging from 21.25% to 81.63%.
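The final ranking step described in the abstract (score each node of the call graph by how poorly it is reconstructed, then report the highest-scoring lifecycle stages) can be illustrated with a minimal sketch. This is not FaaSRCA's implementation: the GAT-based auto-encoder is not shown, its output is stood in for by hand-written vectors, and mean squared error is only one plausible choice of reconstruction score. The function names, stage labels ("init", "run"), and feature values are all hypothetical.

```python
from math import fsum


def reconstruction_scores(original, reconstructed):
    """Return the mean squared reconstruction error for every node.

    Both arguments map a node key, here a hypothetical
    (function_name, lifecycle_stage) tuple, to a feature vector.
    """
    scores = {}
    for node, features in original.items():
        rebuilt = reconstructed[node]
        if len(rebuilt) != len(features):
            raise ValueError(f"feature length mismatch for node {node!r}")
        squared_errors = ((a - b) ** 2 for a, b in zip(features, rebuilt))
        scores[node] = fsum(squared_errors) / len(features)
    return scores


def rank_root_causes(scores, k=3):
    """Return the k nodes with the highest score, ties broken by node key."""
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


# Hypothetical node features observed during an incident.
original = {
    ("resize", "init"): [0.2, 0.1],
    ("resize", "run"): [0.5, 0.4],
    ("upload", "run"): [0.3, 0.3],
}

# Hypothetical output of an auto-encoder trained on normal behaviour.
# A node that deviates from normal behaviour is reconstructed poorly.
reconstructed = {
    ("resize", "init"): [0.9, 0.8],
    ("resize", "run"): [0.5, 0.5],
    ("upload", "run"): [0.3, 0.3],
}

scores = reconstruction_scores(original, reconstructed)
candidates = rank_root_causes(scores, k=2)
print(candidates)
```

In this toy input the initialization stage of the hypothetical `resize` function is reconstructed worst, so it is ranked first. Keying nodes by (function, stage) rather than by function alone is what lets the ranking point at a lifecycle stage instead of a whole function.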