Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football and video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies services under-represented in load tests using unsupervised node-level graph embeddings. Built on a graph convolutional network graph autoencoder (GCN-GAE), our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on the cosine similarity between load test and event embeddings. The system identifies services tied to documented incidents and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation, which shows promising precision (96%) and a low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and future directions, providing a foundation for broader application across microservice ecosystems.
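The flagging step described above, comparing each service's load test embedding with its event embedding by cosine similarity, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embeddings have already been produced by some encoder and live in a shared space, with row i of both matrices referring to the same service. The function names, the service names, and the 0.5 threshold are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarity_per_node(load_emb, event_emb, eps=1e-12):
    """Row-wise cosine similarity between two (n_services, dim) matrices."""
    load_emb = np.asarray(load_emb, dtype=float)
    event_emb = np.asarray(event_emb, dtype=float)
    dots = np.sum(load_emb * event_emb, axis=1)
    norms = np.linalg.norm(load_emb, axis=1) * np.linalg.norm(event_emb, axis=1)
    # Guard against division by zero for all-zero embeddings.
    return dots / np.maximum(norms, eps)


def flag_services(names, load_emb, event_emb, threshold=0.5):
    """Return services whose similarity falls below the (assumed) threshold."""
    sims = cosine_similarity_per_node(load_emb, event_emb)
    return [name for name, sim in zip(names, sims) if sim < threshold]


# Toy 2-dimensional embeddings for three hypothetical services.
names = ["playback", "catalog", "auth"]
load_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
event_emb = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]

print(flag_services(names, load_emb, event_emb))
```

In this toy input only the "catalog" service has orthogonal embeddings across the two conditions, so it is the only one flagged. In practice the threshold would need to be tuned, for example against injected synthetic anomalies.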