Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud: it records causally ordered, end-to-end workflows of request executions. However, tracing production workloads can introduce significant overhead, because identifying performance variations requires extensive instrumentation. This paper addresses the trade-off between the cost of tracing and the utility of the "spans" within each trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on a technique we developed that combines online Bayesian learning with a multi-armed bandit framework. This formulation enables Astraea to steer tracing toward the instrumentation that is most useful for accurate performance diagnosis. Astraea localizes performance variations using only 10-28% of available instrumentation, markedly reducing tracing overhead, storage and compute costs, and trace analysis time.
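To make the combination of online Bayesian learning and a multi-armed bandit more concrete, the following is a minimal sketch using Beta-Bernoulli Thompson sampling, one common way to pair the two ideas. It is not Astraea's actual algorithm, which the abstract does not specify. The span names, the `true_utility` probabilities, the `budget` parameter, and the binary "useful" reward are all hypothetical and chosen only for illustration.

```python
import random


class SpanBandit:
    """Beta-Bernoulli Thompson sampling over candidate spans (illustrative only)."""

    def __init__(self, span_names, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: no initial belief about any span's usefulness.
        self.alpha = {s: 1.0 for s in span_names}
        self.beta = {s: 1.0 for s in span_names}

    def choose(self, budget):
        """Draw one usefulness sample per span and enable the top `budget` spans."""
        draws = {s: self.rng.betavariate(self.alpha[s], self.beta[s])
                 for s in self.alpha}
        return sorted(draws, key=draws.get, reverse=True)[:budget]

    def update(self, span, useful):
        """Bayesian posterior update from one binary observation."""
        if useful:
            self.alpha[span] += 1.0
        else:
            self.beta[span] += 1.0

    def posterior_mean(self, span):
        return self.alpha[span] / (self.alpha[span] + self.beta[span])


def simulate(rounds=2000, budget=2, seed=0):
    # Hypothetical ground truth: the probability that enabling a span helps
    # explain a performance variation. Unknown to the bandit.
    true_utility = {"frontend": 0.05, "auth": 0.10, "db_query": 0.80,
                    "cache": 0.15, "render": 0.05}
    env = random.Random(seed + 1)
    bandit = SpanBandit(true_utility, seed=seed)
    enabled_counts = {s: 0 for s in true_utility}
    for _ in range(rounds):
        # Only `budget` spans are traced per round; the rest stay disabled.
        for span in bandit.choose(budget):
            enabled_counts[span] += 1
            bandit.update(span, env.random() < true_utility[span])
    return bandit, enabled_counts


if __name__ == "__main__":
    bandit, counts = simulate()
    for span, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{span:10s} enabled {n:5d} times, "
              f"posterior mean {bandit.posterior_mean(span):.2f}")
```

In this toy setting the sampler quickly concentrates its tracing budget on the span that most often explains a variation, while still occasionally exploring the others. That is the general behavior the abstract describes as steering tracing toward useful instrumentation.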