Causality in distributed systems has long been studied, and many approaches use it to trace distributed system execution. Traditional approaches relied on system profiling, while newer ones profile the systems' clocks to detect failures and reconstruct the timelines of events that led to them. Since the advent of logical clocks, these profiles have become increasingly accurate, with ways to characterize concurrency and distribution and with accurate message-passing diagrams. Vector clocks address the shortcomings of traditional logical clocks by also storing information about the other processes in the system. Hybrid vector clocks are a novel approach in which a clock need not store information about every process; instead, it stores information only about processes within an acceptable skew of the process in focus. This gives an efficient way of profiling at substantially reduced cost to the system. Building on this idea, we propose constructing causal traces from the information generated by the hybrid vector clock. The hybrid vector clock provides a strong sense of concurrency and distribution, and we hypothesize that the information it generates is sufficient to build a causal trace for debugging. We post-process and parse the clocks recorded in an execution trace to render a swimlane diagram in a web interface that traces the points of failure of a distributed system. We also provide an API so that the approach can be reused with any generic distributed system framework.
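As a rough illustration of how clock values can expose causality and concurrency, the sketch below compares two classic vector clocks. It is not the hybrid vector clock described above: the sparse dictionary representation, the `compare` function, and the process names `p1`, `p2`, `p3` are all assumptions made for this example, and a missing entry is simply treated as zero.

```python
from typing import Dict

# Hypothetical representation: process id -> logical counter.
# A process absent from the dictionary is treated as counter 0.
Clock = Dict[str, int]


def compare(a: Clock, b: Clock) -> str:
    """Classify the causal relation between two vector clocks.

    Returns "before" if a causally precedes b, "after" if b precedes a,
    "equal" if they carry the same counters, and "concurrent" otherwise.
    """
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Each clock is ahead of the other in at least one component.
    return "concurrent"


# p1 sends a message; p2 receives it and merges the sender's clock.
send = {"p1": 1}
recv = {"p1": 1, "p2": 1}
# p3 performs an unrelated local event.
other = {"p3": 1}

print(compare(send, recv))   # send causally precedes recv
print(compare(send, other))  # no causal order between the two
```

A swimlane view could, under these assumptions, draw an arrow for each "before" pair that corresponds to a message and place "concurrent" events in separate lanes without an ordering between them.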