In this work, we focus on the problem of replay clocks (RepCL). The need for replay clocks arises from the observation that analyzing a distributed computation for all properties of interest may not be feasible in an online environment. Such properties can instead be analyzed by replaying the computation. However, to be beneficial, such a replay must account for all the uncertainty that is possible in a distributed computation. Specifically, if event 'e' must occur before 'f', then the replay clock must ensure that 'e' is replayed before 'f'. On the other hand, if 'e' and 'f' could occur in any order, then the replay should not force an order between them. After identifying the limitations of existing clocks in providing the replay primitive, we present RepCL and identify an efficient representation for it. We demonstrate that RepCL can be implemented with fewer than four integers for 64 processes for various system parameters if clocks are synchronized within 1 ms. Furthermore, the overhead of RepCL (for computing and comparing timestamps, and for message size) is proportional to the size of the clock. Using simulations, we identify the expected overhead of RepCL based on the given system settings. We also show how a user can identify the feasibility region for RepCL. Specifically, given the desired overhead of RepCL, the user can identify the region where unabridged replay is possible.
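The replay requirement stated above can be illustrated with a minimal sketch. It uses plain vector timestamps, which are not RepCL and do not reflect its representation; the function names `happened_before` and `replay_constraint` are hypothetical and chosen only for this illustration. The sketch shows the two cases: an order that replay must preserve, and a pair of events for which replay must not force an order.

```python
from typing import Sequence


def happened_before(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if vector timestamp a causally precedes b.

    Assumption: both timestamps come from the same set of processes,
    so they have the same length.
    """
    if len(a) != len(b):
        raise ValueError("timestamps must have the same number of entries")
    # a precedes b if every entry is <= and at least one entry is strictly <.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def replay_constraint(e: Sequence[int], f: Sequence[int]) -> str:
    """Classify the ordering a replay has to respect for events e and f."""
    if happened_before(e, f):
        return "e before f"
    if happened_before(f, e):
        return "f before e"
    # Neither precedes the other: the replay should not force an order.
    return "any order"


if __name__ == "__main__":
    print(replay_constraint([1, 0], [2, 1]))
    print(replay_constraint([1, 0], [0, 1]))
```

A vector timestamp needs one entry per process (64 entries for 64 processes), which is the kind of cost that the compact representation described in the abstract aims to avoid.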