We study the problem of monitoring distributed systems where computers communicate using message passing and share an almost synchronized clock. This is a realistic scenario for networks where the speed of the monitoring is sufficiently slow (at the human scale) to permit efficient clock synchronization, where the clock deviations is small compared to the monitoring cycles. This is the case when monitoring human systems in wide area networks, the Internet or including large deployments. More concretely, we study how to monitor decentralized systems where monitors are expressed as stream runtime verification specifications, under a timed asynchronous network. Our monitors communicate using the network, where messages can take arbitrarily long but cannot be duplicated or lost. This communication setting is common in many cyber-physical systems like smart buildings and ambient living. Previous approaches to decentralized monitoring were limited to synchronous networks, which are not easily implemented in practice because of network failures. Even when networks failures are unusual, they can require several monitoring cycles to be repaired. In this work we propose a solution to the timed asynchronous monitoring problem and show that this problem generalizes the synchronous case. We study the specifications and conditions on the network behavior that allow the monitoring to take place with bounded resources, independently of the trace length. Finally, we report the results of an empirical evaluation of an implementation and verify the theoretical results in terms of effectiveness and efficiency.
翻译:我们研究了分布式系统的监测问题,其中计算机通过消息传递进行通信并共享近似同步的时钟。这是网络中的实际场景:监测速度足够慢(以人类时间尺度衡量),使得时钟同步能够高效实现,且时钟偏差相对于监测周期而言较小。这种场景在广域网、互联网或大规模部署中监测人类系统时普遍存在。具体而言,我们研究了如何在定时异步网络环境下,对以流运行时验证规约形式表述的去中心化系统进行监测。监测器通过该网络通信,消息传输可能任意延迟,但不可重复或丢失。这种通信模式常见于智能建筑、环境辅助生活等众多信息物理系统中。以往的去中心化监测方法局限于同步网络,而同步网络因网络故障在实践中难以实现——即便网络故障不频繁,也可能需要多个监测周期才能修复。本文提出了一种解决定时异步监测问题的方法,并证明该问题是同步监测问题的泛化形式。我们研究了允许在有限资源下(与跟踪长度无关)进行监测的网络行为规约与条件。最后,报告了实现的实证评估结果,并从有效性和效率方面验证了理论结论。