We study the problem of monitoring distributed systems in which computers communicate by message passing and share an almost synchronized clock. This is a realistic scenario for networks where monitoring is slow enough (at the human scale) to permit efficient clock synchronization, and where clock deviations are small compared to the monitoring cycles. This is the case when monitoring human systems in wide area networks, the Internet, or large deployments. More concretely, we study how to monitor decentralized systems where monitors are expressed as stream runtime verification specifications, under a timed asynchronous network. Our monitors communicate over the network, where messages can take arbitrarily long to arrive but cannot be duplicated or lost. This communication setting is common in many cyber-physical systems like smart buildings and ambient living. Previous approaches to decentralized monitoring were limited to synchronous networks, which are not easily implemented in practice because of network failures. Even when network failures are unusual, they can require several monitoring cycles to be repaired. In this work we propose a solution to the timed asynchronous monitoring problem and show that this problem generalizes the synchronous case. We study the specifications and conditions on the network behavior that allow the monitoring to take place with bounded resources, independently of the trace length. Finally, we report the results of an empirical evaluation of an implementation and verify the theoretical results in terms of effectiveness and efficiency.