Applications that require real-time processing of large volumes of data have been the main driver for rethinking the traditional cloud, giving rise to novel cloud models. Distributed cloud (DC) is a model that allows users to dynamically create and dispose of strategically located ad-hoc clouds containing the resources best tailored to their needs. To be viable in real-world scenarios, this model must provide a high degree of observability. In this paper, we present the design and implementation of a monitoring system that collects metrics from DCs and makes them accessible to diverse clients. Agents running on nodes collect machine-, container-, and application-level metrics. During the health-check protocol, these metrics are transferred from each node to the DC's control plane running inside the cloud, where they are persisted and served via multiple APIs, including a streaming API. Moreover, node metrics are aggregated for every DC to provide a more comprehensive view of the system's state.
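
To make the data flow concrete, the following is a minimal sketch of how metrics carried in health-check messages might be stored by a control plane and aggregated per DC. It is an illustration under stated assumptions, not the system's actual implementation: the names `HealthCheck`, `ControlPlane`, `receive`, and `aggregate`, the metric keys, the in-memory storage, and the choice of a per-metric mean as the aggregation function are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class HealthCheck:
    """Hypothetical health-check message that also carries a node's metrics."""
    dc_id: str
    node_id: str
    metrics: dict[str, float]  # assumed keys, e.g. {"cpu_pct": 40.0}


@dataclass
class ControlPlane:
    """Toy stand-in for a DC control plane; keeps the latest sample per node."""
    latest: dict[str, dict[str, dict[str, float]]] = field(
        default_factory=lambda: defaultdict(dict)
    )

    def receive(self, msg: HealthCheck) -> None:
        # "Persist" (here: in memory only) the newest metrics from the node.
        self.latest[msg.dc_id][msg.node_id] = dict(msg.metrics)

    def aggregate(self, dc_id: str) -> dict[str, float]:
        # Assumption: aggregate by averaging each metric over the nodes
        # of one DC that reported it.
        by_name: dict[str, list[float]] = defaultdict(list)
        for node_metrics in self.latest.get(dc_id, {}).values():
            for name, value in node_metrics.items():
                by_name[name].append(value)
        return {name: mean(values) for name, values in by_name.items()}


cp = ControlPlane()
cp.receive(HealthCheck("dc-1", "node-a", {"cpu_pct": 40.0, "mem_pct": 60.0}))
cp.receive(HealthCheck("dc-1", "node-b", {"cpu_pct": 60.0, "mem_pct": 80.0}))
print(cp.aggregate("dc-1"))
```

A real deployment would replace the in-memory dictionary with durable storage and would likely offer more than one aggregation function; the sketch only shows the shape of the node-to-control-plane-to-aggregate path.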