Cloudprofiler: TSC-based inter-node profiling and high-throughput data ingestion for cloud streaming workloads

To conduct real-time analytics computations, big data stream processing engines are required to process unbounded data streams at millions of events per second. However, current streaming engines exhibit low throughput and high tuple processing latency. Performance engineering is complicated by the fact that streaming engines constitute complex distributed systems consisting of multiple nodes in the cloud. A profiling technique is required that is capable of measuring time durations at high accuracy across nodes. Standard clock synchronization techniques such as the network time protocol (NTP) are limited to millisecond accuracy, and hence cannot be used. We propose a profiling technique that relates the time-stamp counters (TSCs) of nodes to measure the duration of events in a streaming framework. The precision of the TSC relation determines the accuracy of the measured duration. The TSC relation is conducted in quiescent periods of the network to achieve accuracy in the tens of microseconds. We propose a throughput-controlled data generator to reliably determine the sustainable throughput of a streaming engine. To facilitate high-throughput data ingestion, we propose a concurrent object factory that moves the deserialization overhead of incoming data tuples off the critical path of the streaming framework. The evaluation of the proposed techniques within the Apache Storm streaming framework on the Google Compute Engine public cloud shows that data ingestion increases from $700$ $\text{k}$ to $4.68$ $\text{M}$ tuples per second, and that time durations can be profiled at a measurement accuracy of $92$ $\mu\text{s}$, which is three orders of magnitude higher than the accuracy of NTP, and one order of magnitude higher than prior work.

翻译：为了实现实时分析计算，大数据流处理引擎需要以每秒数百万事件的速度处理无界数据流。然而，当前流处理引擎存在吞吐量低、元组处理延迟高的问题。流处理引擎作为由云中多个节点组成的复杂分布式系统，使得性能工程变得复杂。因此，需要一种能够在节点间以高精度测量时间间隔的性能分析技术。标准时钟同步技术（如网络时间协议NTP）受限于毫秒级精度，因此无法使用。我们提出了一种性能分析技术，通过关联节点的时间戳计数器（TSC）来测量流处理框架中事件的持续时间。TSC关联的精度决定了测量持续时间的准确性。该TSC关联在网络静默期进行，以实现数十微秒级的精度。我们还提出了一种吞吐量受控的数据生成器，用于可靠地确定流处理引擎的可持续吞吐量。为促进高吞吐数据注入，我们提出了一种并发对象工厂，将输入数据元组的反序列化开销移出流处理框架的关键路径。在Google Compute Engine公有云上的Apache Storm流处理框架中评估所提技术的结果显示，数据注入从每秒$700$千个元组提升至$4.68$百万个元组，时间间隔测量精度可达$92$微秒，比NTP精度高三个数量级，并比先前工作高一个数量级。