Retrofitting Service Dependency Discovery in Distributed Systems

Modern distributed systems rely on complex networks of interconnected services, creating direct and indirect dependencies that can propagate faults and cause cascading failures. A service dependency graph is therefore highly useful for localizing the root cause of performance degradation in these environments. However, building an accurate graph is hindered by complex routing techniques such as Network Address Translation (NAT), an essential mechanism for connecting services across networks. NAT obscures the actual hosts running the services, so existing run-time approaches that passively observe network metadata fail to infer service dependencies accurately. To this end, this paper introduces XXXX, a novel run-time system for constructing process-level service dependency graphs. It operates without source code instrumentation and remains resilient under complex network routing mechanisms, including NAT. XXXX implements a non-disruptive method of injecting metadata into a TCP packet's header that maintains protocol correctness across host boundaries. In particular, if no receiving agent is present, the instrumentation leaves existing TCP connections unaffected, so operation remains non-disruptive when the system is only partially deployed across hosts. We evaluated XXXX extensively against three state-of-the-art systems across nine scenarios, combining three network configurations (NAT-free, internal-NAT, external-NAT) with three microservice benchmarks. XXXX was the only approach that performed consistently across network configurations. With regard to correctness, it performed on par with, or better than, the state of the art, with precision and recall values of 100% in the majority of the scenarios.
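To make the correctness metrics concrete, the following minimal sketch computes precision and recall over the edges of an inferred dependency graph. It is an illustration only, not the paper's evaluation code: the service names, the `nat-gateway` placeholder, and the `precision_recall` helper are all hypothetical, and returning 1.0 for an empty set is a convention chosen here. The example assumes that a passive observer behind NAT sees the gateway's address in place of the real caller.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); names below are hypothetical


def precision_recall(inferred: Set[Edge], truth: Set[Edge]) -> Tuple[float, float]:
    """Compare inferred dependency edges against ground-truth edges."""
    true_positives = len(inferred & truth)
    # Convention assumed here: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(inferred) if inferred else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical ground-truth dependencies between three services and a database.
truth = {("frontend", "cart"), ("frontend", "catalog"), ("cart", "db")}

# Assumed passive view behind NAT: the real caller is replaced by the gateway,
# so two of the three edges are attributed to the wrong source.
nat_view = {("nat-gateway", "cart"), ("nat-gateway", "catalog"), ("cart", "db")}

nat_precision, nat_recall = precision_recall(nat_view, truth)
exact_precision, exact_recall = precision_recall(truth, truth)
print(nat_precision, nat_recall)
print(exact_precision, exact_recall)
```

In this toy case only the `("cart", "db")` edge survives address translation, so both metrics fall to one third, while a graph that recovers every true caller scores 100% on both.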


