When physical testbeds are out of reach for evaluating a networked system, we frequently turn to simulation. In today's datacenter networks, bottlenecks rarely lie at the network protocol level and instead arise in end-host software or hardware components, so current protocol-level simulations are an inadequate means of evaluation. End-to-end simulations covering these components, on the other hand, simply cannot achieve the required scale with feasible simulation performance and computational resources. In this paper, we address this gap with SplitSim, a simulation framework for end-to-end evaluation of large-scale network and distributed systems. To this end, SplitSim builds on prior work on modular end-to-end simulation and combines it with key elements to achieve scalability. First, mixed-fidelity simulations judiciously reduce detail in the simulation of parts of the system where this can be tolerated, while retaining the necessary detail elsewhere. Second, SplitSim parallelizes bottleneck simulators by decomposing them into multiple parallel but synchronized processes. Third, SplitSim provides a profiler that helps users understand simulation performance and locate bottlenecks, so they can adjust the configuration accordingly. Finally, SplitSim provides abstractions that make it easy for users to build complex large-scale simulations. Our evaluation demonstrates SplitSim in multiple large-scale case studies.