Metadata updates in distributed filesystems are typically synchronous. This creates inherent challenges for access efficiency, load balancing, and directory contention, especially under dynamic and skewed workloads. This paper argues that synchronous updates are overly conservative. We propose SwitchFS, a distributed filesystem with asynchronous metadata updates: operations return early and directory updates are deferred until the directory is read, which both hides latency and amortizes overhead. The key challenge is to preserve the synchronous POSIX semantics of metadata updates efficiently. To address this, SwitchFS is co-designed with a programmable switch and uses the limited on-switch resources to track directory states with negligible overhead. This lets SwitchFS aggregate delayed updates and apply them efficiently, using batching and consolidation, before a directory is read. Evaluation shows that, under skewed workloads, SwitchFS achieves up to 13.34$\times$ and 3.85$\times$ higher throughput, and 61.6% and 57.3% lower latency, than two state-of-the-art distributed filesystems, Emulated-InfiniFS and Emulated-CFS, respectively. For real-world workloads, SwitchFS improves end-to-end throughput by 21.1$\times$, 1.1$\times$, and 0.3$\times$ over CephFS, Emulated-InfiniFS, and Emulated-CFS, respectively.
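
The deferral idea in the abstract can be illustrated with a toy, single-process sketch. This is not the SwitchFS implementation: the class `DeferredDirectoryStore`, its method names, and the `dirty` set are hypothetical, and the on-switch directory tracking is modeled here by an ordinary Python set. Creates and unlinks return immediately and only queue an update for the parent directory; a later directory read consolidates the queued updates and applies them in one batch, so the reader still observes every earlier operation. Error checking (for example, rejecting duplicate names) is omitted.

```python
from collections import defaultdict


class DeferredDirectoryStore:
    """Toy single-process model of deferred directory updates (illustrative only)."""

    def __init__(self):
        self.entries = defaultdict(set)    # directory -> names applied so far
        self.pending = defaultdict(list)   # directory -> queued (op, name) updates
        self.dirty = set()                 # assumption: stand-in for on-switch directory state
        self.batches_applied = 0

    def create(self, directory, name):
        # Return right away; the parent directory is updated later.
        self.pending[directory].append(("add", name))
        self.dirty.add(directory)

    def unlink(self, directory, name):
        self.pending[directory].append(("del", name))
        self.dirty.add(directory)

    def _flush(self, directory):
        # Consolidate: only the last queued operation per name matters.
        final = {}
        for op, name in self.pending.pop(directory, []):
            final[name] = op
        # Apply the consolidated updates as a single batch.
        for name, op in final.items():
            if op == "add":
                self.entries[directory].add(name)
            else:
                self.entries[directory].discard(name)
        self.dirty.discard(directory)
        self.batches_applied += 1

    def readdir(self, directory):
        # A read must observe every earlier update, so flush first if needed.
        if directory in self.dirty:
            self._flush(directory)
        return sorted(self.entries[directory])


store = DeferredDirectoryStore()
store.create("/logs", "a.txt")
store.create("/logs", "b.txt")
store.unlink("/logs", "a.txt")
print(store.readdir("/logs"))
```

In this sketch, three metadata operations cost one directory update, and the create and unlink of `a.txt` cancel out during consolidation. A real system would also have to handle concurrency, failures, and distribution across metadata servers, none of which this model attempts.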