Laminar is the first TCP stack designed for the reconfigurable match-action table (RMT) architecture, which is widely used in high-speed programmable switches and SmartNICs. Laminar reimagines TCP processing as a pipeline of simple match-action operations, enabling line-rate performance with low latency and minimal energy consumption while maintaining compatibility with standard TCP and POSIX sockets. Leveraging novel techniques such as optimistic concurrency, pseudo segment updates, and bump-in-the-wire processing, Laminar handles the transport logic, including retransmission, reassembly, flow control, and congestion control, entirely within the RMT pipeline. We prototype Laminar on an Intel Tofino2 switch and demonstrate its scalability to terabit speeds, its flexibility, and its robustness to network dynamics. Laminar delivers RDMA-equivalent performance: with short RPC workloads, it saves up to 16 host CPU cores versus the TAS kernel-bypass TCP stack while achieving 1.3$\times$ higher peak throughput at 5$\times$ lower 99.99th-percentile tail latency. At scale, Laminar drives nearly $1$ Bpps of TCP processing while keeping RPC tail latency near $20\,\mu$s. For streaming workloads, Laminar achieves $25$ Mpps per core, enough to saturate line rate. It significantly benefits real applications: a key-value store on Laminar doubles throughput per watt while maintaining a 99.99th-percentile tail latency lower than TAS's best-case tail latency, and SPDK's NVMe-oTCP reaches RDMA-level efficiency. To demonstrate Laminar's flexibility, we implement TCP stack extensions, including a sequencer API for a linearizable distributed shared log, Timely congestion control, and delayed ACKs. Finally, Laminar generalizes to FPGA SmartNICs, delivering $3\times$ ToNIC's packet rate under equal timing.