Scaling Data Center TCP to Terabits with Laminar

We present Laminar, the first TCP stack that delivers ASIC-class performance and energy efficiency on programmable Reconfigurable Match-Action Table (RMT) pipelines, providing flexibility while retaining standard TCP semantics and POSIX socket compatibility. The key challenge for Laminar is reconciling TCP's complex, dependent state updates with RMT's unidirectional, lock-step execution model. To overcome this challenge, Laminar introduces three novel techniques: optimistic concurrency (speculative updates validated downstream), pseudo-segment injection (circular dependency resolution without stalls), and bump-in-the-wire processing (single-pass segment handling). Together, these techniques express TCP processing, including retransmission, reassembly, flow control, and congestion control, as a pipeline of simple match-action operations. Our Intel Tofino 2 prototype demonstrates Laminar's scalability to terabit speeds, its flexibility, and its robustness to network dynamics. Laminar matches RDMA performance and efficiency for both RPC and streaming workloads (including NVMe-oF with SPDK), while maintaining TCP/POSIX compatibility. Laminar saves up to 16 host CPU cores versus state-of-the-art kernel-bypass TCP, while achieving 5$\times$ lower 99.99th-percentile tail latency and 2$\times$ better throughput-per-watt for key-value stores. At scale, Laminar drives nearly 1 billion packets per second (Bpps) at 20 $\mu$s RPC tail latency. Unlike fixed-function offloads, Laminar supports transport evolution through in-data-path extensions (selective ACKs, congestion control variants, and application co-design for shared logs). Finally, Laminar generalizes to FPGA SmartNICs, outperforming ToNIC's monolithic design by 3$\times$ under equal timing.
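
To make the phrase "speculative updates validated downstream" concrete, the following is a minimal toy sketch of optimistic concurrency in a two-stage, one-directional pipeline. It is an illustration under stated assumptions and not Laminar's actual design: the names `Segment`, `ToyPipeline`, `stage1_speculate`, and `stage2_validate` are hypothetical, the only tracked state is a next-expected sequence number, and the repair step is modeled as a plain assignment rather than any mechanism described in the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int     # sequence number of the first payload byte
    length: int  # payload length in bytes


class ToyPipeline:
    """Hypothetical two-stage pipeline; an illustration, not Laminar's design."""

    def __init__(self, initial_seq: int = 0) -> None:
        self.spec_next = initial_seq       # stage 1 state: speculative next expected seq
        self.committed_next = initial_seq  # stage 2 state: validated next expected seq

    def stage1_speculate(self, seg: Segment) -> Segment:
        # Optimistically assume the segment will be accepted and advance
        # the speculative state without waiting for validation.
        self.spec_next = seg.seq + seg.length
        return seg

    def stage2_validate(self, seg: Segment) -> bool:
        # Downstream check: the segment is valid only if it is in order
        # with respect to the committed state.
        if seg.seq == self.committed_next:
            self.committed_next = seg.seq + seg.length
            return True
        return False

    def process(self, seg: Segment) -> bool:
        accepted = self.stage2_validate(self.stage1_speculate(seg))
        if not accepted:
            # Mis-speculation: roll the speculative state back to the last
            # validated value (assumed repair step for this toy model).
            self.spec_next = self.committed_next
        return accepted


pipe = ToyPipeline()
in_order = pipe.process(Segment(seq=0, length=100))        # speculation holds
out_of_order = pipe.process(Segment(seq=300, length=100))  # speculation fails, state repaired
```

In this sketch the early stage never stalls waiting for the later one; the cost of a wrong guess is paid only on the out-of-order path, which is the general trade-off that optimistic concurrency makes.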