In modern containerized cloud environments, RDMA (Remote Direct Memory Access) is increasingly adopted to reduce CPU overhead and enable high-performance data exchange. Doing so safely requires strong performance isolation, so that one container's RDMA workload does not degrade the performance of others and critical security assurances are maintained. However, existing isolation techniques are difficult to apply effectively because of the complexity of microarchitectural resource management within RDMA NICs (RNICs). This paper experimentally analyzes two types of resource exhaustion attacks on NVIDIA BlueField-3: (i) state saturation attacks and (ii) pipeline saturation attacks. Our results show that state saturation attacks can cause up to a 93.9% loss in bandwidth, a 1,117x increase in latency, and a 115% rise in cache misses for victim containers. Pipeline saturation attacks lead to severe link-level congestion and significant amplification, in which small verb requests result in disproportionately high resource consumption. To mitigate these threats and restore predictable security assurances, we propose HT-Verbs, a threshold-driven framework based on real-time per-container RDMA verb telemetry and adaptive resource classification. HT-Verbs partitions RNIC resources into hot, warm, and cold tiers and throttles abusive workloads without requiring hardware modifications.
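The threshold-driven tiering described in the abstract can be illustrated with a minimal sketch. The abstract does not give HT-Verbs' telemetry format, threshold values, or throttling mechanism, so every name and number below (`VerbSample`, `WARM_THRESHOLD`, `HOT_THRESHOLD`, `plan`) is a hypothetical stand-in. The sketch also uses fixed thresholds and tiers containers by verb rate, whereas the paper describes adaptive classification of RNIC resources.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical thresholds in verbs per second; the abstract gives no values.
WARM_THRESHOLD = 10_000
HOT_THRESHOLD = 100_000


@dataclass
class VerbSample:
    """One telemetry reading for a container (assumed shape, not from the paper)."""
    container_id: str
    verbs_posted: int       # verbs observed during the sampling window
    window_seconds: float   # length of the sampling window


def verb_rate(sample: VerbSample) -> float:
    """Convert a raw verb count into a per-second rate."""
    if sample.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return sample.verbs_posted / sample.window_seconds


def classify(rate: float,
             warm: float = WARM_THRESHOLD,
             hot: float = HOT_THRESHOLD) -> str:
    """Map a verb rate to a tier using two static thresholds."""
    if rate >= hot:
        return "hot"
    if rate >= warm:
        return "warm"
    return "cold"


def throttle_limit(tier: str, hot_cap: float = HOT_THRESHOLD) -> Optional[float]:
    """Return a rate cap for hot workloads, or None when no throttling applies."""
    return float(hot_cap) if tier == "hot" else None


def plan(samples: Iterable[VerbSample]) -> Dict[str, Tuple[str, Optional[float]]]:
    """Produce a (tier, rate cap) decision for each container."""
    decisions = {}
    for sample in samples:
        tier = classify(verb_rate(sample))
        decisions[sample.container_id] = (tier, throttle_limit(tier))
    return decisions
```

In this sketch only the hot tier is capped, and the cap is applied in software, which mirrors the abstract's claim that no hardware modification is needed. How the real framework enforces limits is not specified in the abstract.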