Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU is widely deployed because of its strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, improves Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, little empirical evidence quantifies the practical inference performance gap between these two generations under controlled, reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation that extends the GDEV-AI methodology to GPU inference. Under identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 in FP32, FP16, and INT8 precision modes with PyTorch and TensorRT. Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x higher throughput than CPU baselines. L4 achieves up to 4.4x higher throughput than T4 and reaches peak efficiency at smaller batch sizes (between 16 and 32), which improves the latency-throughput tradeoff for latency-sensitive workloads. T4 remains competitive for large-batch workloads where cost or power efficiency is important. DEEP-GAP provides practical guidance for selecting precision modes, batch sizes, and GPU architectures for modern inference deployments.
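The throughput, speedup, and batch-size figures quoted above can be derived from per-batch latency measurements. The following is a minimal sketch of that arithmetic, not the DEEP-GAP harness itself: the names `Measurement`, `speedup`, and `smallest_efficient_batch`, the 95% threshold, and all latency values are hypothetical placeholders chosen for illustration, not results from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One benchmark point: a batch size and its mean latency per batch (ms)."""
    batch_size: int
    latency_ms: float

    @property
    def throughput(self) -> float:
        """Images per second processed at this batch size."""
        return self.batch_size * 1000.0 / self.latency_ms


def speedup(candidate: Measurement, baseline: Measurement) -> float:
    """Throughput ratio of candidate over baseline (e.g. INT8 vs. FP32)."""
    return candidate.throughput / baseline.throughput


def smallest_efficient_batch(points, fraction=0.95):
    """Smallest batch size whose throughput reaches `fraction` of the peak.

    The 0.95 default is an assumed threshold, not one taken from the paper.
    """
    if not points:
        raise ValueError("at least one measurement is required")
    peak = max(p.throughput for p in points)
    for p in sorted(points, key=lambda p: p.batch_size):
        if p.throughput >= fraction * peak:
            return p.batch_size


# Placeholder latencies (ms) for a batch-size sweep on one hypothetical device.
sweep = [
    Measurement(1, 2.0),
    Measurement(8, 4.0),
    Measurement(16, 5.0),
    Measurement(32, 10.0),
    Measurement(64, 20.0),
]

for point in sweep:
    print(f"batch={point.batch_size:3d}  throughput={point.throughput:8.1f} img/s")

print("smallest batch near peak:", smallest_efficient_batch(sweep))
print("speedup:", speedup(Measurement(16, 5.0), Measurement(16, 20.0)))
```

A device whose throughput saturates at a smaller batch size can serve requests with lower per-batch latency at the same throughput, which is the tradeoff described for latency-sensitive workloads.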