Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, introduces improvements in Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, limited empirical evidence quantifies the practical inference performance gap between these two generations under controlled and reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation extending the GDEV-AI methodology to GPU inference. Using identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 across FP32, FP16, and INT8 precision modes using PyTorch and TensorRT. Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32, improving latency-throughput tradeoffs for latency-sensitive workloads. T4 remains competitive for large batch workloads where cost or power efficiency is important. DEEP-GAP provides practical guidance for selecting precision modes, batch sizes, and GPU architectures for modern inference deployments.