This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
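The hybrid precision strategy and the fidelity metrics described above can be illustrated with a minimal sketch. It is not the TensorRT pipeline: it emulates FP16 on the CPU with NumPy, and every name in it (`hybrid_block`, `baseline_block`, `cosine_similarity`, `nan_rate`) is hypothetical and chosen for illustration. It applies FP16 to a single linear layer, keeps softmax in FP32, and compares the result with an all-FP32 baseline. It does not reproduce the figures reported in this paper.

```python
import numpy as np


def softmax_fp32(x):
    """Numerically stable softmax, always computed in FP32."""
    x = x.astype(np.float32)
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max to avoid overflow in exp
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def linear(x, w, dtype):
    """Matrix multiply with inputs and weights cast to the given dtype."""
    return x.astype(dtype) @ w.astype(dtype)


def hybrid_block(x, w):
    """Hybrid precision: FP16 linear layer followed by FP32 softmax."""
    return softmax_fp32(linear(x, w, np.float16))


def baseline_block(x, w):
    """Reference path: FP32 linear layer followed by FP32 softmax."""
    return softmax_fp32(linear(x, w, np.float32))


def cosine_similarity(a, b):
    """Cosine similarity between two flattened output tensors."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def nan_rate(a):
    """Fraction of NaN entries in an output tensor."""
    return float(np.isnan(a).mean())


# Synthetic inputs (an assumption for illustration; not the paper's data).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = (rng.standard_normal((64, 64)) / 8.0).astype(np.float32)

hybrid = hybrid_block(x, w)
baseline = baseline_block(x, w)

print("cosine similarity:", cosine_similarity(hybrid, baseline))
print("NaN rate:", nan_rate(hybrid))
```

Casting back to FP32 before softmax keeps the exponentials and their sum in FP32 regardless of the precision used for the preceding matrix multiply, which is the property the hybrid strategy relies on.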