Vision-based Transformers have found wide application in the perception module of autonomous driving for predicting accurate 3D bounding boxes, owing to their strong capability in modeling long-range dependencies between visual features. However, Transformers, initially designed for language models, have mostly been optimized for accuracy rather than for the inference-time budget. For a safety-critical system like autonomous driving, real-time inference on the on-board compute is an absolute necessity. This keeps our object detection algorithm under a very tight run-time budget. In this paper, we evaluate a variety of strategies to reduce the inference time of vision-Transformer-based object detection methods while keeping a close watch on any performance variations. Our chosen metric for these strategies is accuracy-runtime joint optimization. Moreover, for actual inference-time analysis, we profile our strategies at float32 and float16 precision with the TensorRT module. This is the most common format used by industry for deploying machine learning networks on edge devices. We show that our strategies improve inference time by 63% at the cost of a performance drop of merely 3% for the problem statement defined in the evaluation section. These strategies bring the inference time of Vision Transformer detectors even below that of traditional single-image CNN detectors like FCOS. We recommend that practitioners use these techniques to deploy hefty Transformer-based multi-view networks on a budget-constrained robotic platform.