High-dimensional motion generation requires numerical precision to produce smooth, collision-free solutions. Typically, double-precision or single-precision floating-point (FP) formats are used. Applying these formats to large tensors strains the memory bandwidth of the device and increases the memory footprint, which limits their applicability on the low-power edge devices that mobile robots require. Uniformly reducing precision can be advantageous but severely degrades solution quality. We propose to accelerate motion generation by using reduced-precision data types for important tensors, thereby removing memory bottlenecks. We propose variable-precision (VaPr) search optimization, which determines the appropriate precision for large tensors from a vast search space of approximately 4 million unique combinations of FP data types across the tensors. To obtain the efficiency gains, we exploit existing platform support for an out-of-the-box GPU speedup and evaluate prospective precision converter units for GPU types that are not currently supported. Our experimental results on 800 planning problems for the Franka Panda robot, drawn from the MotionBenchmaker dataset across 8 environments, show that a 4-bit FP format is sufficient for the largest set of tensors in the motion generation stack. With the software-only solution, VaPr achieves average speedups of 6.3% and 6.3% for a significant portion of motion generation over the state-of-the-art (SOTA) solution, CuRobo, on the Jetson Orin and RTX2080 Ti GPUs, respectively, and speedups of 9.9% and 17.7% with the FP converter.
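To make the idea of per-tensor precision selection concrete, the following is a minimal sketch in Python. It is not the VaPr search itself. The format table, the simplified rounding model (IEEE-style bias, saturation instead of infinities), the error tolerance, and the names `quantize`, `pick_format`, and `search_space_size` are all assumptions introduced for illustration. The choice of 4 candidate formats over 11 tensors is likewise hypothetical; it merely shows how a combinatorial space of the same order as the roughly 4 million combinations mentioned above can arise.

```python
import math

# Hypothetical candidate formats: name -> (exponent bits, mantissa bits).
# These are illustrative and not necessarily the formats evaluated by VaPr.
FORMATS = {"fp4": (2, 1), "fp8": (4, 3), "fp16": (5, 10), "fp32": (8, 23)}


def quantize(x, exp_bits, man_bits):
    """Round x to the nearest value of a simplified FP format.

    Assumptions: IEEE-style bias, subnormals supported, top exponent
    reserved, and out-of-range values saturate to the largest finite value.
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    min_exp, max_exp = 1 - bias, bias
    exp = math.frexp(abs(x))[1] - 1            # abs(x) = 1.f * 2**exp
    step = 2.0 ** (max(exp, min_exp) - man_bits)  # spacing of representable values
    q = round(abs(x) / step) * step            # round half to even
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return math.copysign(min(q, max_val), x)


def pick_format(values, tol):
    """Return the narrowest format whose worst absolute error is <= tol.

    Returns None if no candidate format meets the tolerance.
    """
    by_width = sorted(FORMATS.items(), key=lambda kv: 1 + sum(kv[1]))
    for name, (e, m) in by_width:
        if max(abs(v - quantize(v, e, m)) for v in values) <= tol:
            return name
    return None


def search_space_size(num_tensors):
    """Number of ways to assign one candidate format to each tensor."""
    return len(FORMATS) ** num_tensors


if __name__ == "__main__":
    print(pick_format([0.5, 1.0, 1.5, 3.0], 0.0))
    print(pick_format([0.1, 0.2], 1e-3))
    print(search_space_size(11))
```

This sketch picks a format for each tensor independently from a local error bound. A search such as the one described in the abstract would instead have to judge combinations by their effect on the final motion plan, which is why the size of the joint search space matters.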