Many artificial intelligence models process inputs of varying lengths and resolutions, which makes the shapes of their tensors dynamic. Because the performance of these models depends on tensor shape, it is difficult to optimize the corresponding operators before the model runs. There are two common solutions to this problem. The first pads the input with useless data so that it matches the shapes supported by a pre-optimized tensor library. The second composes small basic tensors into a tensor whose size is closest to the input and then tunes the composition to minimize padding; however, this tuning can be very time-consuming. This paper proposes FTuner, a new technique for deep learning compilers. Instead of searching a large design space or training a cost model, FTuner uses an abstract computational unit called the uKernel to patch together small tensors of various sizes so that they match the shape of the input tensor, and it determines the shape of the uKernel with an analytic hardware information model. Experiments show that FTuner achieves operator and end-to-end performance comparable to vendor libraries, and a 3\% speedup over an existing auto-tuner that relies on a model-training compiler, while reducing tuning time by two orders of magnitude.
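To make the contrast between whole-input padding and uKernel-style composition concrete, the sketch below is a hypothetical illustration only (it is not FTuner's algorithm, and all function names are invented): it compares rounding a dynamic dimension up to one pre-optimized tile size against covering the dimension with a mix of small tile sizes chosen to minimize padding. FTuner itself determines uKernel shapes with an analytic hardware information model rather than the brute-force search shown here.

\begin{verbatim}
# Hypothetical illustration (not from the paper): cover a dynamic
# dimension with small tiles so that total padding is minimized,
# instead of rounding the whole dimension up to one large tile.

def padding_single_tile(dim, tile):
    """Padding needed to round `dim` up to a multiple of one tile size."""
    return (-dim) % tile

def min_padding_cover(dim, tiles):
    """Smallest sum of tile sizes that is >= `dim`, plus the tiles used.
    Coin-style dynamic program over reachable totals (illustrative only)."""
    limit = dim + max(tiles)      # an optimal cover never needs to exceed this
    reachable = {0: []}           # total covered -> list of tiles achieving it
    for total in range(limit + 1):
        if total not in reachable:
            continue
        if total >= dim:          # first reachable total >= dim is minimal
            return total - dim, reachable[total]
        for t in tiles:
            nxt = total + t
            if nxt <= limit and nxt not in reachable:
                reachable[nxt] = reachable[total] + [t]

# Example: a dynamic dimension of 1000.
print(padding_single_tile(1000, 128))          # 24 padded elements
print(min_padding_cover(1000, [128, 80, 48]))  # (8, [...]) -- mixed tiles pad only 8
\end{verbatim}

Under these illustrative numbers, mixing small tile sizes pads 8 elements where single-tile rounding pads 24; the point of the sketch is only to show why composing small tensors can waste less work than padding to one large shape.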