Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (Cproduct). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new Cproduct-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.
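The abstract does not define the Cproduct itself, but tensor products built on invertible transforms are typically computed by transforming along the third mode, multiplying frontal slices, and transforming back. The sketch below is a hedged illustration under that standard assumption, using an orthonormal DCT-II as the transform; the function name `cproduct` and the exact transform choice are assumptions, not the paper's definition.

```python
import numpy as np
from scipy.fft import dct, idct

def cproduct(A, B):
    """Assumed transform-domain sketch of a tensor cosine product.

    A has shape (m, p, n) and B has shape (p, q, n); the result has
    shape (m, q, n). Steps: DCT along mode 3, facewise matrix
    multiplication, inverse DCT. Orthonormal scaling keeps the
    transform orthogonal, matching the orthogonality the abstract
    attributes to cosine transforms.
    """
    Ahat = dct(A, type=2, axis=2, norm="ortho")
    Bhat = dct(B, type=2, axis=2, norm="ortho")
    # Multiply matching frontal slices in the transform domain.
    Chat = np.einsum("ipk,pjk->ijk", Ahat, Bhat)
    return idct(Chat, type=2, axis=2, norm="ortho")

# Example: the identity element under this product is the tensor whose
# transform-domain frontal slices are all identity matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((4, 2, 5))
C = cproduct(A, B)                      # shape (3, 2, 5)
Ehat = np.stack([np.eye(4)] * 5, axis=2)
E = idct(Ehat, type=2, axis=2, norm="ortho")
```

With this definition, `cproduct(A, E)` recovers `A`, since multiplying each transformed slice by the identity leaves it unchanged before the inverse transform is applied.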