The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNNs) to large Vision Transformers (ViTs). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the \textbf{Vision Tiny Recursion Model (ViTRM)}: a parameter-efficient architecture that replaces the $L$-layer ViT encoder with a single tiny $k$-layer block ($k{=}3$) applied recursively $N$ times. Despite using up to $6\times$ and $84\times$ fewer parameters than CNN-based models and ViTs, respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
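To make the recursive encoder concrete, the sketch below shows one possible PyTorch instantiation of the idea: a single $k$-layer Transformer block ($k{=}3$) whose weights are reused across $N$ refinement passes over the patch tokens, in place of $L$ distinct ViT layers. The specific hyperparameters (embedding width 192, 3 attention heads, $4{\times}4$ patches on $32{\times}32$ CIFAR inputs, $N{=}4$) and the mean-pooled classification head are illustrative assumptions, not the configuration reported here.

\begin{verbatim}
# Minimal sketch of a recursively applied tiny ViT encoder (assumed
# hyperparameters; not the paper's exact ViTRM configuration).
import torch
import torch.nn as nn

class ViTRMSketch(nn.Module):
    def __init__(self, num_classes=10, embed_dim=192, num_heads=3,
                 k_layers=3, n_recursions=4, patch_size=4, img_size=32):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: non-overlapping patches projected to embed_dim.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # A single tiny k-layer block (k=3), reused N times instead of
        # stacking L distinct layers as in a standard ViT.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.tiny_block = nn.TransformerEncoder(layer, num_layers=k_layers)
        self.n_recursions = n_recursions
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (B, 3, H, W) -> patch tokens of shape (B, num_patches, embed_dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        # Recursive refinement: the same block weights are applied N times.
        for _ in range(self.n_recursions):
            tokens = self.tiny_block(tokens)
        # Mean-pooled classifier head (illustrative choice).
        return self.head(self.norm(tokens.mean(dim=1)))

model = ViTRMSketch(num_classes=10)
logits = model(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
\end{verbatim}

Because the same block parameters are shared across all $N$ passes, the parameter count is that of the tiny $k$-layer block alone, while the effective depth of the computation grows with $N$.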