Can transformers generalize efficiently on problems whose examples vary in difficulty? We introduce new tasks tailored to assess generalization across levels of complexity and present results indicating that standard transformers struggle to solve them. The tasks are variations of pointer value retrieval, previously introduced by Zhang et al. (2021). We investigate how a mechanism for adaptive and modular computation in transformers facilitates learning tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph). Based on our observations, we propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hypernetworks with adaptive depth from Universal Transformers. This model achieves higher accuracy and a fairer allocation of computational resources when generalizing to higher numbers of computation steps. We conclude that mechanisms for adaptive depth and modularity complement each other in improving efficient generalization with respect to example complexity. Additionally, to emphasize the broad applicability of our findings, we show that on a standard image recognition task, Hyper-UT matches the performance of a ViT model with considerably lower computational demands (over 70% average savings, achieved by effectively using fewer layers).
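As a rough illustration of what "number of sequential computation steps" can mean in a pointer-style task, the toy sketch below reads each entry of a sequence as the index of the next position to visit, so the answer depends on how many hops are followed. The function name `follow_pointers` and this encoding are assumptions made for illustration only; they are not the task definition from Zhang et al. (2021) or the exact variations studied here.

```python
def follow_pointers(seq, hops):
    """Toy pointer chasing: start at index 0, follow `hops` pointers,
    then return the value stored at the final position.

    Hypothetical illustration only: each entry of `seq` is treated as
    the index of the next position to read.
    """
    pos = 0
    for _ in range(hops):
        pos = seq[pos]  # one sequential step: jump to the pointed-to index
    return seq[pos]


seq = [3, 0, 4, 1, 2]
for hops in (1, 2, 3):
    # More hops means a deeper computation graph for the same input.
    print(hops, follow_pointers(seq, hops))
```

Each additional hop depends on the result of the previous one, so the steps cannot be collapsed into a single lookup. A model trained only on small `hops` must reuse the same step more times to handle larger values.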