Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. For such algorithmic tasks, a key question is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that with sub-linear embedding dimension (i.e., model width), logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly while depth is kept fixed. We analyze this setting and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can enable much shallower models, which are advantageous in terms of inference and training time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width, and find tasks where wider models match the accuracy of deeper models while training and running inference much faster, thanks to hardware parallelism.