Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples and expensive in both training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) whose layers and heads are configured dynamically for each data sample by solving contextual bandit problems. We use the Upper Confidence Bound (UCB) algorithm to determine the number of layers and heads, and combinatorial Thompson Sampling to select the specific combination of heads given their number. Unlike previous work that focuses on compressing trained networks for inference only, DHT adaptively optimizes the underlying network architecture during training and also yields a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer that implements the dynamic system without any additional auxiliary neural networks. According to our experimental results, DHT achieves up to 74% computational savings in both training and inference with minimal loss of accuracy.
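
The two bandit components named in the abstract can be illustrated with a minimal sketch. It is not the DHT implementation: it uses a plain, non-contextual UCB1 rule to pick a layer count and Beta-Bernoulli combinatorial Thompson Sampling to pick a subset of heads, whereas the abstract describes contextual bandits. The class names, the shared-reward update, and `toy_reward` are all assumptions made for illustration.

```python
import math
import random


class UCB1:
    """Non-contextual UCB1 over discrete arms (here: candidate layer counts).

    Assumption: the abstract describes a contextual bandit; this simplified
    version ignores the per-sample context.
    """

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)
        self.means = [0.0] * len(self.arms)

    def select(self):
        # Play every arm once before applying the confidence bonus.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + math.sqrt(2.0 * math.log(total) / c)
            for m, c in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        # Incremental update of the running mean reward of arm i.
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


class CombinatorialThompson:
    """Beta-Bernoulli Thompson Sampling that picks k of n base arms (heads)."""

    def __init__(self, n_heads, rng=None):
        self.alpha = [1.0] * n_heads  # uniform Beta(1, 1) prior per head
        self.beta = [1.0] * n_heads
        self.rng = rng or random.Random()

    def select(self, k):
        # Sample one plausible value per head, then keep the k largest.
        samples = [
            self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)
        ]
        ranked = sorted(range(len(samples)), key=samples.__getitem__, reverse=True)
        return sorted(ranked[:k])

    def update(self, chosen, reward):
        # Assumption: one reward in [0, 1] is shared by all chosen heads.
        for h in chosen:
            self.alpha[h] += reward
            self.beta[h] += 1.0 - reward


def toy_reward(n_layers, heads):
    # Hypothetical stand-in for "accuracy minus compute cost", clipped to [0, 1].
    quality = min(1.0, 0.3 * n_layers) * (0.5 + 0.5 * (0 in heads))
    cost = 0.05 * n_layers + 0.02 * len(heads)
    return max(0.0, min(1.0, quality - cost))


if __name__ == "__main__":
    layer_bandit = UCB1(arms=[1, 2, 3, 4])
    head_bandit = CombinatorialThompson(n_heads=8, rng=random.Random(0))
    for _ in range(500):
        i = layer_bandit.select()
        heads = head_bandit.select(k=2)
        r = toy_reward(layer_bandit.arms[i], heads)
        layer_bandit.update(i, r)
        head_bandit.update(heads, r)
    print("pull counts per layer option:", layer_bandit.counts)
```

In this sketch the reward trades a quality proxy against a compute penalty, so both bandits are pushed toward small configurations that still score well. A real system would replace `toy_reward` with a signal derived from the model's loss and its actual compute cost.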