Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples and expensive in both training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) whose layers and heads are configured dynamically for each data sample by solving contextual bandit problems. We use the Upper Confidence Bound (UCB) algorithm to determine the number of layers and heads, and combinatorial Thompson Sampling to select the specific combination of heads given their number. Unlike previous work, which focuses on compressing trained networks for inference only, DHT both adaptively optimizes the underlying network architecture during training and yields a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer that implements the dynamic system without any additional auxiliary neural networks. In our experiments, DHT achieves up to 74% computational savings in both training and inference with minimal loss of accuracy.
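To make the two bandit components concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a non-contextual simplification (the abstract describes contextual bandits), a hypothetical reward in [0, 1] that would combine accuracy and compute cost, and independent Beta posteriors per head. The class names `UCB1` and `HeadThompson` and all parameters are illustrative assumptions.

```python
import math
import random


class UCB1:
    """UCB1 bandit over discrete arms, e.g. candidate layer or head counts.

    Assumption: non-contextual; the paper's bandits are contextual.
    """

    def __init__(self, arms, c=2.0):
        self.arms = list(arms)          # e.g. [2, 4, 6] candidate layer counts
        self.c = c                      # exploration constant (assumed value)
        self.counts = [0] * len(self.arms)
        self.means = [0.0] * len(self.arms)

    def select(self):
        # Play every arm once before relying on the confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + math.sqrt(self.c * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        # Incremental update of the running mean reward for arm i.
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


class HeadThompson:
    """Combinatorial Thompson Sampling: one Beta posterior per head, keep top-k."""

    def __init__(self, num_heads, rng=None):
        self.alpha = [1.0] * num_heads  # Beta(1, 1) uniform priors
        self.beta = [1.0] * num_heads
        self.rng = rng if rng is not None else random.Random(0)

    def select(self, k):
        # Sample a plausible usefulness for each head, then keep the k best.
        samples = [
            self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)
        ]
        ranked = sorted(range(len(samples)), key=samples.__getitem__, reverse=True)
        return sorted(ranked[:k])

    def update(self, heads, reward):
        # Hypothetical reward in [0, 1], shared by every selected head.
        for h in heads:
            self.alpha[h] += reward
            self.beta[h] += 1.0 - reward
```

In such a setup, one `UCB1` instance would pick the number of layers and another the number of heads `k`, after which `HeadThompson.select(k)` chooses which heads to keep for the current sample; the observed reward is then fed back through both `update` methods. How the reward trades accuracy against computation is left open here, since the abstract does not specify it.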