Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples and expensive in both training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads are configured dynamically for each data sample by solving contextual bandit problems. We use the Upper Confidence Bound (UCB) algorithm to determine the number of layers and heads, and combinatorial Thompson Sampling to select the specific head combination given that number. Unlike previous work, which focuses on compressing trained networks for inference only, DHT both adaptively optimizes the underlying network architecture during training and provides a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer that implements the dynamic system without any additional auxiliary neural networks. In our experiments, DHT achieves up to 74% computational savings in both training and inference with minimal loss of accuracy.
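The two-stage selection described in the abstract can be illustrated with a minimal sketch. It is not the DHT implementation: it swaps the contextual bandit for plain, non-contextual UCB1 to choose how many heads to activate, and uses independent Beta posteriors with a top-k rule as a simple form of combinatorial Thompson Sampling to choose which heads. The class names `UCB1` and `TopKThompson`, the candidate head counts, the set of "useful" heads, and the toy reward (coverage of useful heads minus a per-head cost) are all assumptions made for illustration.

```python
import math
import random


class UCB1:
    """Plain (non-contextual) UCB1 over a discrete set of arms; rewards in [0, 1]."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)
        self.means = [0.0] * len(self.arms)

    def select(self):
        # Play every arm once before relying on the confidence bound.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + math.sqrt(2.0 * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        # Incremental update of the running mean reward of arm i.
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


class TopKThompson:
    """Thompson Sampling over heads: sample each Beta posterior, keep the top k."""

    def __init__(self, num_heads, rng):
        self.alpha = [1.0] * num_heads  # Beta(1, 1) prior per head
        self.beta = [1.0] * num_heads
        self.rng = rng

    def select(self, k):
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(samples)), key=samples.__getitem__, reverse=True)
        return sorted(ranked[:k])

    def update(self, chosen, reward):
        # A reward in [0, 1] is credited as a fractional success to each chosen head.
        for h in chosen:
            self.alpha[h] += reward
            self.beta[h] += 1.0 - reward


def demo(steps=200, seed=0):
    rng = random.Random(seed)
    head_counts = [1, 2, 4, 8]  # hypothetical candidate numbers of active heads
    count_bandit = UCB1(head_counts)
    head_bandit = TopKThompson(num_heads=8, rng=rng)
    useful = {0, 3}  # toy ground truth: only these heads help
    for _ in range(steps):
        arm = count_bandit.select()
        k = head_counts[arm]
        chosen = head_bandit.select(k)
        # Toy reward: coverage of useful heads minus a cost per active head.
        coverage = len(useful.intersection(chosen)) / len(useful)
        reward = max(0.0, coverage - 0.05 * k)
        count_bandit.update(arm, reward)
        head_bandit.update(chosen, reward)
    return count_bandit, head_bandit
```

Calling `demo()` returns both bandits, so `count_bandit.counts` and the per-head `alpha` and `beta` values can be inspected to see which configurations were favored. In a real model the reward would presumably come from the training signal and compute cost of the sampled sub-network, and the choice would be conditioned on the input sample.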