Language models need only an exponentially small fraction of their neurons for any individual inference. As proof, we present FastBERT, a BERT variant that uses 0.3\% of its neurons during inference while performing on par with similar BERT models. FastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving a 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering a 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
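To make the "12 out of 4095" figure concrete, the following is a minimal sketch of conditional execution in a tree-structured feedforward layer. It is not the released implementation. It assumes the 4095 neurons are arranged as a balanced binary tree of depth 11 (2^12 - 1 = 4095 nodes), that each input visits one neuron per level, and that the sign of a neuron's pre-activation selects the next child. The GELU activation, the toy dimensions, and the names `make_fff` and `fff_forward` are hypothetical choices for illustration only.

```python
import math
import random


def gelu(z):
    # Exact GELU via the Gaussian error function (activation choice is an assumption).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def make_fff(depth, dim_in, dim_out, seed=0):
    # One neuron per tree node: an input weight vector and an output weight vector.
    # Nodes are stored in heap order, so node i has children 2i+1 and 2i+2.
    rng = random.Random(seed)
    n_nodes = 2 ** (depth + 1) - 1
    w_in = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(dim_out)] for _ in range(n_nodes)]
    return w_in, w_out


def fff_forward(x, w_in, w_out, depth):
    # Walk a single root-to-leaf path; only depth + 1 neurons are evaluated.
    y = [0.0] * len(w_out[0])
    node = 0
    visited = []
    for _ in range(depth + 1):
        visited.append(node)
        z = sum(xi * wi for xi, wi in zip(x, w_in[node]))  # pre-activation
        a = gelu(z)
        for j in range(len(y)):
            y[j] += a * w_out[node][j]
        # Assumed routing rule: positive pre-activation goes right, otherwise left.
        node = 2 * node + 1 + (1 if z > 0 else 0)
    return y, visited


DEPTH = 11  # 2**12 - 1 = 4095 neurons in total
w_in, w_out = make_fff(DEPTH, dim_in=8, dim_out=8)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]
y, visited = fff_forward(x, w_in, w_out, DEPTH)
print(len(visited), len(w_in))  # prints "12 4095"
```

Under these assumptions, 12 of the 4095 neurons are evaluated per input, which is about 0.29% and matches the 0.3\% quoted above. The cost per inference grows with the tree depth rather than with the total number of neurons, which is where the potential speedup over a dense feedforward layer comes from.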