We introduce EELBERT, an approach for compressing transformer-based models (e.g., BERT) with minimal impact on downstream-task accuracy. EELBERT replaces the model's input embedding layer with dynamic, i.e., on-the-fly, embedding computations. Because the input embedding layer accounts for a significant fraction of the model size, especially in the smaller BERT variants, replacing it with an embedding computation function reduces the model size significantly. Empirical evaluation on the GLUE benchmark shows that our BERT variants (EELBERT) suffer minimal regression compared to the traditional BERT models. Our smallest model, UNO-EELBERT, achieves a GLUE score within 4% of fully trained BERT-tiny while being 15x smaller, at 1.2 MB.
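To make the idea of an on-the-fly embedding concrete, the following is a minimal sketch and not necessarily the computation used in EELBERT. It assumes a hypothetical scheme in which a token's character n-grams are hashed into a fixed-size vector, so no per-token weights need to be stored. The names `char_ngrams`, `dynamic_embedding`, and `table_params` are illustrative, and the vocabulary size (30522) and hidden size (128) are taken from the public BERT-tiny configuration rather than from the abstract.

```python
import hashlib

VOCAB_SIZE = 30522  # BERT WordPiece vocabulary size (assumption: public config)
HIDDEN_DIM = 128    # hidden size of BERT-tiny (assumption: public config)


def table_params(vocab_size: int = VOCAB_SIZE, dim: int = HIDDEN_DIM) -> int:
    """Number of stored weights in a conventional token-embedding lookup table."""
    return vocab_size * dim


def char_ngrams(token: str, n_max: int = 3) -> list[str]:
    """All character n-grams of length 1..n_max (a hypothetical feature choice)."""
    return [
        token[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def dynamic_embedding(token: str, dim: int = HIDDEN_DIM) -> list[float]:
    """Compute an embedding from the token string alone, with no stored table."""
    vec = [0.0] * dim
    for gram in char_ngrams(token):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h % dim                           # low bits pick the slot
        sign = 1.0 if (h >> 63) & 1 else -1.0     # top bit picks the sign
        vec[index] += sign
    norm = sum(x * x for x in vec) ** 0.5
    # Scale to unit length unless every contribution cancelled out.
    return [x / norm for x in vec] if norm else vec


if __name__ == "__main__":
    print("weights in a lookup table:", table_params())
    print("weights stored by the hashed embedding: 0")
    print("embedding length for 'hello':", len(dynamic_embedding("hello")))
```

Under these assumptions the lookup table alone holds 30522 x 128 weights, which illustrates why the embedding layer dominates the size of a small model, while the hashed function is deterministic and stores nothing per token.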