Large-scale pretrained language models have achieved compelling performance on a wide range of language understanding and information retrieval tasks. Knowledge distillation offers a way to compress a large language model into a small one and thereby reach a reasonable latency-performance tradeoff. However, in scenarios where the number of requests (e.g., queries submitted to a search engine) is highly variable, the static tradeoff attained by the compressed language model may not always fit. Once a model is assigned a static tradeoff, it can be inadequate in two ways: the latency is too high when the number of requests is large, or the performance is too low when the number of requests is small. To this end, we propose an elastic language model (ElasticLM) that elastically adjusts the tradeoff according to the request stream. The basic idea is to introduce compute elasticity into the compressed language model, so that the tradeoff can vary on the fly with scalable and controllable compute. Specifically, we impose an elastic structure to equip ElasticLM with compute elasticity and design an elastic optimization to learn ElasticLM under compute elasticity. To serve ElasticLM, we apply an elastic schedule. Considering the specificity of information retrieval, we adapt ElasticLM to dense retrieval and reranking, and present ElasticDenser and ElasticRanker, respectively. Offline evaluation is conducted on the language understanding benchmark GLUE and on several information retrieval tasks, including Natural Questions, TriviaQA, and MS MARCO. The results show that ElasticLM, along with ElasticDenser and ElasticRanker, performs correctly and competitively compared with an array of static baselines. Furthermore, online simulation with concurrency is also carried out. The results demonstrate that ElasticLM can provide elastic tradeoffs with respect to a varying request stream.