Sequence-to-sequence language models can produce abstractive summaries that are coherent, relevant, and concise. Still, their size can make deployment in latency-sensitive or web-scale settings difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to encoder size, while inference efficiency is tied to the decoder. Asymmetric pruning can yield a nearly 3x improvement in inference latency at a cost of roughly 1 point in ROUGE-2. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and variations in datasets.
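To give some intuition for why the decoder dominates inference cost, the following is a minimal sketch and not the procedure used in this paper. It assumes that asymmetric pruning means removing more decoder layers than encoder layers, and it relies on the standard property of autoregressive generation that the encoder runs once per input while the decoder runs once per generated token. The function names, the evenly spaced layer selection, the layer counts, and the output length are all hypothetical, and the cost model deliberately ignores layer width, cross-attention, caching, and hardware effects.

```python
def keep_evenly_spaced(num_layers: int, num_keep: int) -> list[int]:
    """Return indices of layers to retain, spread evenly across the stack.

    Hypothetical selection rule used only for illustration.
    """
    if not 1 <= num_keep <= num_layers:
        raise ValueError("num_keep must be between 1 and num_layers")
    if num_keep == 1:
        return [0]
    step = (num_layers - 1) / (num_keep - 1)
    return [round(i * step) for i in range(num_keep)]


def layer_passes(enc_layers: int, dec_layers: int, output_tokens: int) -> int:
    """Toy cost in layer passes: encoder once, decoder once per output token."""
    return enc_layers + dec_layers * output_tokens


# Hypothetical configuration: 12 encoder layers, 12 decoder layers, 64 output tokens.
ENC, DEC, TOKENS = 12, 12, 64

baseline = layer_passes(ENC, DEC, TOKENS)
pruned_decoder = layer_passes(ENC, len(keep_evenly_spaced(DEC, 4)), TOKENS)
pruned_encoder = layer_passes(len(keep_evenly_spaced(ENC, 4)), DEC, TOKENS)

print("baseline layer passes:      ", baseline)
print("decoder pruned to 4 layers: ", pruned_decoder)
print("encoder pruned to 4 layers: ", pruned_encoder)
```

Under this toy model, removing decoder layers reduces the cost far more than removing the same number of encoder layers, because every decoder layer is paid for once per generated token. The figures it prints are layer-pass counts, not measured latencies, and should not be read as a reproduction of the results reported here.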