Transformer-based Language Models have become ubiquitous in Natural Language Processing (NLP) due to their impressive performance on various tasks. However, the expense of both training and inference remains a significant impediment to their widespread applicability. While enforcing sparsity at various levels of the model architecture has shown promise in addressing scaling and efficiency issues, there remains a disconnect between such sparsity approaches and the network topology they induce. Inspired by brain neuronal networks, we explore sparsity through the lens of network topology. Specifically, we exploit mechanisms seen in biological networks, such as preferential attachment and redundant synapse pruning, and show that these principled, model-agnostic sparsity approaches are performant and efficient across diverse NLP tasks, spanning both classification (such as natural language inference) and generation (summarization, machine translation), even though optimizing performance is not our sole objective. NeuroPrune is competitive with (and sometimes superior to) baselines on performance, can be up to $10\times$ faster to train for a given level of sparsity, and exhibits measurable improvements in inference time in many cases.
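As a rough illustration of the preferential-attachment idea mentioned above, the following minimal Python/PyTorch sketch prunes a linear layer so that output neurons with larger weighted degree retain proportionally more of their connections. This is only an assumption-laden toy (the function name `preferential_prune`, the degree definition, and the per-row budget are ours for illustration), not the paper's actual NeuroPrune procedure.

```python
# Illustrative sketch only (NOT the paper's implementation): a
# preferential-attachment-style prune of a PyTorch nn.Linear layer.
import torch
import torch.nn as nn

def preferential_prune(layer: nn.Linear, sparsity: float) -> None:
    """Zero out weak connections, favoring high-degree neurons.

    The 'degree' of an output neuron is taken here as the L1 norm of its
    weight row; neurons with larger degree keep proportionally more of
    their connections (rich-get-richer), loosely mimicking preferential
    attachment in biological networks.
    """
    with torch.no_grad():
        W = layer.weight                                # (out_features, in_features)
        degree = W.abs().sum(dim=1)                     # weighted degree per output neuron
        share = degree / degree.sum()                   # attachment share per neuron
        total_keep = int((1.0 - sparsity) * W.numel())  # connections kept overall
        keep_per_row = (share * total_keep).long().clamp(max=W.shape[1])
        for i, k in enumerate(keep_per_row.tolist()):
            k = max(k, 1)                               # keep at least one connection per neuron
            if k < W.shape[1]:
                # Keep the k largest-magnitude weights in this row, zero the rest.
                thresh = W[i].abs().topk(k).values[-1]
                W[i][W[i].abs() < thresh] = 0.0

# Example usage on a hypothetical feed-forward sublayer:
# preferential_prune(some_transformer_mlp_layer, sparsity=0.9)
```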