Deploying large language models (LLMs) is challenging because of their intensive computational and memory requirements. We examine vocabulary trimming (VT), which restricts embedding entries to the language of interest, as a way to improve time and memory efficiency. While such modifications have proven effective in tasks like machine translation, adapting them to LLMs requires specific changes given the diverse nature of LLM applications. We apply two language heuristics to trim the full vocabulary, Unicode-based script filtering and corpus-based selection, to different LLM families and sizes. The methods are straightforward, interpretable, and easy to implement. We find that VT reduces the memory usage of small models by nearly 50% and improves generation speed by at most 25%. However, these methods have limitations: they do not perform consistently well across languages, and their returns diminish in larger models.
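As an illustration only, the two trimming heuristics can be sketched on a toy vocabulary. This is a minimal sketch under stated assumptions, not the implementation evaluated in this work: the function names, the Latin-script test based on Unicode character names, the top-k frequency cutoff, and the toy data are all hypothetical choices made for the example. Real subword vocabularies (byte-level tokens, special tokens, tied output layers) need additional handling that is omitted here.

```python
import unicodedata
from collections import Counter


def is_latin_or_neutral(token: str) -> bool:
    """True if every alphabetic character in the token is a Latin-script letter.

    Digits, punctuation, and symbols are treated as script-neutral and kept.
    (Assumption: the target language is written in Latin script.)
    """
    for ch in token:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


def trim_by_script(vocab: dict) -> list:
    """Unicode-based script filtering: keep ids of tokens in the target script."""
    return sorted(i for tok, i in vocab.items() if is_latin_or_neutral(tok))


def trim_by_corpus(token_id_sequences, keep_top_k: int, always_keep=()) -> list:
    """Corpus-based selection: keep the most frequent ids in a tokenized corpus.

    `always_keep` is a hypothetical hook for special tokens that must survive.
    """
    counts = Counter(i for seq in token_id_sequences for i in seq)
    kept = set(always_keep)
    kept.update(i for i, _ in counts.most_common(keep_top_k))
    return sorted(kept)


def slice_embeddings(embedding_rows, kept_ids):
    """Build the smaller embedding table and the old-id -> new-id mapping."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    trimmed = [embedding_rows[old] for old in kept_ids]
    return trimmed, old_to_new


# Toy data (hypothetical): token text -> id, and a tiny tokenized corpus.
toy_vocab = {"<s>": 0, "hello": 1, "world": 2, "привет": 3, "你好": 4, "!": 5, "café": 6}
toy_corpus = [[0, 1, 2, 5], [0, 1, 5], [0, 6, 1]]
toy_embeddings = [[float(i)] * 2 for i in range(len(toy_vocab))]

script_ids = trim_by_script(toy_vocab)
corpus_ids = trim_by_corpus(toy_corpus, keep_top_k=3, always_keep=(0,))
trimmed_rows, id_map = slice_embeddings(toy_embeddings, corpus_ids)

print(script_ids)         # [0, 1, 2, 5, 6]
print(corpus_ids)         # [0, 1, 5]
print(len(trimmed_rows))  # 3
```

In this sketch the memory saving comes entirely from dropping rows of the embedding table, which is consistent with the observation that the relative benefit shrinks as the non-embedding parameters come to dominate in larger models.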