Large language model (LLM) inference is computation- and memory-intensive, so we adapt lexical shortlisting to it in the hope of reducing both costs. While lexical shortlisting is well explored in tasks such as machine translation, it requires modification before it is suitable for LLMs, because the intended applications differ significantly. Our work studies two heuristics for shortlisting a sub-vocabulary at LLM inference time: Unicode-based script filtering and corpus-based selection. We explore different LLM families and sizes, and we find that lexical shortlisting can reduce the memory usage of some models by nearly 50\% and improve generation speed by at most 25\%. In this pilot study, we also identify the drawbacks of such vocabulary selection methods and propose avenues for future research.
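To make the two heuristics concrete, the following is a minimal sketch on a toy vocabulary, not the implementation used in this work. It assumes a vocabulary given as a plain list of token strings indexed by token id, and it approximates a character's script by the first word of its Unicode name, since Python's standard library does not expose the Unicode Script property directly. The names `char_script`, `script_filter`, `corpus_filter`, and the `min_count` parameter are hypothetical.

```python
import unicodedata
from collections import Counter


def char_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # code point without a name
        return "UNKNOWN"


def script_filter(vocab, allowed_scripts):
    """Unicode-based script filtering (assumed form).

    Keeps the ids of tokens whose alphabetic characters all belong to the
    allowed scripts. Tokens without letters (digits, punctuation) are kept.
    """
    kept = []
    for token_id, token in enumerate(vocab):
        letters = [ch for ch in token if ch.isalpha()]
        if all(char_script(ch) in allowed_scripts for ch in letters):
            kept.append(token_id)
    return kept


def corpus_filter(tokenized_corpus, min_count=1):
    """Corpus-based selection (assumed form).

    Keeps the ids of tokens that occur at least `min_count` times in a
    representative, already tokenized corpus.
    """
    counts = Counter(tid for seq in tokenized_corpus for tid in seq)
    return sorted(tid for tid, c in counts.items() if c >= min_count)


# Toy vocabulary: token id is the list index.
toy_vocab = ["the", "cat", "привет", "猫", "123", "!"]

latin_ids = script_filter(toy_vocab, {"LATIN"})
corpus_ids = corpus_filter([[0, 1, 5], [0, 4]])
frequent_ids = corpus_filter([[0, 1, 5], [0, 4]], min_count=2)
```

In a real model, the retained ids would be used to select the corresponding rows of the embedding and output projection matrices, which is where the memory and speed savings would come from; that step is omitted here because it depends on the model implementation.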