Large language models (LLMs) have achieved state-of-the-art performance across a range of language processing tasks, motivating their adoption in simultaneous translation. Current fine-tuning methods that adapt LLMs for simultaneous translation rely on prompting optimization strategies, using either data augmentation or prompt structure modifications. However, these methods suffer from several issues: an unnecessarily expanded training set, computational inefficiency from dumping the KV cache, increased prompt sizes, or restriction to a single decision policy. To eliminate these issues, we propose SimulMask, a new paradigm for fine-tuning LLMs for simultaneous translation. It uses a novel attention mask technique that models simultaneous translation during fine-tuning by masking attention connections under a desired decision policy. Applying SimulMask to a Falcon LLM on the IWSLT 2017 dataset, we observe a significant improvement in translation quality over state-of-the-art prompting optimization strategies on three language pairs, averaged across four latency regimes, while reducing computational cost.
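To make the idea of masking attention connections under a decision policy concrete, the following is a minimal sketch and not the SimulMask construction itself. It assumes a wait-k policy (the abstract only says "a desired decision policy"), a sequence laid out as all source tokens followed by all target tokens, and a simplified alignment in which the row of target token j may see the first min(k + j, S) source tokens. The function name `wait_k_attention_mask` and its parameters are hypothetical.

```python
import numpy as np


def wait_k_attention_mask(num_source: int, num_target: int, k: int) -> np.ndarray:
    """Boolean mask of shape (L, L), L = num_source + num_target.

    True means "query row may attend to key column".
    Assumed layout: source tokens first, then target tokens.
    Assumed policy: wait-k (an illustrative choice, not taken from the abstract).
    """
    total = num_source + num_target
    # Start from the ordinary causal (lower-triangular) mask.
    mask = np.tril(np.ones((total, total), dtype=bool))
    for j in range(num_target):
        row = num_source + j
        # Number of source tokens assumed to have been read at target step j.
        visible = min(k + j, num_source)
        # Hide the source tokens that would not have been read yet.
        mask[row, visible:num_source] = False
    return mask


if __name__ == "__main__":
    # 4 source tokens, 3 target tokens, wait-2: print the mask as 0/1.
    print(wait_k_attention_mask(4, 3, 2).astype(int))
```

In a training framework, a boolean mask like this would typically be converted to an additive mask (0 where attention is allowed, a large negative value elsewhere) before being added to the attention scores.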