Previous Sign Language Translation (SLT) methods achieve strong performance by relying on gloss annotations. However, producing high-quality gloss labels is labor-intensive, which limits the further development of SLT. Although some approaches pursue gloss-free SLT by jointly training the visual encoder and the translation network, they still suffer from poor performance and make inefficient use of powerful Large Language Models (LLMs). Most seriously, we find that directly introducing an LLM into SLT leads to insufficient learning of visual representations, because the LLM dominates the learning process. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initialing stage, we attach a lightweight translation model to the visual encoder in order to pre-train the encoder. In the LLM fine-tuning stage, we freeze the visual encoder, preserving the knowledge it has acquired, and integrate it with a pre-trained LLM to unlock the LLM's translation potential. This factorized training strategy proves highly effective, as evidenced by significant improvements on three SLT datasets, all evaluated under the gloss-free setting.
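The two-stage schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Module` class and all component names are hypothetical stand-ins that only record which components receive parameter updates in each stage, and no real network or training loop is modeled.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a network component.

    It only tracks whether the component's parameters are updated.
    """
    name: str
    trainable: bool = True


def trainable_names(modules):
    """Return the names of the components that would receive updates."""
    return [m.name for m in modules if m.trainable]


def stage1_visual_initialization(visual_encoder, light_translator):
    """Stage 1: pre-train the visual encoder with a lightweight translator."""
    visual_encoder.trainable = True
    light_translator.trainable = True
    return [visual_encoder, light_translator]


def stage2_llm_finetuning(visual_encoder, llm):
    """Stage 2: freeze the visual encoder and fine-tune the pre-trained LLM."""
    visual_encoder.trainable = False  # keep the knowledge acquired in stage 1
    llm.trainable = True
    return [visual_encoder, llm]


if __name__ == "__main__":
    encoder = Module("visual_encoder")
    light = Module("light_translator")
    llm = Module("llm")

    print(trainable_names(stage1_visual_initialization(encoder, light)))
    # ['visual_encoder', 'light_translator']
    print(trainable_names(stage2_llm_finetuning(encoder, llm)))
    # ['llm']
```

The point of the sketch is the separation itself: the LLM is absent while the visual encoder is being trained, so it cannot dominate that stage, and the encoder is fixed by the time the LLM is introduced.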