SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation

We present SignDPO, a novel multi-level Direct Preference Optimisation (DPO) framework designed to improve alignment in skeleton-based Sign Language Translation (SLT). Although current skeleton-based models have made significant progress with Maximum Likelihood Estimation (MLE), they remain constrained by an imitation-based paradigm that lacks discriminative sensitivity to the fine-grained spatio-temporal nuances of sign language, which often leads to semantic drift. To address this, SignDPO shifts the optimisation goal from simple sequence mimicry to structured preference alignment across spatial, temporal, and linguistic dimensions. Our framework rests on three key designs. First, we introduce a hierarchical perturbation strategy that automatically constructs spatial and temporal non-preferred samples at both global and local granularities. Second, we propose a self-guiding mechanism that uses decoder cross-attention scores to identify and perturb semantically salient skeletal regions, forcing the model to distinguish genuine sign signals from structural distortions. Third, we build an automated language-level preference generator by fine-tuning a dedicated perturbation model, which captures complex output-level failure modes without manual annotation. Extensive experiments on three widely adopted benchmarks, CSL-Daily, How2Sign, and OpenASL, demonstrate that SignDPO consistently outperforms state-of-the-art gloss-free methods and even rivals established gloss-based ones. Our results suggest that multi-level preference alignment is a powerful paradigm for bridging the gap between high-entropy skeletal trajectories and discrete linguistic semantics.
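The abstract does not specify the loss or the perturbation operators. As a rough illustration only, the sketch below pairs the standard DPO objective with two hypothetical perturbations: a global temporal shuffle and local spatial noise on selected joints. The function names, the `(frames, joints, coords)` array layout, the noise scale `sigma`, and the value of `beta` are all assumptions made for this example and are not taken from SignDPO itself.

```python
import numpy as np


def dpo_loss(policy_pref_logp, policy_nonpref_logp,
             ref_pref_logp, ref_nonpref_logp, beta=0.1):
    """Standard DPO loss on per-example sequence log-probabilities.

    Each argument is an array of shape (batch,). `beta` is an assumed value.
    """
    pref_ratio = policy_pref_logp - ref_pref_logp
    nonpref_ratio = policy_nonpref_logp - ref_nonpref_logp
    logits = beta * (pref_ratio - nonpref_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -logits)))


def perturb_temporal_global(skeleton, rng):
    """Hypothetical global temporal perturbation: shuffle the frame order.

    `skeleton` has the assumed shape (frames, joints, coords).
    """
    order = rng.permutation(skeleton.shape[0])
    return skeleton[order]


def perturb_spatial_local(skeleton, joint_ids, rng, sigma=0.1):
    """Hypothetical local spatial perturbation: add Gaussian noise to chosen joints.

    `joint_ids` could, for instance, come from attention-based saliency;
    here it is simply supplied by the caller.
    """
    out = skeleton.copy()
    noise = rng.normal(0.0, sigma, size=(out.shape[0], len(joint_ids), out.shape[2]))
    out[:, joint_ids, :] += noise
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(8, 5, 3))  # toy clip: 8 frames, 5 joints, 3D coords
    shuffled = perturb_temporal_global(clip, rng)
    noisy = perturb_spatial_local(clip, [1, 3], rng)
    print(shuffled.shape, noisy.shape)
    # Toy log-probabilities: the policy prefers the clean input more than the reference does.
    print(dpo_loss(np.array([-2.0]), np.array([-8.0]),
                   np.array([-5.0]), np.array([-5.0])))
```

In a setting like the one described, the preferred log-probability would presumably be that of the reference translation given the clean skeleton sequence, and the non-preferred one would be computed on a perturbed sequence or a perturbed translation. That pairing is an interpretation of the abstract, not a confirmed detail of the method.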
