Many sign language translation (SLT) systems implicitly assume that short segments of signing map directly onto spoken-language words. This assumption breaks down because signers often construct meaning on the fly through context, space, and movement. We revisit SLT and argue that it is primarily a cross-modal reasoning task rather than a straightforward video-to-text conversion. We therefore introduce a reasoning-driven SLT framework that places an ordered sequence of latent thoughts as an explicit intermediate layer between the video and the generated text. These latent thoughts progressively extract and organize meaning over time. On top of this, we use a plan-then-ground decoding scheme: the model first decides what it wants to say, and then looks back at the video to find supporting evidence. This separation improves coherence and faithfulness. We also construct a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at https://github.com/fletcherjiang/SignThought.
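To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes plain scaled dot-product attention with no learned weights, and every name in it (`attend`, `latent_thoughts`, `plan_then_ground`, the array shapes) is hypothetical. It shows one plausible reading of "an ordered sequence of latent thoughts" (each thought can see the video and the thoughts before it) and of "plan-then-ground" (a decoder state first attends to the thoughts, and the resulting plan is then used as a query over the video).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention; returns (output, attention weights).
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def latent_thoughts(video, queries):
    # Assumption: thoughts are built in order, and thought k attends to
    # the video features plus thoughts 0..k-1.
    thoughts = []
    for q in queries:
        context = np.vstack([video] + thoughts)
        out, _ = attend(q[None, :], context, context)
        thoughts.append(out[0])
    return np.stack(thoughts)

def plan_then_ground(state, thoughts, video):
    # Plan: decide what to say by attending to the latent thoughts only.
    plan, _ = attend(state[None, :], thoughts, thoughts)
    # Ground: look back at the video, using the plan as the query.
    evidence, weights = attend(plan, video, video)
    return plan[0], evidence[0], weights[0]

# Toy inputs: 12 video frames, 4 thought queries, feature size 8.
rng = np.random.default_rng(0)
video = rng.normal(size=(12, 8))
queries = rng.normal(size=(4, 8))
state = rng.normal(size=8)

thoughts = latent_thoughts(video, queries)
plan, evidence, frame_weights = plan_then_ground(state, thoughts, video)
print(thoughts.shape, plan.shape, evidence.shape, frame_weights.shape)
```

In this sketch `frame_weights` is a distribution over video frames, so it indicates which frames the grounding step relied on for a given decoding step. A real model would add learned projections, multiple heads, and a text decoder on top of `plan` and `evidence`.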