Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods

Sign language, which conveys meaning through gestures, is the chief means of communication among deaf people. Recognizing sign language in natural settings is challenging because of factors such as lighting, background clutter, and variation in signer characteristics. In this thesis, I study automatic sign language processing in the wild, using signing videos collected from the Internet. The thesis contributes new datasets, tasks, and methods. Most chapters address tasks related to fingerspelling, an important component of sign language that has not been studied widely in prior work. I present three new large-scale American Sign Language (ASL) datasets in the wild: ChicagoFSWild, ChicagoFSWild+, and OpenASL. Using ChicagoFSWild and ChicagoFSWild+, I address fingerspelling recognition, which consists of transcribing fingerspelling sequences into text. I propose an end-to-end approach based on iterative attention that allows recognition from raw video without explicit hand detection. I further show that a Conformer-based network that jointly models handshape and mouthing can bring performance close to that of humans. Next, I propose two tasks for building real-world fingerspelling-based applications: fingerspelling detection and fingerspelling search. For fingerspelling detection, I introduce a suite of evaluation metrics and a new detection model based on multi-task training. To address the problem of searching for fingerspelled keywords in raw sign language videos, I propose a novel method that jointly localizes fingerspelling segments and matches them to text. Finally, I describe a benchmark for large-vocabulary open-domain sign language translation based on OpenASL. To address the challenges of sign language translation in realistic settings, I propose a set of techniques, including sign search as a pretext task for pre-training and fusion of mouthing and handshape features.
