DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization

Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
