Isolated Sign Language Recognition (ISLR) is challenged by gestures that are morphologically similar yet semantically distinct, a problem rooted in the complex interplay between hand shape and motion trajectory. Existing methods, which often rely on a single reference frame, struggle to resolve this geometric ambiguity. This paper introduces Dual-SignLanguageNet (DSLNet), a dual-reference, dual-stream architecture that decouples gesture morphology from trajectory and models each in its own complementary coordinate system. Our approach uses a wrist-centric frame for view-invariant shape analysis and a facial-centric frame for context-aware trajectory modeling. The two streams are processed by specialized networks (a topology-aware graph convolution for shape and a Finsler geometry-based encoder for trajectory) and are integrated via a geometry-driven optimal transport fusion mechanism. DSLNet sets a new state of the art, achieving 93.70%, 89.97%, and 99.79% accuracy on the challenging WLASL-100, WLASL-300, and LSA64 datasets, respectively, with significantly fewer parameters than competing models.
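To make the dual-reference idea concrete, the following is a minimal sketch of how one set of hand keypoints could be expressed in two coordinate systems. It is an illustration under stated assumptions, not the DSLNet implementation: the function name `dual_reference_frames`, the landmark layout (wrist at index 0), the single per-frame facial anchor point, and the per-frame scale normalization are all hypothetical. Subtracting the wrist removes only translation (and, here, scale); whatever further steps the paper takes to obtain view invariance are not reproduced.

```python
import numpy as np

def dual_reference_frames(hand, face_anchor, wrist_idx=0, eps=1e-8):
    """Express hand keypoints in two reference frames (illustrative sketch).

    hand:        array of shape (T, J, 3), J hand joints over T frames.
    face_anchor: array of shape (T, 3), one facial reference point per frame
                 (for example a nose landmark; this choice is an assumption).
    Returns (shape, trajectory):
      shape      (T, J, 3) joints relative to the wrist, scaled per frame.
      trajectory (T, 3)    wrist position relative to the facial anchor.
    """
    hand = np.asarray(hand, dtype=float)
    face_anchor = np.asarray(face_anchor, dtype=float)

    wrist = hand[:, wrist_idx, :]                          # (T, 3)

    # Wrist-centric frame: hand shape no longer depends on where the hand is.
    shape = hand - wrist[:, None, :]
    # Assumed normalization: divide by the largest wrist-to-joint distance.
    scale = np.linalg.norm(shape, axis=-1).max(axis=-1)    # (T,)
    shape = shape / (scale[:, None, None] + eps)

    # Facial-centric frame: where the hand moves relative to the face.
    trajectory = wrist - face_anchor

    return shape, trajectory
```

In this sketch the `shape` output would feed a shape stream and the `trajectory` output a trajectory stream. Shifting the whole signer in the camera frame changes neither output, which is the property that motivates using two body-anchored frames rather than a single global one.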