Word-level lipreading approaches typically employ a two-stage framework, with separate frontend and backend architectures, to model dynamic lip movements. Each component has been extensively studied, and temporal convolutional networks (TCNs) have been widely adopted as the backend in state-of-the-art methods. Recently, dense skip connections have been introduced into TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained because blind spots in the receptive field can cause a loss of information about the continuous nature of lip movements. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It also achieves higher accuracy with fewer parameters and fewer floating-point operations than existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, which is a notable advantage for lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).
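
The blind-spot argument can be illustrated with a small receptive-field calculation. The sketch below is not the TD3Net implementation. It assumes a kernel size of 3, dilation factors that grow as powers of two, and a dense block in which each layer receives all earlier features; the function names are hypothetical. For each skip-connected feature, it lists the time offsets inside the covered span that a layer never reads, first when one dilation factor is shared by all skip-connected features and then when each feature gets its own factor.

```python
def conv_offsets(input_offsets, dilation, kernel_size=3):
    """Time offsets reached by a dilated 1D conv applied to a feature
    that already covers `input_offsets` of the raw input."""
    half = kernel_size // 2
    taps = [k * dilation for k in range(-half, half + 1)]
    return {o + t for o in input_offsets for t in taps}


def blind_spots(offsets):
    """Offsets inside [min, max] of the covered span that are never read."""
    lo, hi = min(offsets), max(offsets)
    return sorted(set(range(lo, hi + 1)) - offsets)


def dense_block_paths(num_layers, multi_dilated):
    """For every layer, return the blind spots of each skip-connected path.

    Assumed schedule (illustrative only):
      single dilation : layer l applies dilation 2**l to every input feature
      multi-dilated   : layer l applies dilation 2**i to the i-th input feature
    """
    features = [{0}]  # x0: the raw input covers only its own time step
    report = []
    for layer in range(num_layers):
        paths = []
        for i, feat in enumerate(features):
            dilation = 2 ** i if multi_dilated else 2 ** layer
            paths.append(conv_offsets(feat, dilation))
        report.append([blind_spots(p) for p in paths])
        # The layer output sees the union of everything its paths read.
        features.append(set().union(*paths))
    return report


if __name__ == "__main__":
    for name, flag in [("single dilation", False), ("multi-dilated", True)]:
        print(name)
        for layer, spots in enumerate(dense_block_paths(3, flag)):
            print(f"  layer {layer}: blind spots per skip path = {spots}")
```

Under these assumptions, the shared-dilation block skips intermediate time steps on the paths that come from early features, while the multi-dilated block leaves every path gap-free. This matches the motivation stated above for pairing dense skip connections with per-feature dilation factors.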