Silent speech interfaces (SSIs) enable silent interaction in noise-sensitive or privacy-sensitive settings. However, existing SSIs face practical deployment trade-offs among privacy, user experience, and energy consumption, and most remain limited to closed-set recognition over small, pre-defined vocabularies of words or sentences, which restricts real-world expressiveness. In this paper, we present Lip-Siri, to the best of our knowledge, the first Wi-Fi backscatter--based SSI that supports open-vocabulary sentence recognition via lexicon-guided subword decoding. Lip-Siri introduces a frequency-shifted backscatter tag that isolates tag-modulated reflections and suppresses interference from non-target motions, enabling reliable extraction of lip-motion traces from ubiquitous Wi-Fi signals. We then segment continuous traces into lip-motion units, cluster them, learn robust unit representations via cluster-based self-supervision, and finally propose a lexicon-guided Transformer encoder--decoder with beam search to decode variable-length sentence sequences. We implement an end-to-end prototype and evaluate it with 15 participants on 340 sentences and 3,398 words across multiple scenarios. Lip-Siri achieves 85.61% accuracy on word prediction and a WER of 36.87% on continuous sentence recognition, approaching the performance of representative vision-based lip-reading systems.
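The lexicon-guided decoding mentioned above can be illustrated with a minimal sketch: a beam search over subword tokens that prunes any hypothesis whose spelling cannot extend to a lexicon word, and accepts end-of-sequence only when a complete lexicon entry has been spelled. The vocabulary, lexicon, and probability table here are invented for illustration and are not the paper's actual model or data.

```python
import math

# Hypothetical toy subword vocabulary and lexicon; these names and
# probabilities are illustrative only, not taken from the paper.
VOCAB = ["he", "ll", "o", "hi", "<eos>"]
LEXICON = {"hello", "hi"}

def toy_log_probs(prefix):
    """Stand-in for a trained decoder: log-probabilities over VOCAB given
    the decoded subword prefix, hand-crafted so that the path
    'he' -> 'll' -> 'o' -> '<eos>' is the most likely one."""
    table = {
        (): {"he": 0.6, "hi": 0.3, "ll": 0.04, "o": 0.04, "<eos>": 0.02},
        ("he",): {"ll": 0.8, "o": 0.1, "he": 0.04, "hi": 0.04, "<eos>": 0.02},
        ("he", "ll"): {"o": 0.9, "he": 0.03, "hi": 0.03, "ll": 0.02, "<eos>": 0.02},
    }
    probs = table.get(prefix, {t: 1.0 / len(VOCAB) for t in VOCAB})
    return {t: math.log(p) for t, p in probs.items()}

def lexicon_beam_search(log_prob_fn, lexicon, beam_width=3, max_len=5):
    """Beam search constrained by a lexicon: hypotheses whose spelling is
    not a prefix of any lexicon word are pruned, and '<eos>' is allowed
    only when the hypothesis spells a complete lexicon entry."""
    beams = [((), 0.0)]  # (subword prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":  # already finished
                candidates.append((prefix, score))
                continue
            spelled = "".join(prefix)
            for tok, lp in log_prob_fn(prefix).items():
                if tok == "<eos>":
                    if spelled not in lexicon:    # must end on a full word
                        continue
                elif not any(w.startswith(spelled + tok) for w in lexicon):
                    continue                      # dead-end spelling: prune
                candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished = [b for b in beams if b[0] and b[0][-1] == "<eos>"]
    best = max(finished, key=lambda b: b[1])
    return "".join(t for t in best[0] if t != "<eos>")
```

In this toy setup, `lexicon_beam_search(toy_log_probs, LEXICON)` decodes "hello": the lexicon constraint discards spellings such as "heo" early, which is the mechanism that lets a subword decoder remain open-vocabulary at the unit level while still emitting only valid words.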