Silent speech interfaces are a promising technology that enables private communication in natural language. However, previous approaches support only a small, inflexible vocabulary, which limits expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model is highly robust to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable using only one shot, and performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop LipLearner, a mobile silent speech interface with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
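To make the idea of few-shot command customization concrete, the following is a minimal sketch of how a new command could be registered from a single example and refined as more examples arrive. It is an illustration under stated assumptions, not the method described in the abstract: the abstract does not specify the classifier, so nearest-prototype matching with cosine similarity is assumed here, and the class `FewShotCommands`, its methods, and the three-dimensional vectors standing in for lipreading embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


class FewShotCommands:
    """Hypothetical nearest-prototype classifier over precomputed embeddings."""

    def __init__(self):
        # Maps each command label to the list of unit embeddings seen so far.
        self.shots = {}

    def add_shot(self, label, embedding):
        """Register one example; the first call for a label is the one-shot case."""
        self.shots.setdefault(label, []).append(normalize(embedding))

    def prototype(self, label):
        """Average all shots of a command and renormalize the mean."""
        shots = self.shots[label]
        dim = len(shots[0])
        mean = [sum(s[i] for s in shots) / len(shots) for i in range(dim)]
        return normalize(mean)

    def classify(self, embedding):
        """Return the most similar command and its cosine score."""
        query = normalize(embedding)
        scores = {
            label: cosine(query, self.prototype(label)) for label in self.shots
        }
        best = max(scores, key=scores.get)
        return best, scores[best]


# Toy vectors stand in for the output of a pretrained lipreading encoder.
commands = FewShotCommands()
commands.add_shot("play music", [0.9, 0.1, 0.0])
commands.add_shot("take a photo", [0.0, 0.2, 0.9])
label, score = commands.classify([0.8, 0.3, 0.1])
print(label)
```

In this sketch, adding another example with `add_shot` only updates the stored prototype, which loosely mirrors the idea of improving from more data without retraining the encoder. Any on-device fine-tuning of the encoder itself is outside what this example shows.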