Silent speech interfaces are a promising technology that enables private communication in natural language. However, previous approaches support only a small and inflexible vocabulary, which limits expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable using only one shot, and performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
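To make the idea of few-shot command customization concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a contrastively pre-trained lipreading encoder (not shown here) maps each silent utterance to a fixed-length embedding vector, and it classifies new utterances by cosine similarity to per-command prototypes. The class name `PrototypeCommandClassifier` and its methods are hypothetical, and the nearest-prototype rule is a stand-in technique chosen for illustration; the abstract does not specify the classifier the system actually uses.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class PrototypeCommandClassifier:
    """Hypothetical nearest-prototype classifier over fixed embeddings.

    Assumption: embeddings come from a pre-trained encoder that is
    outside the scope of this sketch.
    """

    def __init__(self):
        # Running sum of unit-length embeddings for each command label.
        self._sums = {}

    def add_shot(self, command, embedding):
        """Register one example; one call per command gives one-shot use."""
        unit = l2_normalize(np.asarray(embedding, dtype=np.float64))
        if command in self._sums:
            # Further shots refine the prototype incrementally.
            self._sums[command] = self._sums[command] + unit
        else:
            self._sums[command] = unit

    def predict(self, embedding):
        """Return (command, cosine similarity) for the closest prototype."""
        if not self._sums:
            raise ValueError("no commands have been registered")
        query = l2_normalize(np.asarray(embedding, dtype=np.float64))
        commands = list(self._sums)
        prototypes = l2_normalize(np.stack([self._sums[c] for c in commands]))
        scores = prototypes @ query
        best = int(np.argmax(scores))
        return commands[best], float(scores[best])
```

In this sketch, a user would register a new command with a single `add_shot` call and could later call it again with additional utterances of the same command, which loosely mirrors learning from more data over time. A similarity threshold on the returned score could be used to reject utterances that match no registered command, though the threshold value would need to be tuned on real data.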