Frustrating text entry interfaces have been a major obstacle to participating in social activities in augmented reality (AR). Popular options, such as mid-air keyboard interfaces, wireless keyboards, or voice input, suffer from poor ergonomic design or limited accuracy, or are simply embarrassing to use in public. This paper proposes and validates a deep-learning-based approach that enables AR applications to accurately predict keystrokes from the user-perspective RGB video stream that any AR headset can capture. This allows a user to type on any flat surface and eliminates the need for a physical or virtual keyboard. A two-stage model, combining an off-the-shelf hand landmark extractor and a novel adaptive Convolutional Recurrent Neural Network (C-RNN), was trained on our newly built dataset. The final model was capable of adaptively processing user-perspective video streams at ~32 FPS. This base model achieved an overall accuracy of $91.05\%$ at a typing speed of 40 words per minute (wpm), which is how fast an average person types with two hands on a physical keyboard. The Normalised Levenshtein Distance further confirmed the real-world applicability of our approach. These promising results highlight the viability of our approach and its potential to be integrated into various applications. We also discuss the limitations and the future research required to bring such a technique into a production system.
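As an illustration of the Normalised Levenshtein Distance mentioned above, the following is a minimal sketch of one common formulation: the edit distance between the predicted and reference text, divided by the length of the longer string. The abstract does not state which normalisation the authors use, so the division by the maximum length, as well as the function names `levenshtein` and `normalised_levenshtein`, are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


def normalised_levenshtein(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length.

    Assumption: normalisation by max length; other conventions exist.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, reference) / longest
```

Under this formulation, a score of 0 means the predicted keystroke sequence matches the reference exactly and a score of 1 means the two share nothing, so a single wrong character in a long sentence is penalised far less than it would be by an exact-match comparison of whole sentences.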