Frustrating text entry interfaces have been a major obstacle to participating in social activities in augmented reality (AR). Popular options, such as mid-air keyboard interfaces, wireless keyboards, or voice input, suffer from poor ergonomic design or limited accuracy, or are simply embarrassing to use in public. This paper proposes and validates a deep-learning-based approach that enables AR applications to accurately predict keystrokes from the user-perspective RGB video stream that any AR headset can capture. This allows a user to type on any flat surface and eliminates the need for a physical or virtual keyboard. A two-stage model, combining an off-the-shelf hand landmark extractor and a novel adaptive Convolutional Recurrent Neural Network (C-RNN), was trained using our newly built dataset. The final model was capable of adaptively processing user-perspective video streams at ~32 FPS. This base model achieved an overall accuracy of $91.05\%$ when typing at 40 words per minute (WPM), which is how fast an average person types with two hands on a physical keyboard. The Normalised Levenshtein Distance further confirmed the real-world applicability of our approach. These promising results highlight the viability of our approach and its potential to be integrated into various applications. We also discuss the limitations and the future research required to bring such a technique into a production system.
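To make the Normalised Levenshtein Distance mentioned above concrete, the following is a minimal sketch of one common way to compute it. The abstract does not state which normalisation is used, so dividing the edit distance by the length of the longer string is an assumption here, and the function names `levenshtein` and `normalised_levenshtein` are illustrative rather than taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string needs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalised_levenshtein(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length.

    ASSUMPTION: this normalisation is one common convention; the paper may
    define the metric differently.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, reference) / longest
```

Under this convention a value of 0 means the predicted text matches the reference exactly, and values closer to 1 mean that most characters would need to be edited.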