This project developed, evaluated, and integrated a Voice User Interface (VUI) into a web application we are developing for immersive molecular graphics. The application provides augmented and virtual reality (AR and VR) environments in which users manipulate molecules with their hands, which leaves the hands unavailable for controlling the app through a conventional mouse- and keyboard-based GUI. The speech-based VUI developed here addresses this problem by letting users control the application through natural spoken (or typed) commands. To build the VUI we evaluated two Automated Speech Recognition (ASR) systems: Chrome's native Speech API and OpenAI's Whisper v3. Although Whisper offered broader browser compatibility, its tendency to "hallucinate" on specialized scientific jargon proved problematic, so we selected Chrome's ASR for its stability, speed, and reliability. For translating transcribed speech into software commands, we tested two Large Language Model (LLM)-driven approaches: generating executable code, or calling predefined functions. We adopted the function-calling method, powered by OpenAI's GPT-4o-mini, for its superior safety, efficiency, and reliability over the more complex and error-prone code-generation approach. The resulting VUI integrates Chrome's ASR with our LLM-based function-calling module, enabling users to command the application in natural language, as demonstrated in a video linked in this report. We also provide links to live examples of all intermediate components, together with details of how we crafted the LLM's prompt to teach it the available function calls, to clean up the transcribed speech, and to have the model explain itself while generating function calls.