The flourishing ecosystem centered around voice personal assistants (VPA), such as Amazon Alexa, has driven a boom in VPA apps. The largest app market, the Amazon skills store, for example, hosts over 200,000 apps. Despite their popularity, the open nature of app release and the easy accessibility of apps raise significant concerns regarding security, privacy, and quality. Consequently, various testing approaches have been proposed to systematically examine VPA app behaviors. To tackle the inherent lack of a visible user interface in VPA apps, two strategies are employed during testing: chatbot-style testing and model-based testing. The former often lacks effective guidance for expanding its search space, while the latter falls short in interpreting the semantics of conversations to construct precise and comprehensive behavior models of apps. In this work, we introduce Elevate, a model-enhanced large language model (LLM)-driven VUI testing framework. Elevate leverages LLMs' strong capability in natural language processing to compensate for the semantic information loss incurred during model-based VUI testing. It operates by prompting LLMs to extract states from VPA apps' outputs and to generate context-related inputs. During automatic interaction with the app, it incrementally constructs the behavior model, which in turn helps the LLM generate inputs that are highly likely to discover new states. Elevate bridges the LLM and the behavior model with novel techniques such as encoding the behavior model into prompts and selecting LLM-generated inputs based on context relevance. Elevate is benchmarked on 4,000 real-world Alexa skills against the state-of-the-art tester Vitas. It achieves 15% higher state space coverage than Vitas across all types of apps and exhibits significantly higher efficiency.