Large language models (LLMs) trained on corpora that include a large amount of code exhibit a remarkable ability to understand HTML. As web interfaces are primarily constructed using HTML, we design an in-depth study of how LLMs can be used to retrieve and locate important elements in a web interface for a user-given query (i.e., a task description). In contrast with prior work, which primarily focused on autonomous web navigation, we decompose the problem into a more atomic operation: can LLMs identify the important information in a web page for a user-given query? This decomposition enables us to scrutinize the current capabilities of LLMs and uncover the opportunities and challenges they present. Our empirical experiments show that while LLMs exhibit a reasonable level of performance in retrieving important UI elements, there is still substantial room for improvement. We hope our investigation will inspire follow-up work on overcoming the current challenges in this domain.