App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration for designing and improving their own apps. In recent years, research has suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g. features whose GUI pages require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which typically display the most representative screenshots, selected and often captioned (i.e. labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. The result is a large dataset, which we share with this paper, of 303k app screenshots, 135k of which have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip on other GUI tasks, including GUI classification and Sketch-to-GUI retrieval, with encouraging results.
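The Recall@10 metric reported above can be illustrated with a minimal sketch: given text and screenshot embeddings from a vision-language model (here stand-ins produced by any CLIP-style encoder; the embeddings and the `recall_at_k` helper are illustrative, not the paper's implementation), each query ranks all screenshots by cosine similarity, and the metric is the fraction of queries whose ground-truth screenshot lands in the top k.

```python
import numpy as np

def recall_at_k(text_emb, gui_emb, ground_truth, k=10):
    """Fraction of text queries whose ground-truth screenshot
    appears among the k most cosine-similar GUI embeddings."""
    # L2-normalise so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = gui_emb / np.linalg.norm(gui_emb, axis=1, keepdims=True)
    sims = t @ g.T                          # (num_queries, num_screenshots)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best matches
    hits = [ground_truth[i] in top_k[i] for i in range(len(ground_truth))]
    return float(np.mean(hits))

# Toy example: 5 screenshots, 3 queries; query i targets screenshot i,
# and each query embedding is a noisy copy of its target.
rng = np.random.default_rng(0)
gui = rng.normal(size=(5, 8))
text = gui[:3] + 0.1 * rng.normal(size=(3, 8))
print(recall_at_k(text, gui, ground_truth=[0, 1, 2], k=1))
```

In a real pipeline the two embedding matrices would come from the image and text towers of the retrieval model; the metric itself is model-agnostic.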