Modeling user interfaces (UIs) from visual information allows systems to infer the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models of UIs are limited in size because manually collecting and annotating UIs is costly and time-consuming. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages, each associated with automatically extracted metadata. We analyze the composition of WebUI and show that, while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We then apply several strategies for incorporating semantics found in web pages to improve the performance of visual UI understanding models in the mobile domain, where less labeled data is available, on three tasks: (i) element detection, (ii) screen classification, and (iii) screen similarity.