Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and, more insidiously, to semantic drift, where the target still returns HTTP 200 but its content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect such semantic inconsistencies, thereby compromising web integrity and user experience. While Large Language Models (LLMs) offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink's source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.
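To make the oracle idea concrete, the following is a minimal sketch of a Siamese-style check: one shared encoder embeds both the source context and the target content, and the link passes if the cosine similarity of the two embeddings reaches a threshold. It is not SemLink's implementation. The encoder `toy_encode` is a hypothetical bag-of-words stand-in for the SBERT backbone, the names `cosine` and `link_is_coherent` and the threshold of 0.3 are assumptions chosen for illustration, and DOM and visual features are omitted.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Hypothetical stand-in encoder: lowercase bag-of-words counts.

    A real system would replace this with a sentence-embedding model
    such as SBERT; this version only captures word overlap.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(tokens))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_is_coherent(
    source_context: str,
    target_content: str,
    encode: Callable[[str], Vector] = toy_encode,
    threshold: float = 0.3,  # illustrative value, not taken from the paper
) -> bool:
    """Siamese-style oracle: the SAME encoder embeds both inputs,
    and the verdict is a thresholded similarity score."""
    score = cosine(encode(source_context), encode(target_content))
    return score >= threshold
```

Passing the encoder as a parameter keeps the comparison logic independent of the embedding model, so a learned sentence encoder could be swapped in without changing the oracle's pass/fail interface.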