Ensuring web accessibility at scale remains challenging: rule-based tools provide limited coverage, while manual remediation is costly and error-prone. This paper evaluates large language model (LLM) based agents, specifically Kimi K2.5, for automated accessibility detection and repair, and compares them with rule-based approaches. For detection, the LLM performs comparably to rule-based tools overall (F1 around 0.65) and shows strong semantic understanding (F1 of 0.83), but it is less reliable for syntactic and layout-related violations. For remediation, LLM-generated fixes are syntactically valid in over 99.7 percent of cases and improve accessibility compliance in 80.2 percent of instances, reducing the average number of violations from 3.98 to 1.7 per file. However, fewer than 26 percent of cases are fully resolved, and about 30 percent of patches introduce structural changes. We also find that iterative agent-based refinement increases computational cost by 52 percent and API usage by a factor of 1.64 without improving remediation outcomes. These findings indicate that LLMs are effective for partial accessibility repair but insufficient for complete and reliable remediation. Scalable accessibility solutions therefore require hybrid approaches that combine LLM capabilities with rule-based validation and constraint-aware correction mechanisms.
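To make the hybrid idea concrete, the following is a minimal sketch, not the pipeline evaluated in the paper. It assumes a toy rule-based checker with only two rules (an `img` element without `alt`, and an `html` element without `lang`) and a hypothetical gate, `accept_patch`, that rejects any LLM-generated patch that changes the tag sequence or fails to reduce the violation count. The names `audit` and `accept_patch`, the two rules, and the acceptance policy are all illustrative assumptions.

```python
from html.parser import HTMLParser


class _Audit(HTMLParser):
    """Records the opening-tag sequence and counts two toy rule violations."""

    def __init__(self):
        super().__init__()
        self.tags = []        # opening tags in document order
        self.violations = 0   # number of rule hits (assumed rules, see below)

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        names = {name for name, _ in attrs}
        if tag == "img" and "alt" not in names:
            self.violations += 1   # image without a text alternative
        if tag == "html" and "lang" not in names:
            self.violations += 1   # page language not declared


def audit(html):
    """Return (violation_count, tag_sequence) for an HTML string."""
    parser = _Audit()
    parser.feed(html)
    parser.close()
    return parser.violations, parser.tags


def accept_patch(original, patched):
    """Hypothetical gate: validate an LLM patch with the rule-based checker."""
    before, tags_before = audit(original)
    after, tags_after = audit(patched)
    if tags_after != tags_before:
        # Constraint: a repair may change attributes but not the structure.
        return "reject: structural change"
    if after >= before:
        return "reject: no improvement"
    return "accept: fully resolved" if after == 0 else "accept: partial"


if __name__ == "__main__":
    page = '<html><body><img src="a.png"></body></html>'
    fix = '<html lang="en"><body><img src="a.png" alt="Logo"></body></html>'
    print(accept_patch(page, fix))
```

Separating "partial" from "fully resolved" mirrors the distinction the abstract draws between improved compliance and complete remediation, and the structural check corresponds to the concern about patches that alter document structure. A realistic validator would use a full accessibility rule set rather than these two checks.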