HLL: Can Agents Cross Humanity's Last Line of Verification?

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL.
