CAPTCHAs are widely used by websites to block bots and spam by presenting challenges that are easy for humans but difficult for automated programs to solve. To improve accessibility, audio CAPTCHAs are designed to complement visual ones. However, the robustness of audio CAPTCHAs against advanced Large Audio Language Models (LALMs) and Automatic Speech Recognition (ASR) models remains unclear. In this paper, we introduce AI-CAPTCHA, a unified framework comprising (i) an evaluation framework, ACEval, which includes advanced LALM- and ASR-based solvers, and (ii) a novel audio CAPTCHA approach, IllusionAudio, which exploits perceptual illusion cues rooted in human auditory mechanisms. Through extensive evaluations of seven widely deployed audio CAPTCHAs, we show that most existing schemes can be solved with high success rates by advanced LALMs and ASR models, exposing critical security weaknesses. IllusionAudio is designed to address these vulnerabilities: in our experiments, it defeats all tested LALM- and ASR-based attacks while achieving a 100% human pass rate, significantly outperforming existing audio CAPTCHA methods.