Automated bots now account for roughly half of all web requests, and a growing number deliberately spoof their identity, either to evade detection or to ignore robots.txt. Existing countermeasures are resource-intensive (JavaScript challenges, CAPTCHAs), cost-prohibitive (commercial solutions), or detrimental to the user experience. This paper proposes a lightweight, passive approach to bot detection that combines user-agent string analysis with favicon-based heuristics and operates entirely on standard web server logs, with no client-side interaction. We evaluate the method on over 4.6 million requests containing 54,945 unique user-agent strings, collected from websites hosted around the world. Our approach detects 67.7% of bot traffic while maintaining a false-positive rate of 3%, outperforming the state of the art (less than 20%). The method can serve as a first line of defence, routing only genuinely ambiguous requests to active challenges and preserving the experience of legitimate users.
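To make the idea of passive, log-only detection concrete, the following is a minimal sketch and not the method evaluated in this paper. It assumes logs in the common "combined" format and assumes that the favicon heuristic amounts to checking whether a client presenting itself as a browser ever requests a favicon. The token list `BOT_TOKENS`, the function name `classify_clients`, and the three verdict labels are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Combined log format (assumed):
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical substrings that mark a self-declared bot or script client.
BOT_TOKENS = ("bot", "crawler", "spider", "curl", "python-requests")


def classify_clients(lines):
    """Return a verdict per (ip, user-agent) pair from raw log lines.

    Verdicts (illustrative labels):
      "bot"          - user-agent is empty or contains a bot token
      "likely_human" - client requested a favicon at least once
      "suspect"      - no bot token, but no favicon request either
    """
    seen_favicon = defaultdict(bool)
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        key = (match["ip"], match["ua"])
        seen_favicon[key] |= match["path"].startswith("/favicon")

    verdicts = {}
    for (ip, ua), has_favicon in seen_favicon.items():
        ua_lower = ua.lower()
        if not ua or any(token in ua_lower for token in BOT_TOKENS):
            verdicts[(ip, ua)] = "bot"
        elif has_favicon:
            verdicts[(ip, ua)] = "likely_human"
        else:
            verdicts[(ip, ua)] = "suspect"
    return verdicts
```

In a deployment along the lines described above, only clients in the ambiguous "suspect" group would be routed to an active challenge; the other two groups would be handled without any client-side interaction.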