Automated traffic continues to surpass human-generated traffic on the web, and a rising proportion of it is explicitly malicious. Evasive bots can pose as real users, solving CAPTCHAs and mimicking human interaction patterns. This work explores a less intrusive, protocol-level method: using TLS fingerprinting with the JA4 technique to distinguish bots from real users. Two gradient-boosted machine learning classifiers (XGBoost and CatBoost) were trained and evaluated on a dataset of real TLS fingerprints (JA4DB), using features extracted from the JA4 fingerprints, which describe TLS handshake parameters. The CatBoost model performed better, achieving an AUC of 0.998 and an F1 score of 0.9734, with a test-set accuracy of 0.9863. The XGBoost model performed comparably. Feature importance analyses identified JA4-derived features, especially ja4_b, cipher_count, and ext_count, as the most influential on model performance. Future research will extend this method to newer protocols, such as HTTP/3, and add device-fingerprinting features to test how well the system resists advanced bot evasion tactics.
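The abstract does not say how the features were extracted, so the following is only a minimal sketch of one plausible approach. It assumes the publicly documented JA4 layout `a_b_c`, in which the ten-character `a` section encodes the protocol, TLS version, SNI flag, two-digit cipher count, two-digit extension count, and ALPN tag, while `b` and `c` are truncated hashes of the cipher list and of the extensions plus signature algorithms. The names `Ja4Features` and `parse_ja4` are hypothetical; only `ja4_b`, `cipher_count`, and `ext_count` come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ja4Features:
    """Fields split out of one JA4 string (hypothetical container)."""
    protocol: str      # "t" for TLS over TCP, "q" for QUIC
    tls_version: str   # e.g. "13" for TLS 1.3
    sni: str           # "d" if SNI is present, "i" otherwise
    cipher_count: int  # number of offered cipher suites
    ext_count: int     # number of offered extensions
    alpn: str          # first and last character of the first ALPN value
    ja4_b: str         # truncated hash of the sorted cipher list
    ja4_c: str         # truncated hash of extensions and signature algorithms


def parse_ja4(fingerprint: str) -> Ja4Features:
    """Split a JA4 fingerprint into fields a classifier could consume."""
    parts = fingerprint.strip().split("_")
    if len(parts) != 3 or len(parts[0]) != 10:
        raise ValueError(f"not a JA4 fingerprint: {fingerprint!r}")
    ja4_a, ja4_b, ja4_c = parts
    return Ja4Features(
        protocol=ja4_a[0],
        tls_version=ja4_a[1:3],
        sni=ja4_a[3],
        cipher_count=int(ja4_a[4:6]),  # raises ValueError if not numeric
        ext_count=int(ja4_a[6:8]),
        alpn=ja4_a[8:10],
        ja4_b=ja4_b,
        ja4_c=ja4_c,
    )


features = parse_ja4("t13d1516h2_8daaf6152771_e5627efa2ab1")
print(features.cipher_count, features.ext_count, features.ja4_b)
```

In a pipeline like the one described, the numeric fields could be passed to the model directly, while `ja4_b` and `ja4_c` would be treated as categorical features. Whether the authors did this is not stated here.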