Machine learning is widely used for malware detection in practice. Prior behavior-based detectors most commonly rely on traces of programs executed in controlled sandboxes. However, sandbox traces are unavailable to the last line of defense offered by security vendors: malware detection at endpoints. A detector at endpoints consumes the traces of programs running on real-world hosts, as sandbox analysis might introduce intolerable delays. Despite the success of ML methods in sandboxes, research hints at potential challenges for them at endpoints, e.g., highly variable malware behaviors. Nonetheless, the impact of these challenges on existing approaches, and how their excellent sandbox performance translates to the endpoint scenario, remain unquantified. We present the first measurement study of the performance of ML-based malware detectors at real-world endpoints. Leveraging a dataset of sandbox traces and a dataset of in-the-wild program traces, we evaluate two scenarios in which the endpoint detector is trained on (i) sandbox traces (convenient and accessible) and (ii) endpoint traces (less accessible because they require collecting telemetry data). This allows us to identify a wide gap between prior methods' sandbox-based detection performance (over 90%) and their endpoint performance (below 20% and 50% in scenarios (i) and (ii), respectively). We pinpoint and characterize the challenges contributing to this gap, such as label noise, behavior variability, and sandbox evasion. To close this gap, we propose techniques that yield a relative improvement of 5-30% over the baselines. Our evidence suggests that applying detectors trained on sandbox data to endpoint detection, as in scenario (i), is challenging. The most promising direction is training detectors on endpoint data, as in scenario (ii), which marks a departure from widespread practice. To promote research, we implement a leaderboard for realistic detector evaluations.
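Because the 5-30% figure is a relative improvement over the baselines, not an absolute gain in percentage points, a small worked example may help. The sketch below is only illustrative: the function names are hypothetical, and the baseline score of 0.20 is an assumed example value, not a result reported in the study.

```python
def apply_relative_improvement(baseline: float, relative_gain: float) -> float:
    """Score obtained after a relative (multiplicative) gain over a baseline."""
    return baseline * (1.0 + relative_gain)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Assumed, illustrative baseline score (not taken from the study).
    baseline = 0.20
    for gain in (0.05, 0.30):
        improved = apply_relative_improvement(baseline, gain)
        print(f"relative gain {gain:.0%}: {baseline:.2f} -> {improved:.3f}")
```

Under these assumed numbers, a 30% relative improvement moves a score of 0.20 to roughly 0.26, i.e., about six percentage points rather than thirty.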