Millions of vulnerable consumer IoT devices in home networks enable cyber crimes that put user privacy and Internet security at risk. Internet service providers (ISPs) are well positioned to mitigate these risks by automatically inferring the active IoT devices in each household and notifying users of vulnerable ones. Developing a scalable inference method that performs robustly across thousands of home networks is a non-trivial task. This paper focuses on the challenges of developing and applying data-driven inference models when labeled data on device behavior is limited and the data distribution changes (concept drift) across time and space. Our contributions are three-fold: (1) we collect and analyze network traffic of 24 types of consumer IoT devices from 12 real homes over six weeks to highlight the challenge of temporal and spatial concept drift in the network behavior of IoT devices; (2) we analyze the performance of two inference strategies in the presence of concept drift, namely "global inference" (a single model trained on the combined labeled data from all training homes) and "contextualized inference" (several models, each trained on the labeled data from one training home); and (3) to manage concept drift, we develop a method that, during the testing phase, dynamically applies the "closest" model from a set of models to the network traffic of unseen homes, yielding better performance in 20% of scenarios.
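The idea of applying the "closest" model to an unseen home can be illustrated with a minimal sketch. The abstract does not say how closeness is measured or which classifier is used, so everything here is an assumption made for illustration: closeness is taken to be the Euclidean distance between the mean feature vector of the unseen home's traffic and the mean feature vector of each training home, each per-home model is a simple nearest-centroid classifier, and the names `HomeModel`, `train_home_model`, and `select_closest_model`, as well as the two-dimensional toy features, are hypothetical.

```python
from dataclasses import dataclass
from math import dist
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


@dataclass
class HomeModel:
    """Hypothetical per-home ("contextualized") model: nearest-centroid classifier."""
    home_id: str
    class_centroids: dict  # device label -> centroid of that device's vectors
    home_centroid: list    # centroid of all training vectors from this home

    def predict(self, x):
        # Return the device label whose centroid is nearest to x.
        return min(self.class_centroids,
                   key=lambda label: dist(x, self.class_centroids[label]))


def train_home_model(home_id, samples):
    """samples: list of (feature_vector, device_label) pairs from one home."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return HomeModel(
        home_id=home_id,
        class_centroids={label: centroid(rows) for label, rows in by_label.items()},
        home_centroid=centroid([x for x, _ in samples]),
    )


def select_closest_model(models, unseen_rows):
    """Assumed closeness rule: smallest distance between the unseen home's
    mean feature vector and each training home's mean feature vector."""
    target = centroid(unseen_rows)
    return min(models, key=lambda m: dist(target, m.home_centroid))


# Toy data: the same two device types behave differently in two homes
# (a stand-in for spatial concept drift).
home_a = [([1.0, 1.0], "camera"), ([1.2, 0.8], "camera"),
          ([3.0, 3.0], "plug"), ([3.2, 2.8], "plug")]
home_b = [([11.0, 11.0], "camera"), ([11.2, 10.8], "camera"),
          ([13.0, 13.0], "plug"), ([13.2, 12.8], "plug")]
models = [train_home_model("A", home_a), train_home_model("B", home_b)]

# Unlabeled traffic features from an unseen home that resembles home B.
unseen = [[10.9, 11.1], [13.1, 12.9]]
chosen = select_closest_model(models, unseen)
predictions = [chosen.predict(x) for x in unseen]
```

In this toy setup the unseen home's traffic lies near home B's, so the model trained on home B is selected. The model trained on home A would label both flows with the same device type, which shows in miniature why a model from a mismatched home can perform poorly under spatial drift.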