Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing features of encrypted wireless video traffic, such as packet sizes, from cameras whose streams are inaccessible. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, that fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object's position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves a tracking error of 2.33 m (Euclidean distance) without access to the raw signal, within the footprint of the tracked objects (4.61 m × 1.93 m). To our knowledge, this capability has not been demonstrated before; it expands the use of latent signals for sensing.
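To make the two-stage design concrete, the sketch below shows one way such a pipeline could be structured in PyTorch. It is a minimal illustration under our own assumptions, not GraySense's implementation: the gap-based frame-boundary heuristic, all module and parameter names (`group_packets`, `RecurrentTransformerTracker`, `d_model`, `gap`), and the layer sizes are hypothetical stand-ins for the learned components the abstract describes.

```python
# Illustrative sketch only: the gap-based grouping heuristic and all names,
# sizes, and hyperparameters here are hypothetical, not the paper's method.
import torch
import torch.nn as nn

def group_packets(times, sizes, gap=0.01):
    """Stage 1 stand-in: split an encrypted packet trace into video frames
    at inter-packet gaps longer than `gap` seconds, summing sizes per frame."""
    frames, cur = [], 0
    for i, s in enumerate(sizes):
        if i > 0 and times[i] - times[i - 1] > gap and cur:
            frames.append(cur)
            cur = 0
        cur += s
    if cur:
        frames.append(cur)
    return frames

class RecurrentTransformerTracker(nn.Module):
    """Stage 2 stand-in: a Transformer encoder over a window of per-frame
    features, with a GRU carrying recurrent state across windows."""
    def __init__(self, in_dim=1, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 2)  # (x, y) position estimate

    def forward(self, feats, state=None):
        # feats: (batch, window, in_dim) — estimated frame sizes, optionally
        # concatenated with direct camera-based features when available.
        z = self.encoder(self.embed(feats))
        z, state = self.gru(z, state)       # recurrent state across windows
        return self.head(z[:, -1]), state   # position for the newest frame

# Usage: estimate frame sizes from a packet trace, then track per window.
frame_sizes = group_packets(times=[0.00, 0.002, 0.03, 0.031],
                            sizes=[1400, 900, 1400, 300])
window = torch.tensor(frame_sizes, dtype=torch.float32).view(1, -1, 1)
tracker = RecurrentTransformerTracker()
position, state = tracker(window)  # (1, 2) estimated (x, y)
```

Carrying the GRU state forward between consecutive windows is what lets a tracker of this shape maintain a motion estimate over long traffic traces while the Transformer attends only to a short local window.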