Less is More: Optimizing Probe Selection Using Shared Latency Anomalies

Latency anomalies, defined as persistent or transient increases in round-trip time (RTT), are common in residential Internet performance. When multiple users observe anomalies to the same destination, this may reflect shared infrastructure, routing behavior, or congestion. Inferring such shared behavior is challenging because anomaly magnitudes vary widely across devices, even within the same ISP and geographic area, and detailed network topology information is often unavailable. We study whether devices experiencing a shared latency anomaly observe similar changes in RTT magnitude using a topology-agnostic approach. Using four months of high-frequency RTT measurements from 99 residential probes in Chicago, we detect shared anomalies and analyze their consistency in amplitude and duration without relying on traceroutes or explicit path information. Building on prior change-point detection techniques, we find that many shared anomalies exhibit similar amplitude across users, particularly within the same ISP. Motivated by this observation, we design a sampling algorithm that reduces redundancy by selecting representative devices under user-defined constraints. Our approach captures 95 percent of aggregate anomaly impact using fewer than half of the deployed probes. Compared to two baselines, it identifies significantly more unique anomalies at comparable coverage levels. We further show that geographic diversity remains important when selecting probes within a single ISP, even at city scale. Overall, our results demonstrate that anomaly amplitude and duration provide effective topology-independent signals for scalable monitoring, troubleshooting, and cost-efficient sampling in residential Internet measurement.
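As an illustration of the redundancy-reduction idea only, the sketch below greedily picks probes until a target share of aggregate anomaly impact is covered. It is not the authors' algorithm: the abstract does not specify the selection procedure, so the greedy weighted-coverage rule, the function name `select_probes`, the `observed` and `impact` inputs, and the `max_probes` constraint are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Set


def select_probes(
    observed: Dict[str, Set[str]],   # hypothetical: probe id -> ids of shared anomalies it saw
    impact: Dict[str, float],        # hypothetical: anomaly id -> impact score (e.g. amplitude x duration)
    target: float = 0.95,            # share of total impact to cover
    max_probes: Optional[int] = None,  # stand-in for a user-defined budget constraint
) -> List[str]:
    """Greedy weighted coverage: repeatedly add the probe with the largest uncovered impact."""
    total = sum(impact.values())
    covered: Set[str] = set()
    chosen: List[str] = []

    def gain(probe: str) -> float:
        # Impact of anomalies this probe would newly cover.
        return sum(impact.get(a, 0.0) for a in observed[probe] - covered)

    while total > 0 and sum(impact.get(a, 0.0) for a in covered) < target * total:
        if max_probes is not None and len(chosen) >= max_probes:
            break
        # Sorted iteration makes tie-breaking deterministic (first id wins).
        candidates = [p for p in sorted(observed) if p not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break  # remaining probes add nothing new
        chosen.append(best)
        covered |= observed[best]
    return chosen


# Toy data: p4 sees exactly what p1 sees, so it is redundant.
observed = {"p1": {"a1", "a2"}, "p2": {"a2"}, "p3": {"a3"}, "p4": {"a1", "a2"}}
impact = {"a1": 50.0, "a2": 30.0, "a3": 20.0}
selected = select_probes(observed, impact, target=0.95)
print(selected)
```

In this toy input, two of the four probes suffice to cover all anomaly impact, which mirrors (but does not reproduce) the kind of redundancy the abstract reports.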
