While the increasing number of Vantage Points (VPs) in RIPE RIS and RouteViews improves our understanding of the Internet, the quadratically increasing volume of collected data poses a challenge to the scientific and operational use of the data. The design and implementation of BGP and BGP data collection systems lead to data archives with enormous redundancy, as there is substantial overlap in announced routes across many different VPs. Researchers thus often resort to arbitrary sampling of the data, which we demonstrate comes at a cost to the accuracy and coverage of previous works. The continued growth of the Internet, and of these collection systems, exacerbates this cost. The community needs a better approach to managing and using these data archives. We propose MVP, a system that scores VPs according to their level of redundancy with other VPs, allowing more informed sampling of these data archives. Our challenge is that the degree of redundancy between two updates depends on how we define redundancy, which in turn depends on the analysis objective. Our key contribution is a general framework and associated algorithms to assess redundancy between VP observations. We quantify the benefit of our approach for four canonical BGP routing analyses: AS relationship inference, AS rank computation, hijack detection, and routing detour detection. MVP improves the coverage or accuracy (or both) of all these analyses while processing the same volume of data.
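To make the idea of redundancy-aware sampling concrete, the following is a minimal illustrative sketch and not the MVP algorithm itself. It assumes one possible definition of redundancy: the Jaccard similarity between the sets of (prefix, AS path) pairs that two VPs observe. As the abstract notes, the appropriate definition depends on the analysis objective. The function names, the toy `observations` data, and the averaging scheme are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def redundancy_scores(observations):
    """Score each VP by its mean pairwise Jaccard similarity with all other VPs.

    `observations` maps a VP name to the set of (prefix, AS path) pairs it saw.
    A higher score means the VP is more redundant under this assumed definition.
    """
    vps = sorted(observations)
    totals = {vp: 0.0 for vp in vps}
    for u, v in combinations(vps, 2):
        s = jaccard(observations[u], observations[v])
        totals[u] += s
        totals[v] += s
    n_others = max(len(vps) - 1, 1)  # avoid division by zero for a single VP
    return {vp: totals[vp] / n_others for vp in vps}


def sample_least_redundant(observations, k):
    """Return the k VPs with the lowest redundancy score (ties broken by name)."""
    scores = redundancy_scores(observations)
    return sorted(scores, key=lambda vp: (scores[vp], vp))[:k]


# Toy data (hypothetical): vp_a and vp_b see identical routes, vp_c differs.
observations = {
    "vp_a": {("10.0.0.0/8", (1, 2, 3)), ("192.0.2.0/24", (1, 4))},
    "vp_b": {("10.0.0.0/8", (1, 2, 3)), ("192.0.2.0/24", (1, 4))},
    "vp_c": {("198.51.100.0/24", (5, 6)), ("192.0.2.0/24", (1, 4))},
}

selected = sample_least_redundant(observations, 2)
```

With this toy input, `vp_c` is selected first because it overlaps least with the others, and only one of the two identical VPs is kept. A different analysis objective (for example, hijack detection) would likely call for a different similarity function in place of `jaccard`.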