Analyzing Disparity and Temporal Progression of Internet Quality through Crowdsourced Measurements with Bias-Correction

Crowdsourced speedtest measurements are an important tool for studying internet performance from the end-user perspective. However, even though individual measurements are accurate, naively aggregating these data points is problematic because of their intrinsic sampling bias. In this work, we use a dataset of nearly 1 million individual Ookla Speedtest measurements, link each data point to 2019 Census demographic data, and develop new methods to quantify regional sampling bias and the relationship between internet performance and demographic profile. Using a statistical test of homogeneity, we find that the crowdsourced Ookla Speedtest data points contain significant sampling bias across census block groups. We introduce two methods that correct this regional bias using the population of each census block group. The sampling bias leads to only a small discrepancy between the original-sample estimate and the bias-corrected estimate of a city's overall cumulative distribution function of internet speed; this discrepancy is much smaller than the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution. Through regression analysis, we find that regions with higher income, younger populations, and lower representation of Hispanic residents tend to measure faster internet speeds, and that socioeconomic attributes and ethnic composition are substantially collinear. Finally, linear and nonlinear analyses from state space models both show that average internet speed increases over time, although the regional sampling bias may lead to a small overestimation of this increase.
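To make the idea of a homogeneity test and a population-based correction concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the test statistic or either of the two correction methods, so the sketch assumes a Pearson chi-square statistic against population-proportional expected counts and a simple post-stratification weight. The function names, region labels, and toy numbers are all hypothetical.

```python
from collections import Counter


def chi_square_homogeneity(sample_counts, populations):
    """Pearson chi-square statistic comparing observed per-region sample
    counts with the counts expected if sampling were proportional to
    population. Returns (statistic, degrees_of_freedom)."""
    n = sum(sample_counts.values())
    total_pop = sum(populations.values())
    stat = 0.0
    for region, pop in populations.items():
        expected = n * pop / total_pop
        observed = sample_counts.get(region, 0)
        stat += (observed - expected) ** 2 / expected
    return stat, len(populations) - 1


def raw_ecdf(measurements, threshold):
    """Unweighted estimate of P(speed <= threshold)."""
    hits = sum(1 for _, speed in measurements if speed <= threshold)
    return hits / len(measurements)


def weighted_ecdf(measurements, populations, threshold):
    """Population-weighted estimate of P(speed <= threshold).

    Each measurement in region r receives weight
    pop_r / (sampled_pop * n_r), so every sampled region contributes in
    proportion to its population rather than to its number of tests."""
    counts = Counter(region for region, _ in measurements)
    sampled_pop = sum(populations[r] for r in counts)
    total = 0.0
    for region, speed in measurements:
        if speed <= threshold:
            total += populations[region] / (sampled_pop * counts[region])
    return total


# Hypothetical toy data: two equally populated regions, but region "A"
# is over-represented in the measurements (speeds in Mbps).
populations = {"A": 1000, "B": 1000}
measurements = [("A", 100.0), ("A", 120.0), ("A", 140.0), ("B", 20.0)]
sample_counts = Counter(region for region, _ in measurements)

stat, dof = chi_square_homogeneity(sample_counts, populations)
raw = raw_ecdf(measurements, 50.0)
corrected = weighted_ecdf(measurements, populations, 50.0)
print(stat, dof, raw, corrected)
```

In this toy case the raw estimate of the share of tests at or below 50 Mbps is 0.25, while the population-weighted estimate is 0.5, because the slower region "B" holds half the population but supplied only a quarter of the tests. A p-value for the statistic would come from a chi-square distribution with the returned degrees of freedom.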
