Crowdsourced speedtest measurements are an important tool for studying internet performance from the end user's perspective. Although individual measurements are accurate, naive aggregation of these data points is problematic because of their intrinsic sampling bias. In this work, we use a dataset of nearly 1 million individual Ookla Speedtest measurements, link each data point to 2019 Census demographic data, and develop new methods to quantify regional sampling bias and the relationship between internet performance and demographic profile. Using a statistical test of homogeneity, we find that the crowdsourced Ookla Speedtest data points exhibit significant sampling bias across census block groups. We introduce two methods that correct this regional bias using the population of each census block group. The sampling bias produces only a small discrepancy between the citywide cumulative distribution function of internet speed estimated from the original samples and the bias-corrected estimate; this discrepancy is much smaller than the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution. Through regression analysis, we find that regions with higher income, younger populations, and lower representation of Hispanic residents tend to measure faster internet speeds, and that socioeconomic attributes and ethnic composition are substantially collinear. Finally, both linear and nonlinear analyses based on state space models show that average internet speed increases over time, though the regional sampling bias may lead to a small overestimation of this increase.
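To make the idea of population-based bias correction concrete, the following is a minimal sketch of one common approach, post-stratification weighting, together with a Pearson chi-square statistic for checking whether sample counts are proportional to population. It is an illustration only and is not necessarily either of the two correction methods or the homogeneity test used in this work. The function names, the toy speeds, and the population figures are all hypothetical, and the sketch assumes every block group contributes at least one measurement.

```python
import numpy as np


def homogeneity_chi2(sample_counts, populations):
    """Pearson chi-square statistic comparing observed per-group sample
    counts with the counts expected if sampling were proportional to
    population. Larger values indicate stronger regional sampling bias."""
    observed = np.asarray(sample_counts, dtype=float)
    populations = np.asarray(populations, dtype=float)
    expected = observed.sum() * populations / populations.sum()
    return float(((observed - expected) ** 2 / expected).sum())


def poststratification_weights(sample_counts, populations):
    """Per-group weight = population share / sample share.
    Assumes every group has at least one sample (no zero counts)."""
    sample_counts = np.asarray(sample_counts, dtype=float)
    populations = np.asarray(populations, dtype=float)
    pop_share = populations / populations.sum()
    sample_share = sample_counts / sample_counts.sum()
    return pop_share / sample_share


def weighted_ecdf(values, weights):
    """Weighted empirical CDF: returns sorted values and the cumulative
    weight fraction at or below each sorted value."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / weights.sum()
    return values[order], cum


# Hypothetical toy data: group 0 is over-sampled relative to group 1.
speeds = np.array([100.0, 120.0, 110.0, 90.0, 130.0, 105.0, 20.0, 30.0])
groups = np.array([0, 0, 0, 0, 0, 0, 1, 1])   # block-group index per test
populations = np.array([1000.0, 1000.0])      # hypothetical populations

counts = np.bincount(groups, minlength=len(populations))
chi2 = homogeneity_chi2(counts, populations)

group_weights = poststratification_weights(counts, populations)
sample_weights = group_weights[groups]         # one weight per measurement

raw_mean = speeds.mean()
corrected_mean = np.average(speeds, weights=sample_weights)
sorted_speeds, corrected_cdf = weighted_ecdf(speeds, sample_weights)

print("chi-square statistic:", chi2)
print("raw mean speed:", raw_mean)
print("population-weighted mean speed:", corrected_mean)
```

In this toy setup the over-sampled group is the faster one, so the raw mean exceeds the weighted mean. That is the direction of distortion one would expect when better-connected regions run more tests. Comparing `corrected_cdf` against the unweighted empirical CDF gives the analogous comparison for the full speed distribution.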