Google Trends reports how frequently specific queries are searched on Google over time. It is widely used in research and industry to gain early insight into public interest. However, its data-generating mechanism introduces missing values, sampling variability, noise, and trends. These issues arise from privacy thresholds that map low search volumes to zero, daily sampling variation that causes discrepancies between downloads of the same historical series, and algorithm updates that alter volume magnitudes over time. Data quality has recently deteriorated, with more zeros and more noise, even for previously stable queries. We propose a comprehensive statistical methodology for preprocessing Google Trends search data using hierarchical clustering, smoothing splines, and detrending. We validate the approach by forecasting U.S. influenza hospitalizations up to three weeks ahead with several statistical and machine learning models. Our results show that, relative to models that omit exogenous variables, preprocessed signals improve forecast accuracy, whereas raw Google Trends data often degrade the performance of statistical models.
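To make the data issues described above concrete, the following is a minimal sketch, not the proposed methodology. It assumes that zeros are privacy-censored values and treats them as missing, averages several downloads of the same query to reduce sampling variability, and removes a least-squares linear trend. The linear fit is a deliberately simple stand-in for the smoothing-spline and detrending steps, and hierarchical clustering is not shown. The function names `combine_downloads` and `linear_detrend` and all numbers are hypothetical.

```python
from statistics import mean


def combine_downloads(downloads):
    """Average several downloads of the same query, period by period.

    Assumption: a zero is a privacy-censored value, so it is treated as
    missing. A period is None only if every download reported zero.
    """
    combined = []
    for values in zip(*downloads):
        observed = [v for v in values if v > 0]
        combined.append(mean(observed) if observed else None)
    return combined


def linear_detrend(series):
    """Subtract an ordinary least-squares line fitted to the non-missing points."""
    points = [(t, y) for t, y in enumerate(series) if y is not None]
    t_bar = mean(t for t, _ in points)
    y_bar = mean(y for _, y in points)
    sxx = sum((t - t_bar) ** 2 for t, _ in points)
    slope = sum((t - t_bar) * (y - y_bar) for t, y in points) / sxx
    intercept = y_bar - slope * t_bar
    # Missing periods stay missing; observed periods become residuals.
    return [
        None if y is None else y - (intercept + slope * t)
        for t, y in enumerate(series)
    ]


# Three hypothetical downloads of one query (made-up numbers).
downloads = [
    [10, 0, 14, 16, 0, 20],
    [12, 13, 0, 16, 0, 22],
    [11, 11, 16, 0, 0, 21],
]
combined = combine_downloads(downloads)
residual = linear_detrend(combined)
```

In this sketch, a period in which every download is zero remains missing rather than being imputed; how such gaps should be filled is a modeling choice that the sketch leaves open.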