Restoring the Forecasting Power of Google Trends with Statistical Preprocessing

Google Trends reports how frequently specific queries are searched on Google over time. It is widely used in research and industry to gain early insight into public interest. However, its data generation mechanism introduces missing values, sampling variability, noise, and trends. These issues arise from privacy thresholds that map low search volumes to zero, daily sampling variation that causes discrepancies across downloads of historical data, and algorithm updates that alter volume magnitudes over time. Data quality has recently deteriorated, with more zeros and more noise, even for previously stable queries. We propose a comprehensive statistical methodology for preprocessing Google Trends search information using hierarchical clustering, smoothing splines, and detrending. We validate our approach by forecasting U.S. influenza hospitalizations up to three weeks ahead with several statistical and machine learning models. Relative to models that omit exogenous variables, preprocessed signals improve forecast accuracy, whereas raw Google Trends data often degrades the performance of statistical models.
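To make the data problems above concrete, here is a minimal Python sketch of a much simpler cleaning pass. It is not the authors' pipeline: it swaps in a per-week median across repeated downloads (for sampling variability), linear interpolation over zeros (assumed here to be privacy-censored values), and an ordinary least-squares linear detrend, in place of the hierarchical clustering, smoothing splines, and detrending the abstract names. All function names are hypothetical and the input values are made up for illustration.

```python
from statistics import median
from typing import List, Optional


def combine_downloads(downloads: List[List[float]]) -> List[Optional[float]]:
    """Take the median across repeated downloads at each time step.

    Assumption: a value of 0 means the volume fell below the privacy
    threshold, so zeros are treated as missing rather than as true zeros.
    """
    combined: List[Optional[float]] = []
    for values in zip(*downloads):
        observed = [v for v in values if v > 0]
        combined.append(median(observed) if observed else None)
    return combined


def interpolate_missing(series: List[Optional[float]]) -> List[float]:
    """Fill gaps linearly; gaps at the edges copy the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * series[left] + w * series[right]
    return filled


def detrend_linear(series: List[float]) -> List[float]:
    """Subtract the least-squares line fitted against the time index."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx if sxx else 0.0
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]


def preprocess(downloads: List[List[float]]) -> List[float]:
    """Chain the three steps: combine, fill gaps, remove the linear trend."""
    return detrend_linear(interpolate_missing(combine_downloads(downloads)))


if __name__ == "__main__":
    # Two hypothetical downloads of the same weekly query (made-up numbers).
    downloads = [[10, 0, 14, 16], [12, 0, 14, 0]]
    print(preprocess(downloads))
```

The residuals returned by `preprocess` could then be passed to a forecasting model as an exogenous signal. A smoothing spline would additionally suppress week-to-week noise, which this sketch leaves untouched.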

