We present an open-source Python library for building and using datasets in which inputs are clusters of textual data and outputs are sequences of real values representing one or more time series signals. The news-signals library supports diverse data science and NLP problem settings related to predicting time series behaviour from textual data feeds. For example, in the news domain, inputs are document clusters corresponding to daily news articles about a particular entity, and targets are explicitly associated real-valued time series, such as the volume of news about a particular person or company, or the number of pageviews of specific Wikimedia pages. Despite the many industry and research use cases for this class of problem settings, to the best of our knowledge, News Signals is the only open-source library designed specifically to facilitate data science and research settings with natural language inputs and time series targets. In addition to the core codebase for building and interacting with datasets, we conduct a suite of experiments using several popular machine learning libraries to establish baselines for time series anomaly prediction using textual inputs.
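To make the problem setting concrete, the following is a minimal, standard-library-only sketch of the data shape described above: one cluster of documents per day, aligned with one real-valued target per day, plus a simple z-score rule for deriving anomaly labels. It does not use or reproduce the news-signals API; the names `DailyExample`, `build_examples`, and `anomaly_labels`, the toy data, and the threshold of 1.5 standard deviations are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean, pstdev
from typing import List


@dataclass
class DailyExample:
    """One time step: a cluster of texts paired with a real-valued target."""
    day: date
    documents: List[str]  # input: texts published on this day about one entity
    target: float         # output: signal value for this day (e.g. article volume)


def build_examples(start, clusters, signal):
    """Align per-day document clusters with a time series of equal length."""
    if len(clusters) != len(signal):
        raise ValueError("clusters and signal must cover the same days")
    return [
        DailyExample(start + timedelta(days=i), docs, float(value))
        for i, (docs, value) in enumerate(zip(clusters, signal))
    ]


def anomaly_labels(examples, threshold=1.5):
    """Flag days whose target exceeds the mean by more than `threshold` std devs."""
    values = [ex.target for ex in examples]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant signal has no anomalies under this rule.
        return [False] * len(values)
    return [(v - mu) / sigma > threshold for v in values]


# Toy data (hypothetical): five days of headlines about a single entity.
clusters = [
    ["headline a"],
    ["headline b", "headline c"],
    ["headline d"],
    ["headline %d" % i for i in range(9)],  # a burst of coverage
    ["headline e"],
]
signal = [len(docs) for docs in clusters]  # target: daily article volume

examples = build_examples(date(2024, 1, 1), clusters, signal)
labels = anomaly_labels(examples)
print(labels)
```

In this toy setup only the fourth day, where coverage jumps to nine articles, is flagged. A text-based predictor for this kind of task would take `documents` from preceding days as input and try to predict the label of a later day.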