In this note we use the State of the Union Address (SOTU) dataset from Kaggle to make some surprising (and some not so surprising) observations about the general timeline of American history and about the character and nature of the addresses themselves. Our main approach uses vector embeddings produced by pre-trained language models such as BERT (DistilBERT) and GPT-2. While it is widely believed that BERT (and its variations) is the most suitable choice for NLP classification tasks, we find that GPT-2 in conjunction with a nonlinear dimension reduction method such as UMAP provides better separation and stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In our case no model fine-tuning is required, and the pre-trained out-of-the-box GPT-2 model is enough. We also used a fine-tuned DistilBERT model to classify which president delivered which address, with very good results (accuracy of 93\% to 95\%, depending on the run). An analogous task was performed to determine the year of writing, and we were able to pin it down to within about 4 years (a single presidential term). It is worth noting that SOTU addresses provide relatively small writing samples (about 8000 words on average, varying widely from under 2000 words to more than 20000), and that the number of authors is relatively large (we used SOTU addresses of 42 US presidents). This suggests that the techniques employed are rather efficient; moreover, all the computations described in this note can be performed using a single GPU instance of Google Colab. The accompanying code is available on GitHub.
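To make the GPT-2 + UMAP pipeline concrete, here is a minimal sketch of how one might embed each address with the pre-trained GPT-2 model and project the result to two dimensions. It is not the accompanying code: the function names, the 512-token windows, the mean pooling over tokens, and the length-weighted averaging over windows are all assumptions made for illustration, since the abstract does not specify how long addresses are handled or how token vectors are pooled.

```python
import numpy as np


def split_into_windows(token_ids, window=512):
    """Split a token sequence into consecutive, non-overlapping windows.

    Assumption: addresses are far longer than GPT-2's 1024-token context,
    so each one is embedded window by window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


def pool_document(window_vectors, window_lengths):
    """Combine per-window vectors into one document vector.

    Uses a length-weighted mean, so a short final window counts
    proportionally less (an illustrative choice, not the paper's).
    """
    vectors = np.asarray(window_vectors, dtype=float)
    weights = np.asarray(window_lengths, dtype=float)
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()


def load_gpt2(name="gpt2"):
    """Load the out-of-the-box GPT-2 tokenizer and model (no fine-tuning)."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    from transformers import GPT2Model, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained(name)
    model = GPT2Model.from_pretrained(name).eval()
    return tokenizer, model


def embed_address(text, tokenizer, model, window=512):
    """Return a single embedding vector for one address."""
    import torch

    token_ids = tokenizer(text)["input_ids"]
    if not token_ids:
        raise ValueError("cannot embed an empty text")
    vectors, lengths = [], []
    for chunk in split_into_windows(token_ids, window):
        with torch.no_grad():
            hidden = model(torch.tensor([chunk])).last_hidden_state
        # Mean over the token axis gives one vector per window.
        vectors.append(hidden[0].mean(dim=0).numpy())
        lengths.append(len(chunk))
    return pool_document(vectors, lengths)


def project_2d(embeddings, seed=0):
    """Reduce document embeddings to 2-D with UMAP (package: umap-learn)."""
    import umap

    return umap.UMAP(n_components=2, random_state=seed).fit_transform(embeddings)
```

Under these assumptions, one would call `load_gpt2()` once, stack `embed_address(text, tokenizer, model)` over all addresses into a single array, and pass that array to `project_2d` to obtain points that can be plotted and colored by president or by year.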