In this note we use the State of the Union Address (SOTU) dataset from Kaggle to make some surprising (and some not so surprising) observations about the general timeline of American history and about the character and nature of the addresses themselves. Our main approach relies on vector embeddings produced by pre-trained language models, namely BERT (in its DistilBERT variant) and GPT-2. While it is widely believed that BERT and its variations are the most suitable models for NLP classification tasks, we find that GPT-2 in conjunction with a nonlinear dimension reduction method such as UMAP provides better separation and stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In our case no model fine-tuning is required: the pre-trained, out-of-the-box GPT-2 model is enough. We also used a fine-tuned DistilBERT model for a classification task, detecting which President delivered which address, with very good results (accuracy of 93% to 95%, depending on the run). An analogous task was performed to determine the year of writing, and we were able to pin it down to within about 4 years (a single presidential term). It is worth noting that SOTU addresses provide relatively small writing samples (about 8,000 words on average, varying widely from under 2,000 words to more than 20,000), and that the number of authors is relatively large (we used SOTU addresses of 42 US presidents). This shows that the techniques employed are rather efficient, and all the computations described in this note can be performed on a single GPU instance of Google Colab. The accompanying code is available on GitHub.
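As a rough illustration of the GPT-2 + UMAP pipeline mentioned above, the following is a minimal sketch and not the accompanying GitHub code. It assumes the Hugging Face `transformers`, `torch` and `umap-learn` packages. The helper names (`chunk_ids`, `embed_address`, `project_2d`), the splitting of each address into 1024-token windows, and the mean pooling are all assumptions made for illustration; the note itself does not state how long addresses are handled.

```python
from typing import List


def chunk_ids(token_ids: List[int], max_len: int = 1024) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most max_len tokens.

    GPT-2 has a 1024-token context, while an address can run to many
    thousands of words, so some form of windowing is needed (assumed here).
    """
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def embed_address(text, tokenizer, model, max_len: int = 1024):
    """Return one vector per address using a pre-trained (not fine-tuned) GPT-2.

    Assumption: hidden states are mean-pooled within each window, and the
    window vectors are then averaged without weighting by window length.
    """
    import torch  # imported lazily so chunk_ids stays usable without torch

    ids = tokenizer(text)["input_ids"]
    vectors = []
    for window in chunk_ids(ids, max_len):
        with torch.no_grad():
            out = model(torch.tensor([window]))
        # last_hidden_state has shape (1, window_length, hidden_size)
        vectors.append(out.last_hidden_state[0].mean(dim=0))
    return torch.stack(vectors).mean(dim=0).numpy()


def project_2d(embeddings):
    """Reduce the address embeddings to two dimensions with UMAP."""
    import umap

    return umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)


if __name__ == "__main__":
    import numpy as np
    from transformers import GPT2Model, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2Model.from_pretrained("gpt2").eval()

    # Placeholder texts; in practice these would be the SOTU addresses.
    addresses = ["Fellow citizens of the Senate ...", "Mr. Speaker ..."] * 20
    X = np.stack([embed_address(t, tokenizer, model) for t in addresses])
    print(project_2d(X).shape)
```

The two-dimensional points returned by `project_2d` can then be coloured by president or by year to inspect how well the addresses separate.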