We attempt to detect out-of-distribution (OOD) text samples by applying Topological Data Analysis (TDA) to attention maps in transformer-based language models. We evaluate our proposed TDA-based approach to OOD detection on BERT, a transformer-based language model, and compare it to a more traditional OOD approach based on BERT CLS embeddings. We find that our TDA approach outperforms the CLS embedding approach at distinguishing in-distribution data (politics and entertainment news articles from HuffPost) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates on near out-of-domain (CNN/Dailymail) or same-domain (business news articles from HuffPost) datasets.
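To make the idea of topological features of an attention map concrete, the following is a minimal sketch, not the method evaluated here. It assumes one simple feature: the number of connected components (Betti-0) of the graph obtained by keeping token pairs whose symmetrized attention weight meets a threshold. The function names, the thresholds, and the toy 4-token attention matrix are all hypothetical and chosen only for illustration.

```python
def count_components(attention, threshold):
    """Count connected components (Betti-0) of the graph whose edges are
    token pairs with symmetrized attention weight >= threshold.

    Assumption: symmetrizing with max() and thresholding is one simple
    way to turn an attention map into a graph; it is an illustrative choice.
    """
    n = len(attention)
    parent = list(range(n))  # union-find forest over token indices

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            weight = max(attention[i][j], attention[j][i])
            if weight >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(n)})


def betti0_curve(attention, thresholds):
    """Betti-0 at each threshold: a small feature vector for one attention map."""
    return [count_components(attention, t) for t in thresholds]


# Hypothetical attention map for 4 tokens (each row sums to 1).
toy_attention = [
    [0.70, 0.25, 0.03, 0.02],
    [0.30, 0.60, 0.05, 0.05],
    [0.02, 0.03, 0.55, 0.40],
    [0.05, 0.05, 0.45, 0.45],
]

print(betti0_curve(toy_attention, [0.5, 0.35, 0.2, 0.04]))  # [4, 3, 2, 1]
```

In a pipeline of this kind, such a vector would be computed per attention head and passed to a downstream OOD scorer. Both the choice of feature and the scorer are assumptions of this sketch.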