Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimer's Dementia Detection

Speech is widely used to build automatic Alzheimer's dementia (AD) detection systems, because acoustic and linguistic abilities decline in people living with AD from the early stages. However, speech carries not only AD-related local and global information but also information unrelated to cognitive status, such as age and gender. In this paper, we propose a speech-based system named Swin-BERT for automatic dementia detection. For the acoustic part, our system is built on shifted-window multi-head attention, which was originally proposed to extract local and global information from images. To decouple the effect of age and gender from acoustic feature extraction, both are provided as extra inputs to the acoustic system. For the linguistic part, rhythm-related information, which differs significantly between people living with and without AD, is removed when the audio recordings are transcribed. To compensate for this removed information, we propose using character-level transcripts as an extra input to a word-level BERT-style system. Finally, Swin-BERT combines the acoustic features learned by the proposed acoustic system with the proposed linguistic system. Experiments are conducted on the two datasets provided by international dementia detection challenges, ADReSS and ADReSSo. The results show that the proposed acoustic and linguistic systems each perform better than or comparably to previous research on both datasets. The proposed Swin-BERT system achieves superior results, with F-scores of 85.58% on ADReSS and 87.32% on ADReSSo.
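
To make the shifted-window idea concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It reduces the technique to one dimension (a sequence of frame indices rather than a 2-D image or spectrogram) and shows only how the windows are formed, not the attention computed inside them. The function names `partition_windows` and `shifted_windows`, the window size of 4, and the shift of half a window are illustrative assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_windows(frames: Sequence[T], window: int) -> List[List[T]]:
    """Split a sequence into non-overlapping windows of equal size.

    Attention restricted to each window captures local context only.
    """
    if window <= 0 or len(frames) % window != 0:
        raise ValueError("sequence length must be a positive multiple of window")
    return [list(frames[i:i + window]) for i in range(0, len(frames), window)]


def shifted_windows(frames: Sequence[T], window: int) -> List[List[T]]:
    """Cyclically shift the sequence by half a window, then partition it.

    The shifted windows straddle the boundaries of the regular windows, so
    alternating regular and shifted layers lets information flow between
    neighbouring windows. The half-window shift is an assumption here.
    """
    shift = window // 2
    rolled = list(frames[shift:]) + list(frames[:shift])  # cyclic roll to the left
    return partition_windows(rolled, window)


if __name__ == "__main__":
    frames = list(range(8))  # hypothetical indices of 8 acoustic frames
    print(partition_windows(frames, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(shifted_windows(frames, 4))    # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

In the shifted partition, frames 3 and 4 fall in the same window although they were separated by a boundary in the regular partition; this is how stacked layers can combine local and global information.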
