Detecting Dementia from Speech and Transcripts using Transformers

Loukas Ilias; Dimitris Askounis; John Psarras

Alzheimer's disease (AD) is a neurodegenerative disease that seriously affects people's everyday lives if it is not diagnosed early, since no cure is available. Alzheimer's is the most common cause of dementia, a general term for loss of memory. Because dementia affects speech, existing research initiatives focus on detecting dementia from spontaneous speech. However, little work has been done on converting speech data to Log-Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs) and on using pretrained models. Likewise, little work has been done on both the use of transformer networks and the way the two modalities, i.e., speech and transcripts, are combined in a single neural network. To address these limitations, we first represent the speech signal as an image and employ several pretrained models, with the Vision Transformer (ViT) achieving the highest evaluation results. Second, we propose multimodal models. More specifically, our models include a Gated Multimodal Unit to control the influence of each modality on the final classification, and crossmodal attention to capture the relationships between the two modalities effectively. Extensive experiments on the ADReSS Challenge dataset demonstrate the effectiveness of the proposed models and their superiority over state-of-the-art approaches.
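
The abstract does not spell out how the Gated Multimodal Unit weighs the two modalities, so the following is only a minimal sketch of the commonly used formulation: each modality is projected through a tanh layer, and a sigmoid gate computed from both inputs mixes the two projections. The dimensions, the random weights, and the names `gated_multimodal_unit`, `W_s`, `W_t`, and `W_z` are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gated_multimodal_unit(x_speech, x_text, W_s, W_t, W_z):
    """Fuse two modality vectors with a learned per-dimension gate.

    h_s = tanh(W_s x_speech), h_t = tanh(W_t x_text)
    z   = sigmoid(W_z [x_speech; x_text])
    h   = z * h_s + (1 - z) * h_t
    """
    h_s = [math.tanh(a) for a in matvec(W_s, x_speech)]
    h_t = [math.tanh(a) for a in matvec(W_t, x_text)]
    # The gate sees the concatenation of both raw inputs.
    z = [sigmoid(a) for a in matvec(W_z, x_speech + x_text)]
    fused = [zi * hs + (1.0 - zi) * ht for zi, hs, ht in zip(z, h_s, h_t)]
    return fused, z


# Hypothetical feature sizes, chosen only for illustration.
rng = random.Random(0)
speech_dim, text_dim, hidden_dim = 4, 6, 3
x_speech = [rng.uniform(-1, 1) for _ in range(speech_dim)]
x_text = [rng.uniform(-1, 1) for _ in range(text_dim)]
W_s = random_matrix(hidden_dim, speech_dim, rng)
W_t = random_matrix(hidden_dim, text_dim, rng)
W_z = random_matrix(hidden_dim, speech_dim + text_dim, rng)

fused, gate = gated_multimodal_unit(x_speech, x_text, W_s, W_t, W_z)
print("gate:", gate)
print("fused:", fused)
```

A gate value near 1 in a given dimension means the speech projection dominates there, and a value near 0 means the transcript projection dominates, which is the sense in which the unit controls each modality's influence on the classification.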

updated: Tue Jan 17 2023 10:57:43 GMT+0000 (UTC)

published: Wed Oct 27 2021 21:00:01 GMT+0000 (UTC)

arXiv
