A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering

Xiaofei Huang; Hongfang Gong

Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis. MVQA is a task that aims to predict accurate and convincing answers from a given medical image and an associated natural language question. This task requires extracting feature content rich in medical knowledge and understanding it at a fine granularity. Constructing an effective feature extraction and understanding scheme is therefore key to modeling. Existing MVQA question extraction schemes mainly focus on word information and ignore the medical information in the text. Meanwhile, some visual and textual feature understanding schemes cannot effectively capture the correlation between image regions and keywords that reasonable visual reasoning requires. In this study, a dual-attention learning network with word and sentence embedding (WSDAN) is proposed. We design a module, transformer with sentence embedding (TSE), to extract a double embedding representation of questions that contains both keywords and medical information. A dual-attention learning (DAL) module consisting of self-attention and guided attention is proposed to model intensive intramodal and intermodal interactions. Stacking multiple DAL modules (DALs) to learn visual and textual co-attention can increase the granularity of understanding and improve visual reasoning. Experimental results on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate that our proposed method outperforms previous state-of-the-art methods. According to the ablation studies and Grad-CAM maps, WSDAN can extract rich textual information and has strong visual reasoning ability.
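
The abstract describes the DAL module only at a high level, so the following is a minimal sketch of the general idea, not the authors' implementation. It assumes plain scaled dot-product attention with residual connections, no learned projection weights and no layer normalization, and it assumes that guided attention lets image regions attend to question words; the abstract does not state the direction. All function names (`attention`, `dal_layer`, `stack_dal`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(x, y):
    """Element-wise sum of two equally shaped lists of vectors (residual)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def dal_layer(question, image):
    """One hypothetical dual-attention step (assumed structure)."""
    # Self-attention: intramodal interactions within each modality.
    question = add(question, attention(question, question, question))
    image = add(image, attention(image, image, image))
    # Guided attention: image regions query the question words (intermodal).
    # The direction is an assumption made for this sketch.
    image = add(image, attention(image, question, question))
    return question, image


def stack_dal(question, image, depth=2):
    """Apply several dual-attention steps in sequence."""
    for _ in range(depth):
        question, image = dal_layer(question, image)
    return question, image


# Toy inputs: 3 question-word vectors and 4 image-region vectors of width 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
words_out, regions_out = stack_dal(words, regions, depth=2)
```

Each step preserves the number and width of the vectors in both modalities, which is what allows several such steps to be stacked. A real model would add learned query, key, and value projections, multiple heads, and normalization.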

updated: Sat Nov 12 2022 01:21:33 GMT+0000 (UTC)

published: Sat Oct 01 2022 08:32:40 GMT+0000 (UTC)

arXiv
