MMBERT: Multimodal BERT Pretraining for Improved Medical VQA

Yash Khare; Viraj Bagal; Minesh Mathew; Adithi Devi; U Deva Priyakumar; CV Jawahar

Images in the medical domain are fundamentally different from general-domain images. Consequently, it is infeasible to directly employ general-domain Visual Question Answering (VQA) models in the medical domain. Additionally, annotating medical images is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks. Our method learns richer semantic representations of medical images and text by using Masked Language Modeling (MLM) with image features as the pretext task on a large medical image+caption dataset. The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images, VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of the previous best solutions. Moreover, our solution provides attention maps, which aid model interpretability. The code is available at https://github.com/VirajBagal/MMBERT
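
As a rough illustration of the pretext task named in the abstract, the sketch below masks caption tokens and places them after a set of image feature vectors, so that a model would have to predict the hidden words from both the image and the remaining text. This is not the authors' implementation: the function names `mask_caption` and `build_input`, the 15% masking rate, the `[MASK]` string, and the image-first sequence layout are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the actual tokenizer may differ


def mask_caption(tokens, mask_prob=0.15, seed=0):
    """Randomly hide caption tokens for masked language modeling.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # prediction target
        else:
            masked.append(tok)
            labels.append(None)  # position ignored by the loss
    return masked, labels


def build_input(image_features, masked_tokens):
    """Concatenate image feature vectors and text tokens into one sequence.

    Each element is tagged with its modality ("img" or "txt"); the
    image-first ordering is a hypothetical choice for this sketch.
    """
    return [("img", f) for f in image_features] + [("txt", t) for t in masked_tokens]


if __name__ == "__main__":
    caption = "chest x-ray showing right lower lobe consolidation".split()
    masked, labels = mask_caption(caption, mask_prob=0.5, seed=1)
    sequence = build_input([[0.1, 0.2], [0.3, 0.4]], masked)
    print(masked)
    print(labels)
    print(len(sequence))
```

In a real pretraining loop, the loss would be computed only at the positions where `labels` is not `None`, which is what forces the model to use the image features to recover the missing caption words.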

updated: Sat Apr 03 2021 13:01:19 GMT+0000 (UTC)

published: Sat Apr 03 2021 13:01:19 GMT+0000 (UTC)

arXiv
