Self-supervised vision-language pretraining for Medical visual question answering

Pengfei Li; Gang Liu; Lin Tan; Jinying Liao; Shenjun Zhong

Medical visual question answering (VQA) is the task of answering clinical questions about a given radiographic image. It is a challenging problem because the model must integrate both vision and language information. Because training data for medical VQA is limited, the pretrain-finetune paradigm is widely used to improve model generalization. In this paper, we propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining on a medical image caption dataset, and then finetunes on downstream medical VQA tasks. The proposed method achieves state-of-the-art performance on all three public medical VQA datasets. Our code and models are available at https://github.com/pengfeiliHEU/M2I2.
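
To make the image text alignment objective concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired image and caption embeddings. The abstract does not state the exact form of the loss, so this formulation, the function name `contrastive_alignment_loss`, and the temperature value of 0.07 are illustrative assumptions rather than the M2I2 implementation; see the linked repository for the authors' code.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Assumption: image_embs[i] and text_embs[i] come from the same
    image-caption pair; every other pairing in the batch is a negative.
    The temperature is an illustrative default, not a value from the paper.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every caption, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # caption-to-image direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
matched = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a much larger one. That gap is the signal that pulls an image and its caption together in the shared embedding space during pretraining.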

updated: Thu Nov 24 2022 13:31:56 GMT+0000 (UTC)

published: Thu Nov 24 2022 13:31:56 GMT+0000 (UTC)

arXiv
