PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Xiaoman Zhang; Chaoyi Wu; Ziheng Zhao; Weixiong Lin; Ya Zhang; Yanfeng Wang; Weidi Xie

In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital, clinically relevant information. First, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and we propose a generative-based model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Second, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs over 149k images covering various modalities and diseases. Third, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging: even the best models struggle to solve it.
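
The abstract does not say how the visual features are aligned with the language model. As a rough illustration only, the sketch below assumes one common approach: a learned linear projection maps each visual feature vector into the language model's embedding space, and the projected "visual tokens" are placed in front of the question token embeddings. All names, dimensions, and the projection itself are hypothetical and are not taken from the paper.

```python
import random


def project(visual_features, weight, bias):
    """Map each visual feature vector (dim D_VISION) to the text embedding dim (D_TEXT)."""
    projected = []
    for vec in visual_features:
        out = [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)
        ]
        projected.append(out)
    return projected


def build_input_sequence(visual_features, question_embeddings, weight, bias):
    """Prefix the question token embeddings with the projected visual tokens."""
    return project(visual_features, weight, bias) + question_embeddings


# Toy sizes and random values, chosen only for illustration (assumptions).
D_VISION, D_TEXT = 4, 3
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(D_TEXT)]
bias = [0.0] * D_TEXT

# Stand-ins for the outputs of a frozen vision encoder (2 patches)
# and a language model's token embedding table (5 question tokens).
visual_features = [[rng.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(2)]
question_embeddings = [[rng.uniform(-1, 1) for _ in range(D_TEXT)] for _ in range(5)]

sequence = build_input_sequence(visual_features, question_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 7 3
```

In a real system the projection weights would be trained and the combined sequence would be fed to the language model, which then generates the answer as free text; that step is omitted here.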

updated: Mon May 29 2023 12:23:21 GMT+0000 (UTC)

published: Wed May 17 2023 17:50:16 GMT+0000 (UTC)

arXiv
