Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Jong Hak Moon; Hyungyung Lee; Woncheol Shin; Edward Choi

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance on both vision-language understanding tasks (image-report retrieval, disease classification, medical visual question answering) and a vision-language generation task (report generation). By rigorously evaluating the proposed model on four downstream tasks with two chest X-ray image datasets (MIMIC-CXR and Open-I), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.

updated: Mon May 24 2021 15:14:09 GMT+0000 (UTC)

published: Mon May 24 2021 15:14:09 GMT+0000 (UTC)

arXiv
