Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing

Sheng Zhang; Yanbo Xu; Naoto Usuyama; Jaspreet Bagga; Robert Tinn; Sam Preston; Rajesh Rao; Mu Wei; Naveen Valluri; Cliff Wong; Matthew P. Lungren; Tristan Naumann; Hoifung Poon

Contrastive pretraining on parallel image-text data has attained great success in vision-language processing (VLP), as exemplified by CLIP and related methods. However, prior explorations tend to focus on general domains on the web. Biomedical images and text are rather different, but publicly available datasets are small and skew toward chest X-rays, thus severely limiting progress. In this paper, we conducted by far the largest study on biomedical VLP, using 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. Our dataset (PMC-15M) is two orders of magnitude larger than existing biomedical image-text datasets such as MIMIC-CXR, and spans a diverse range of biomedical images. The standard CLIP method is suboptimal for the biomedical domain. We propose BiomedCLIP, with domain-specific adaptations tailored to biomedical VLP. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks, from retrieval to classification to visual question answering (VQA). BiomedCLIP established a new state of the art on a wide range of standard datasets, substantially outperforming prior VLP approaches. Surprisingly, BiomedCLIP even outperformed radiology-specific state-of-the-art models such as BioViL on radiology-specific tasks such as RSNA pneumonia detection, thus highlighting the utility of large-scale pretraining across all biomedical image types. We will release our models at https://aka.ms/biomedclip to facilitate future research in biomedical VLP.
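The abstract does not spell out the training objective, so the following is only a minimal sketch of the symmetric contrastive (InfoNCE-style) loss generally associated with CLIP-like pretraining on paired image and text embeddings. It is not BiomedCLIP's implementation: the function names, the plain-list embedding format, and the default temperature of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same
    figure-caption pair; every other combination in the batch is treated
    as a negative. The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: for image i, the matching text is column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (i2t + t2i) / 2
```

Under these assumptions, a batch whose image and text embeddings line up pair by pair yields a loss near zero, while a batch with the captions swapped yields a large loss. Real training code would compute the same quantity on batched tensors from the image and text encoders.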

updated: Thu Mar 02 2023 02:20:04 GMT+0000 (UTC)

published: Thu Mar 02 2023 02:20:04 GMT+0000 (UTC)

arXiv
