LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching

Duy M. H. Nguyen; Hoang Nguyen; Nghiem T. Diep; Tan N. Pham; Tri Cao; Binh T. Nguyen; Paul Swoboda; Nhat Ho; Shadi Albarqouni; Pengtao Xie; Daniel Sonntag; Mathias Niepert

Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples remains an open challenge for medical imaging data. While deep networks pre-trained on ImageNet and vision-language foundation models trained on web-scale data are the prevailing approaches, their effectiveness on medical tasks is limited by the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed via a combinatorial graph-matching objective; and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate LVM-Med on 15 downstream medical tasks, ranging from segmentation and classification to object detection, in both in-distribution and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as brain tumor classification or diabetic retinopathy grading, LVM-Med improves on previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.
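
To make the phrase "second-order graph matching" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image in a batch is a graph node, that two augmented views of the batch give two graphs, and that a matching is scored by a node term (cross-view cosine similarity) plus an edge term (how well within-view pairwise similarities are preserved). The function names, the `edge_weight` parameter, and the toy embeddings are all hypothetical. A brute-force search over permutations stands in for the black-box combinatorial solver mentioned in the abstract, and the gradient-estimation step used for training is omitted.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_second_order(view_a, view_b, edge_weight=1.0):
    """Brute-force second-order matching between two sets of embeddings.

    Hypothetical illustration only: exhaustive search is feasible for a
    handful of nodes, whereas real batches need a dedicated solver.
    Returns (best_perm, best_score), where best_perm[i] is the index in
    view_b matched to node i of view_a.
    """
    n = len(view_a)
    # First-order (node) affinities: similarity across the two views.
    node_aff = [[cosine(a, b) for b in view_b] for a in view_a]
    # Within-view structure: pairwise similarities inside each view.
    struct_a = [[cosine(p, q) for q in view_a] for p in view_a]
    struct_b = [[cosine(p, q) for q in view_b] for p in view_b]

    best_perm, best_score = None, -math.inf
    for perm in permutations(range(n)):
        node_term = sum(node_aff[i][perm[i]] for i in range(n))
        # Second-order (edge) term: penalize matchings that distort
        # the pairwise structure between the two graphs.
        edge_penalty = sum(
            abs(struct_a[i][k] - struct_b[perm[i]][perm[k]])
            for i in range(n)
            for k in range(i + 1, n)
        )
        score = node_term - edge_weight * edge_penalty
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy embeddings (made up): view_b is a shuffled, slightly perturbed view_a.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.1, 1.0], [1.0, 0.9], [1.0, 0.1]]
perm, score = match_second_order(view_a, view_b)
print(perm)
```

In a contrastive setting, the recovered matching would be compared with the known correspondence between augmented views; how that comparison is turned into a differentiable loss is described in the paper itself and is not reproduced here.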

updated: Sat Nov 18 2023 15:17:08 GMT+0000 (UTC)

published: Tue Jun 20 2023 22:21:34 GMT+0000 (UTC)

arXiv
