Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

Yinda Chen; Che Liu; Wei Huang; Sibo Cheng; Rossella Arcucci; Zhiwei Xiong

Vision-Language Pretraining (VLP) has demonstrated remarkable capabilities in learning visual representations from textual descriptions of images without annotations. Yet effective VLP demands large-scale image-text pairs, a resource that is scarce in the medical domain. Moreover, conventional VLP is limited to 2D images, while medical images encompass diverse modalities, often in 3D, making the learning process more challenging. To address these challenges, we present Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation (GTGM), a framework that extends VLP to 3D medical images without relying on paired textual descriptions. Specifically, GTGM uses large language models (LLMs) to generate medical-style text from 3D medical images. This synthetic text is then used to supervise 3D visual representation learning. Furthermore, a negative-free contrastive learning objective is introduced to cultivate consistent visual representations between augmented 3D medical image patches, which mitigates the biases associated with strict positive-negative sample pairings. We evaluate GTGM on 13 datasets spanning three imaging modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and electron microscopy (EM). GTGM's superior performance across various medical image segmentation tasks underscores its effectiveness and versatility in extending VLP to 3D medical imagery while bypassing the need for paired text.
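
The abstract does not give the exact form of the negative-free objective, so the following is only a minimal sketch of one common choice: a loss that pulls the embeddings of two augmented views of the same patch together using cosine similarity alone, with no negative pairs. The function names (`l2_normalize`, `negative_free_loss`) and the `2 - 2 * cos` form are assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def negative_free_loss(view_a, view_b):
    """Mean of (2 - 2 * cosine similarity) over matched pairs.

    view_a[i] and view_b[i] are embeddings of two augmentations of the
    same 3D patch. Only these positive pairs enter the loss; no other
    sample in the batch is treated as a negative.
    """
    if len(view_a) != len(view_b) or not view_a:
        raise ValueError("views must be non-empty and the same length")
    total = 0.0
    for a, b in zip(view_a, view_b):
        a_n = l2_normalize(a)
        b_n = l2_normalize(b)
        cos = sum(x * y for x, y in zip(a_n, b_n))
        total += 2.0 - 2.0 * cos
    return total / len(view_a)
```

Under this assumed form, the loss is zero when the two views produce embeddings pointing in the same direction and grows as they diverge. In practice, objectives of this kind are paired with a mechanism such as a stop-gradient or a redundancy-reduction term to avoid collapse to a constant embedding; which mechanism GTGM uses is not stated in the abstract.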

updated: Wed Jun 07 2023 22:20:51 GMT+0000 (UTC)

published: Wed Jun 07 2023 22:20:51 GMT+0000 (UTC)

arXiv
