ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax

Zachary Huemann; Junjie Hu; Tyler Bradshaw

Clinical imaging databases contain not only medical images but also text reports generated by physicians. These narrative reports often describe the location, size, and shape of the disease, but the use of descriptive text to guide medical image analysis remains understudied. Vision-language models are increasingly used for multimodal tasks such as image generation, image captioning, and visual question answering, but have scarcely been used in medical imaging. In this work, we develop a vision-language model for the task of pneumothorax segmentation. Our model, ConTEXTual Net, detects and segments pneumothorax in chest radiographs guided by free-form radiology reports. ConTEXTual Net achieved a Dice score of 0.72 ± 0.02, which was similar to the level of agreement between the primary physician annotator and the other physician annotators (0.71 ± 0.04). ConTEXTual Net also outperformed a U-Net. We demonstrate that descriptive language can be incorporated into a segmentation model for improved performance. Through an ablation study, we show that it is the text information that is responsible for the performance gains. Additionally, we show that certain augmentation methods worsen ConTEXTual Net's segmentation performance by breaking the image-text concordance. We propose a set of augmentations that maintain this concordance and improve segmentation training.
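For readers unfamiliar with the metric reported above, the Dice score between a predicted mask and a reference mask is twice the size of their overlap divided by the sum of their sizes. The following is a minimal sketch of that standard formula, not the authors' evaluation code; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|).

    `dice_score` is a hypothetical helper name, not taken from the paper.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    # Number of pixels marked positive in both masks.
    intersection = np.logical_and(pred, target).sum()
    # Total number of positive pixels across the two masks.
    total = pred.sum() + target.sum()

    # Assumption: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1 means the masks coincide exactly and 0 means they do not overlap at all, which is why the model's 0.72 can be compared directly with the inter-annotator agreement of 0.71.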

updated: Thu Mar 02 2023 22:36:19 GMT+0000 (UTC)

published: Thu Mar 02 2023 22:36:19 GMT+0000 (UTC)

arXiv
