Increasing Textual Context Size Boosts Medical Image-Text Matching

Idan Glassberg; Tom Hope

This short technical report demonstrates a simple technique that yields state-of-the-art results in medical image-text matching tasks. We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has a negative impact on downstream performance in the medical domain, where encoding longer textual contexts is often required. We therefore train and release ClipMD, a model trained with a simple sliding window technique for encoding textual captions. We tested ClipMD on two medical image-text datasets and compared it with other image-text matching models. The results show that ClipMD outperforms the other models on both datasets by a large margin. We make our code and pretrained model publicly available.
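The abstract does not spell out how the sliding window is applied, so the following is only a minimal sketch of one plausible reading: split a long caption's token IDs into windows that fit the text encoder, encode each window, and average the resulting embeddings. The function names, the mean-pooling step, and the `toy_encoder` stand-in are all assumptions made for illustration and are not taken from the ClipMD code. The default window of 77 matches CLIP's text context length; a real pipeline would also need to leave room for the start and end tokens in each window.

```python
from typing import Callable, List, Sequence


def sliding_windows(token_ids: Sequence[int], window: int = 77,
                    stride: int = 77) -> List[List[int]]:
    """Split token_ids into chunks of at most `window` tokens.

    Consecutive chunks start `stride` tokens apart, so a stride smaller
    than the window produces overlapping chunks.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not token_ids:
        return [[]]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the caption
        start += stride
    return chunks


def encode_long_text(token_ids: Sequence[int],
                     encode_window: Callable[[List[int]], List[float]],
                     window: int = 77, stride: int = 77) -> List[float]:
    """Encode every window and mean-pool the per-window embeddings.

    Mean pooling is an assumption of this sketch, not a documented
    detail of ClipMD.
    """
    embeddings = [encode_window(chunk)
                  for chunk in sliding_windows(token_ids, window, stride)]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings)
            for i in range(dim)]


def toy_encoder(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for a text encoder (NOT CLIP)."""
    return [float(len(chunk)), float(sum(chunk))]


if __name__ == "__main__":
    caption_ids = list(range(10))  # pretend these are token IDs
    print(sliding_windows(caption_ids, window=4, stride=4))
    print(encode_long_text(caption_ids, toy_encoder, window=4, stride=4))
```

In practice `encode_window` would wrap a real text encoder, and the stride controls how much context adjacent windows share.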

updated: Thu Mar 23 2023 15:20:05 GMT+0000 (UTC)

published: Thu Mar 23 2023 15:20:05 GMT+0000 (UTC)

arXiv
