MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification

Umaima Rahman; Raza Imam; Dwarikanath Mahapatra; Boulbaba Ben Amor

In medical image classification, supervised learning is challenging because labeled medical images are scarce. In contrast to the traditional approach of pre-training followed by fine-tuning, this work leverages the visual-textual alignment within Vision-Language Models (VLMs) to facilitate unsupervised learning. Specifically, we propose Medical Unsupervised Adaptation (MedUnA), which consists of two training stages: Adapter Pre-training and Unsupervised Learning. In the first stage, we use a Large Language Model (LLM) to generate descriptions corresponding to the class labels, and pass these descriptions through the text encoder BioBERT. The resulting text embeddings are then aligned with the class labels by training a lightweight adapter. We choose LLMs because they can generate detailed, contextually relevant descriptions, which yield richer text embeddings. In the second stage, the trained adapter is integrated with the visual encoder of MedCLIP. This stage employs a contrastive entropy-based loss and prompt tuning to align the visual embeddings. We incorporate self-entropy minimization into the overall training objective to encourage more confident embeddings, which are crucial for effective unsupervised learning and alignment. We evaluate MedUnA on three imaging modalities: chest X-rays, eye fundus images, and skin lesion images. The results demonstrate a significant average accuracy gain over the baselines across different datasets, highlighting the efficacy of our approach.
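As an illustration of the self-entropy minimization mentioned in the abstract, the sketch below computes the Shannon entropy of softmax class probabilities, a quantity that is small when a prediction is confident and large when it is spread across classes. The abstract does not give the exact form of the loss, so this is only a generic version of the idea under that assumption; the function names (`softmax`, `self_entropy`, `mean_self_entropy`) and the example logits are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_entropy(logits):
    """Shannon entropy of the predicted class distribution for one sample.

    Assumption: a generic entropy term, not the paper's exact objective.
    """
    probs = softmax(logits)
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_self_entropy(batch_logits):
    """Average entropy over a batch; minimizing it favors confident predictions."""
    return sum(self_entropy(row) for row in batch_logits) / len(batch_logits)

# Hypothetical logits for a three-class problem.
confident = [8.0, 0.0, 0.0]   # one class dominates
uncertain = [1.0, 1.0, 1.0]   # all classes equally likely

if __name__ == "__main__":
    print("confident:", self_entropy(confident))
    print("uncertain:", self_entropy(uncertain))
    print("batch mean:", mean_self_entropy([confident, uncertain]))
```

The uniform prediction reaches the maximum possible entropy for three classes, log 3, while the confident prediction has entropy close to zero, so a training objective that penalizes this term pushes unlabeled samples toward one class.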

updated: Tue Sep 03 2024 09:25:51 GMT+0000 (UTC)

published: Tue Sep 03 2024 09:25:51 GMT+0000 (UTC)

arXiv
