Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

Denis Coquenet; Clément Rambour; Emanuele Dalsasso; Nicolas Thome

Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, largely thanks to their free-text inputs. However, they struggle with some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to better exploit the capabilities of vision-language foundation models. Using the CLIP architecture as a baseline, we show strong improvements on fine-grained bird attribute detection and localization tasks, while also improving classification performance on the CUB200-2011 dataset. The source code is available for reproducibility at https://github.com/FactoDeepLearning/MultitaskVLFM.
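To make the positive/negative prompt idea more concrete, here is a minimal sketch of one way such a formulation could score a single attribute. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `attribute_probability`), the toy three-dimensional embeddings, and the temperature value are all hypothetical, and the real embeddings would come from CLIP's image and text encoders. The assumed setup is that each attribute has a positive prompt (for example "a bird with a red crown") and a negative prompt (for example "a bird without a red crown"), and that a softmax over the two image-text similarities yields a presence probability.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_probability(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Probability that an attribute is present (hypothetical helper).

    Applies a two-way softmax over the similarities between the image
    embedding and the positive / negative prompt embeddings.
    The temperature is an illustrative assumption.
    """
    s_pos = cosine(image_emb, pos_text_emb) / temperature
    s_neg = cosine(image_emb, neg_text_emb) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


# Toy embeddings standing in for encoder outputs (assumed values).
image = [0.9, 0.1, 0.2]
positive = [1.0, 0.0, 0.1]   # stand-in for the positive prompt embedding
negative = [0.0, 1.0, 0.1]   # stand-in for the negative prompt embedding

p = attribute_probability(image, positive, negative)
print(f"attribute presence probability: {p:.4f}")
```

Because the score is a softmax over exactly two prompts, swapping the positive and negative embeddings gives the complementary probability, and identical prompts give 0.5.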

updated: Thu Jul 13 2023 15:05:34 GMT+0000 (UTC)

published: Thu Jul 13 2023 15:05:34 GMT+0000 (UTC)

arXiv
