DIME-FM: DIstilling Multimodal and Efficient Foundation Models

Ximeng Sun; Pengchuan Zhang; Peizhao Zhang; Hardik Shah; Kate Saenko; Xide Xia

Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications because of their large size, high latency and fixed architectures. Unfortunately, recent work shows that training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences. The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar results in terms of zero-shot and linear-probing performance on both ImageNet and the ELEVATER (20 image classification tasks) benchmarks. It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.
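The abstract does not spell out the loss used for distillation, so the following is only a minimal sketch of one common way to distill a vision-language model with unpaired data, not necessarily the DIME-FM objective. It assumes that teacher and student image features live in the same space as a shared set of text features, and it matches the student's softmax distribution over image-to-sentence similarities to the teacher's with a KL divergence. All function names, the temperature value and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(image_feat, text_feats, temperature):
    """Softmax over cosine similarities between one image and all sentences."""
    img = normalize(image_feat)
    sims = [dot(img, normalize(t)) for t in text_feats]
    return softmax(sims, temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_img, student_img, text_feats, temperature=0.1):
    """Assumed objective: make the student's image-to-text score
    distribution match the teacher's. The sentences need not be captions
    of the image, which is why unpaired data suffices in this sketch."""
    p_teacher = similarity_distribution(teacher_img, text_feats, temperature)
    p_student = similarity_distribution(student_img, text_feats, temperature)
    return kl_divergence(p_teacher, p_student)


# Toy example: three sentence embeddings and one image seen by both models.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_img = [1.0, 0.2]
student_img = [0.2, 1.0]
print(distillation_loss(teacher_img, student_img, text_feats))
```

In this sketch the loss is zero when the student reproduces the teacher's similarity distribution and grows as the two distributions diverge. In practice the features would come from the image and text encoders and the loss would be averaged over a batch.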

updated: Fri Mar 31 2023 17:47:23 GMT+0000 (UTC)

published: Fri Mar 31 2023 17:47:23 GMT+0000 (UTC)

arXiv
