Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin; Samuel Yu; Zhiyi Kuang; Deepak Pathak; Deva Ramanan

The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve state-of-the-art (SOTA) results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
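The core idea in the abstract, treating each class name's text embedding as one extra labeled training sample for a linear classifier, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes image and text features already live in a shared embedding space (as CLIP's encoders would provide) and substitutes synthetic random vectors for real embeddings. The function names `train_cross_modal_linear` and `predict`, the step count, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_cross_modal_linear(image_feats, image_labels, text_feats,
                             num_steps=200, lr=1.0):
    """Fit one linear classifier on image and text features together.

    text_feats[c] stands for the embedding of class c's name and is used
    as one additional training sample labeled c (hypothetical helper).
    """
    num_classes, dim = text_feats.shape
    # Pool both modalities into a single training set.
    feats = l2_normalize(np.concatenate([image_feats, text_feats], axis=0))
    labels = np.concatenate([image_labels, np.arange(num_classes)])
    one_hot = np.eye(num_classes)[labels]

    weights = np.zeros((dim, num_classes))
    for _ in range(num_steps):
        # Plain gradient descent on the mean cross-entropy loss.
        probs = softmax(feats @ weights)
        grad = feats.T @ (probs - one_hot) / len(feats)
        weights -= lr * grad
    return weights


def predict(weights, feats):
    """Return the highest-scoring class index for each feature row."""
    return np.argmax(l2_normalize(feats) @ weights, axis=1)


# Synthetic stand-ins for embeddings in a shared space (assumption:
# a real setup would use a multimodal encoder such as CLIP instead).
rng = np.random.default_rng(0)
num_classes, dim, shots = 3, 16, 2
class_centers = rng.normal(size=(num_classes, dim))
text_feats = class_centers + 0.1 * rng.normal(size=(num_classes, dim))
image_labels = np.repeat(np.arange(num_classes), shots)
image_feats = class_centers[image_labels] + 0.1 * rng.normal(
    size=(num_classes * shots, dim))

weights = train_cross_modal_linear(image_feats, image_labels, text_feats)

test_labels = np.repeat(np.arange(num_classes), 5)
test_feats = class_centers[test_labels] + 0.1 * rng.normal(
    size=(len(test_labels), dim))
accuracy = float(np.mean(predict(weights, test_feats) == test_labels))
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Because the text rows are simply appended to the image rows, the same loop would accept features from any further modality mapped into the same space, which is the property the abstract relies on for its audiovisual experiments.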

updated: Mon Jan 16 2023 05:40:42 GMT+0000 (UTC)

published: Mon Jan 16 2023 05:40:42 GMT+0000 (UTC)

arXiv
