Improving Zero-shot Generalization and Robustness of Multi-modal Models

Yunhao Ge; Jie Ren; Yuxiao Wang; Andrew Gallagher; Ming-Hsuan Yang; Laurent Itti; Hartwig Adam; Balaji Lakshminarayanan; Jiaping Zhao

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (a gap of over 25% in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring the consistency of the predictions with respect to multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy: we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into the text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and by 3.6% on the entire ImageNet validation set. We also show that our method yields improvements across ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training, and can be easily scaled to other large multi-modal architectures.
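The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `predict` callable, the `TOY_HIERARCHY` dictionary standing in for WordNet, the strict all-agree consistency rule, and the "which is a kind of" prompt wording are all hypothetical choices made here, since the abstract does not specify them.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical toy hierarchy standing in for WordNet: label -> (parent, children).
TOY_HIERARCHY: Dict[str, Tuple[Optional[str], List[str]]] = {
    "crane": ("wading bird", ["whooping crane"]),
    "tusker": ("animal", []),
}


def top1_is_consistent(
    predict: Callable[[object, str], str],
    image: object,
    templates: Sequence[str],
    transforms: Sequence[Callable[[object], object]],
) -> bool:
    """Return True if the top-1 label is the same under every prompt template
    and every image transformation (assumed rule: any disagreement means
    the image is treated as uncertain)."""
    reference = predict(image, templates[0])
    # Vary the prompt while keeping the image fixed.
    for template in templates[1:]:
        if predict(image, template) != reference:
            return False
    # Vary the image while keeping the first prompt fixed.
    for transform in transforms:
        if predict(transform(image), templates[0]) != reference:
            return False
    return True


def augmented_prompts(
    label: str,
    hierarchy: Dict[str, Tuple[Optional[str], List[str]]],
    template: str = "a photo of a {}",
) -> List[str]:
    """Build prompts for a class and its children, each qualified by the
    class's parent when one is known (prompt wording is an assumption)."""
    parent, children = hierarchy.get(label, (None, []))
    names = [label] + list(children)
    if parent is None:
        return [template.format(name) for name in names]
    return [
        template.format(f"{name}, which is a kind of {parent}") for name in names
    ]
```

In this sketch, only images for which `top1_is_consistent` returns `False` would be re-scored with the prompts from `augmented_prompts`, which mirrors the abstract's focus on the uncertain subset. A real pipeline would supply `predict` from a CLIP or LiT model and take parents and children from WordNet.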

updated: Sun Dec 04 2022 07:26:24 GMT+0000 (UTC)

published: Sun Dec 04 2022 07:26:24 GMT+0000 (UTC)

arXiv
