Improving Zero-shot Generalization and Robustness of Multi-modal Models

Yunhao Ge; Jie Ren; Andrew Gallagher; Yuxiao Wang; Ming-Hsuan Yang; Hartwig Adam; Laurent Itti; Balaji Lakshminarayanan; Jiaping Zhao

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (a gap of over 25% in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring the consistency of the predictions with respect to multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy: we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into the text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves accuracy across ImageNet shifted datasets, four other datasets, and other model architectures such as LiT. The proposed method is hyperparameter-free, requires no additional model training, and can be easily scaled to other large multi-modal architectures. Code is available at https://github.com/gyhandy/Hierarchy-CLIP.
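The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that per-class scores are already available for each prompt or image-transformation variant. The function names `is_uncertain` and `augment_prompts`, the prompt template, and the "which is a kind of" wording are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence


def top1(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def is_uncertain(scores_per_variant: Sequence[Sequence[float]]) -> bool:
    """Flag an image as uncertain when its top-1 class is not consistent.

    Each inner sequence holds the per-class scores obtained with one prompt
    template or one image transformation (assumed to be computed elsewhere,
    e.g. by a CLIP-style model).
    """
    predictions = {top1(scores) for scores in scores_per_variant}
    return len(predictions) > 1


def augment_prompts(
    label: str,
    parent: Optional[str],
    children: Sequence[str],
    template: str = "a photo of a {}.",  # hypothetical template
) -> List[str]:
    """Build prompts for a class using its hierarchy parent and children.

    The exact wording is an assumption; the abstract only states that the
    parent and children from the label hierarchy are plugged into the prompts.
    """
    suffix = f", which is a kind of {parent}" if parent else ""
    names = [label, *children]
    return [template.format(name + suffix) for name in names]


if __name__ == "__main__":
    # Two variants disagree on the top-1 class, so the image is flagged.
    print(is_uncertain([[0.1, 0.9], [0.8, 0.2]]))
    # Hierarchy-augmented prompts for a toy class.
    for prompt in augment_prompts("cat", "feline", ["tabby"]):
        print(prompt)
```

In this sketch, only images flagged by `is_uncertain` would be re-scored with the augmented prompts, which mirrors the abstract's description of improving accuracy on the uncertain subset without any additional training.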

updated: Thu May 25 2023 17:14:50 GMT+0000 (UTC)

published: Sun Dec 04 2022 07:26:24 GMT+0000 (UTC)

arXiv
