No Token Left Behind: Explainability-Aided Image Classification and Generation

Roni Paiss; Hila Chefer; Lior Wolf

The application of zero-shot learning in computer vision has been revolutionized by the use of image-text matching models. The most notable example, CLIP, has been widely used for both zero-shot classification and guiding generative models with a text prompt. However, the zero-shot use of CLIP is unstable with respect to the phrasing of the input text, making it necessary to carefully engineer the prompts used. We find that this instability stems from a selective similarity score, which is based only on a subset of the semantically meaningful input tokens. To mitigate it, we present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input, in addition to employing the CLIP similarity loss used in previous works. When applied to one-shot classification through prompt engineering, our method yields an improvement in the recognition rate, without additional training or fine-tuning. Additionally, we show that CLIP guidance of generative models using our method significantly improves the generated images. Finally, we demonstrate a novel use of CLIP guidance for text-based image generation with spatial conditioning on object location, by requiring the image explainability heatmap for each object to be confined to a pre-determined bounding box.
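The abstract names two ingredients without giving formulas: a loss term that pushes CLIP to attend to every semantically meaningful token, and a constraint that keeps each object's explainability heatmap inside a given bounding box. The following is only a minimal sketch of how such terms could be combined, not the authors' formulation. All function names, the `weight` parameter, and the specific penalty forms are assumptions made for illustration; the per-token relevance scores and the heatmap are assumed to be produced elsewhere by an explainability method and are passed in as plain lists.

```python
# Illustrative sketch only: the penalty forms below are assumptions,
# not the loss defined in the paper.

def token_coverage_penalty(token_relevance, semantic_mask):
    """Mean shortfall (1 - relevance) over the tokens marked as meaningful.

    token_relevance: per-token scores in [0, 1] (hypothetical input).
    semantic_mask:   1 for tokens that carry meaning, 0 otherwise.
    """
    picked = [r for r, m in zip(token_relevance, semantic_mask) if m]
    if not picked:
        return 0.0
    return sum(1.0 - r for r in picked) / len(picked)


def outside_box_fraction(heatmap, box):
    """Fraction of heatmap mass that falls outside the bounding box.

    heatmap: 2D list of non-negative values (rows of pixels).
    box:     (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0
    inside = sum(
        heatmap[y][x] for y in range(top, bottom) for x in range(left, right)
    )
    return 1.0 - inside / total


def guided_loss(similarity, token_relevance, semantic_mask, weight=1.0):
    """Similarity loss plus a weighted token-coverage penalty (assumed form)."""
    penalty = token_coverage_penalty(token_relevance, semantic_mask)
    return (1.0 - similarity) + weight * penalty
```

In this sketch a prompt whose meaningful tokens all receive full relevance adds no penalty, while a prompt where CLIP ignores half of them is penalized in proportion, which mirrors the abstract's claim that instability comes from scoring only a subset of the tokens.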

updated: Mon Apr 11 2022 07:16:39 GMT+0000 (UTC)

published: Mon Apr 11 2022 07:16:39 GMT+0000 (UTC)

arXiv
