Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Yoad Tewel; Yoav Shalev; Idan Schwartz; Lior Wolf

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated image-sentence pairs. While such models provide a powerful score for matching and for subsequent zero-shot tasks, they cannot generate a caption for a given image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
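The image arithmetic mentioned in the abstract can be illustrated with a toy sketch. It assumes that images and text are mapped into one shared embedding space and that the arithmetic is done on unit-normalized vectors there; the abstract itself does not spell out this mechanism. The 3-dimensional vectors, the helper names (`combine`, `best_caption`), and the candidate sentences are all invented for illustration. The sketch also picks the closest sentence from a fixed candidate list, which is a stand-in for the language-model generation step the paper actually uses.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def combine(terms):
    """Add or subtract unit-normalized embeddings.

    `terms` is a list of (sign, vector) pairs, e.g. [(+1, a), (-1, b)].
    """
    dim = len(terms[0][1])
    total = [0.0] * dim
    for sign, vec in terms:
        unit = normalize(vec)
        for i in range(dim):
            total[i] += sign * unit[i]
    return normalize(total)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def best_caption(query, candidates):
    """Return the candidate sentence whose embedding is closest to `query`."""
    return max(candidates, key=lambda text: cosine(query, candidates[text]))


# Hypothetical toy embeddings; the axes loosely mean (royalty, male, female).
# A real system would obtain these from an image-text matching model.
image_king = [1.0, 1.0, 0.0]
text_man = [0.0, 1.0, 0.0]
text_woman = [0.0, 0.0, 1.0]

# Hypothetical candidate sentences with hand-made embeddings.
candidates = {
    "a king": [1.0, 1.0, 0.0],
    "a queen": [1.0, 0.0, 1.0],
    "a man": [0.0, 1.0, 0.0],
}

# image(king) - text(man) + text(woman): inputs mix an image and text.
query = combine([(+1, image_king), (-1, text_man), (+1, text_woman)])
result = best_caption(query, candidates)
print(result)
```

In this toy setup the combined vector lands closest to the "a queen" candidate, which mirrors the visual analogy tests the abstract describes, with a sentence as the output.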

updated: Mon Nov 29 2021 11:01:49 GMT+0000 (UTC)

published: Mon Nov 29 2021 11:01:49 GMT+0000 (UTC)

arXiv
