The CLIP Model is Secretly an Image-to-Prompt Converter

Yuxuan Ding; Chunna Tian; Haoxuan Ding; Lingqiao Liu

The Stable Diffusion model is a prominent text-to-image generation model that takes a text prompt as its input, which is encoded using Contrastive Language-Image Pre-Training (CLIP). However, text prompts are limited in their ability to incorporate implicit information from reference images. Existing methods have attempted to address this limitation by employing expensive training procedures involving millions of training samples for image-to-image generation. In contrast, this paper demonstrates that the CLIP model, as used in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts. Such an image-to-prompt conversion can be achieved with a linear projection matrix that is calculated in closed form. Moreover, the paper shows that this capability can be further enhanced either by using a small amount of similar-domain training data (approximately 100 images) or by running several online training steps (around 30 iterations) on the reference images. With these approaches, the proposed method offers a simple and flexible solution to bridge the gap between images and text prompts. The method can be applied to tasks such as image variation and image editing, enabling more effective and seamless interaction between images and text prompts.
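The abstract does not spell out the closed form. As a minimal sketch of one plausible reading, assume CLIP maps text features and image features into a joint space through two linear heads, called `w_text` and `w_image` here (hypothetical names, filled with random stand-in values rather than real CLIP weights). A matrix that carries image features into the text-feature space can then be obtained with a pseudo-inverse, with no training involved. This illustrates the idea of a closed-form linear projection and is not confirmed to be the authors' exact procedure.

```python
import numpy as np


def closed_form_projection(w_text: np.ndarray, w_image: np.ndarray) -> np.ndarray:
    """Return P so that w_text @ (P @ f_img) matches w_image @ f_img.

    Assumption: both heads are plain linear maps into the same joint space.
    The pseudo-inverse gives the minimum-norm least-squares solution.
    """
    return np.linalg.pinv(w_text) @ w_image


# Toy dimensions (assumed for illustration, much smaller than real CLIP sizes).
d_text, d_image, d_joint = 6, 8, 4

rng = np.random.default_rng(0)
w_text = rng.standard_normal((d_joint, d_text))    # stand-in text projection head
w_image = rng.standard_normal((d_joint, d_image))  # stand-in image projection head

projection = closed_form_projection(w_text, w_image)

# Convert one stand-in image feature into a pseudo text feature.
f_img = rng.standard_normal(d_image)
pseudo_text = projection @ f_img

# Both features should land on the same point of the joint space.
joint_from_text = w_text @ pseudo_text
joint_from_image = w_image @ f_img
```

In this toy setup `w_text` has full row rank, so the two joint-space vectors agree up to floating-point error. With real weights the quality of the match would depend on the actual projection heads, and the result would still need to be placed into the prompt embedding that the diffusion model consumes.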

updated: Mon May 22 2023 04:52:12 GMT+0000 (UTC)

published: Mon May 22 2023 04:52:12 GMT+0000 (UTC)

arXiv
