Text as Neural Operator: Image Manipulation by Text Instruction

Tianhao Zhang; Hung-Yu Tseng; Lu Jiang; Weilong Yang; Honglak Lee; Irfan Essa

In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision communities. The input to conditional image generation has evolved from image-only to multimodal. In this paper, we study a setting that allows users to edit an image containing multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal, including (1) a reference image and (2) a natural-language instruction that describes the desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators that locally modify the image feature. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of greater fidelity and semantic relevance and, when used as an image query, leads to better retrieval performance.

updated: Mon Nov 29 2021 16:48:56 GMT+0000 (UTC)

published: Tue Aug 11 2020 07:07:10 GMT+0000 (UTC)

arXiv
