Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu; Shengming Yin; Weizhen Qi; Xiaodong Wang; Zecheng Tang; Nan Duan

ChatGPT is attracting cross-field interest because it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models such as Visual Transformers or Stable Diffusion show great visual understanding and generation capabilities, but they are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, taking into account models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.
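
The idea of injecting visual model information into a language model through prompts can be illustrated with a small sketch. This is not the authors' implementation: the `VisualTool` class, the `Action:` / `Action Input:` reply format, the tool names, and the stub functions are all hypothetical, chosen only to show how tool descriptions might be turned into a prompt and how a reply might be routed to a visual model. No real language model or vision model is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class VisualTool:
    """Hypothetical wrapper around one visual model (stubbed here)."""
    name: str
    description: str
    run: Callable[[str], str]


# Stub tools: real systems would call actual vision models instead.
TOOLS: Dict[str, VisualTool] = {
    "Image Captioning": VisualTool(
        name="Image Captioning",
        description="Input: an image path. Output: a text description.",
        run=lambda path: f"caption of {path}",
    ),
    "Text To Image": VisualTool(
        name="Text To Image",
        description="Input: a text prompt. Output: a path to a new image.",
        run=lambda text: "image/generated.png",
    ),
}


def build_prompt(tools: Dict[str, VisualTool]) -> str:
    """Describe every tool in plain text so a language model can choose one."""
    lines = ["You can use the following visual tools:"]
    for tool in tools.values():
        lines.append(f"- {tool.name}: {tool.description}")
    lines.append(
        "To use a tool, reply with 'Action: <tool name>' "
        "and 'Action Input: <input>' on separate lines."
    )
    return "\n".join(lines)


def parse_action(reply: str) -> Optional[Tuple[str, str]]:
    """Extract (tool name, input) from a reply, or None if no tool is requested."""
    name = None
    arg = None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Action Input:"):
            arg = line[len("Action Input:"):].strip()
        elif line.startswith("Action:"):
            name = line[len("Action:"):].strip()
    if name is None or arg is None:
        return None
    return name, arg


def dispatch(reply: str, tools: Dict[str, VisualTool]) -> str:
    """Run the requested tool, or pass a plain-text reply through unchanged."""
    parsed = parse_action(reply)
    if parsed is None:
        return reply
    name, arg = parsed
    if name not in tools:
        return f"Unknown tool: {name}"
    return tools[name].run(arg)
```

In a multi-step setting, the string returned by `dispatch` would be appended to the conversation as an observation and the language model would be asked again, until it answers in plain text instead of requesting another tool.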

updated: Wed Mar 08 2023 15:50:02 GMT+0000 (UTC)

published: Wed Mar 08 2023 15:50:02 GMT+0000 (UTC)

arXiv
