InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

Zhaoyang Liu; Yinan He; Wenhai Wang; Weiyun Wang; Yi Wang; Shoufa Chen; Qinglong Zhang; Yang Yang; Qingyun Li; Jiashuo Yu; Kunchang Li; Zhe Chen; Xue Yang; Xizhou Zhu; Yali Wang; Limin Wang; Ping Luo; Jifeng Dai; Yu Qiao

We present InternGPT (iGPT for short), an interactive visual framework. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions, such as pointing movements, that let users directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternGPT stands for interaction, nonverbal, and chatbots. Unlike existing interactive systems that rely on pure language, iGPT incorporates pointing instructions, which significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios with more than two objects. Additionally, iGPT uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89% GPT-4 quality). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternGPT.
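
To illustrate why a pointing instruction helps when a scene contains several similar objects, here is a minimal sketch in Python. It is not iGPT's implementation: the `Box` type, `resolve_click`, `build_request`, and the request layout are all hypothetical names chosen for illustration. The assumption is that some detector has already produced labeled bounding boxes, and that a click is resolved to the smallest box containing it before the text instruction is passed on to a chatbot.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Hypothetical labeled bounding box in pixel coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def resolve_click(boxes: Sequence[Box], x: float, y: float) -> Optional[Box]:
    """Return the smallest box containing the click, or None if no box is hit."""
    hits = [b for b in boxes if b.contains(x, y)]
    return min(hits, key=Box.area) if hits else None


def build_request(instruction: str, target: Optional[Box]) -> dict:
    """Pair a language instruction with the pointed-at region (illustrative layout)."""
    region = None if target is None else (target.x0, target.y0, target.x1, target.y1)
    return {"instruction": instruction, "region": region}


# Three objects share the label "cup", so the text "remove this cup" alone is
# ambiguous; the click at (130, 50) picks out exactly one of them.
scene = [
    Box("table", 0, 0, 300, 200),
    Box("cup", 10, 40, 50, 80),
    Box("cup", 110, 40, 150, 80),
    Box("cup", 210, 40, 250, 80),
]
request = build_request("remove this cup", resolve_click(scene, 130, 50))
```

Choosing the smallest containing box is only one plausible heuristic for overlapping regions (here, the cup rather than the table beneath it); a real system could instead use a segmentation mask or ask the user to confirm.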

updated: Thu May 11 2023 14:48:24 GMT+0000 (UTC)

published: Tue May 09 2023 17:58:34 GMT+0000 (UTC)

arXiv
