InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language

Zhaoyang Liu; Yinan He; Wenhai Wang; Weiyun Wang; Yi Wang; Shoufa Chen; Qinglong Zhang; Yang Yang; Qingyun Li; Jiashuo Yu; Kunchang Li; Zhe Chen; Xue Yang; Xizhou Zhu; Yali Wang; Limin Wang; Ping Luo; Jifeng Dai; Yu Qiao

We present an interactive visual framework named InternChat, or iChat for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions, such as pointing movements, that let users directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternChat stands for interaction, nonverbal, and chatbots. Unlike existing interactive systems that rely on pure language, the proposed iChat incorporates pointing instructions, which significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios with more than two objects. Additionally, iChat uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89% GPT-4 quality). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternChat.

updated: Wed May 10 2023 17:45:08 GMT+0000 (UTC)

published: Tue May 09 2023 17:58:34 GMT+0000 (UTC)

arXiv
