Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz; Hanoona Rasheed; Salman Khan; Fahad Shahbaz Khan

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.

updated: Thu Jun 08 2023 17:59:56 GMT+0000 (UTC)

published: Thu Jun 08 2023 17:59:56 GMT+0000 (UTC)

arXiv

