SVIT: Scaling up Visual Instruction Tuning

Bo Zhao; Boya Wu; Tiejun Huang

Thanks to the emergence of foundation models, large language and vision models have been integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models show impressive performance in visual understanding and reasoning, their limits remain largely under-explored because high-quality instruction tuning data is scarce. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning data, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Besides its volume, the proposed dataset features high quality and rich diversity, as it is generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT can significantly improve multimodal performance in terms of visual perception, reasoning, and planning.

updated: Sun Jul 09 2023 03:25:14 GMT+0000 (UTC)

published: Sun Jul 09 2023 03:25:14 GMT+0000 (UTC)

arXiv
