InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai; Junnan Li; Dongxu Li; Anthony Meng Huat Tiong; Junqi Zhao; Weisheng Wang; Boyang Li; Pascale Fung; Steven Hoi

General-purpose language models that can solve various language-domain tasks have emerged, driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively underexplored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather 26 publicly available datasets covering a wide variety of tasks, transform them into instruction-tuning format, and categorize them into two clusters: one for held-in instruction tuning and one for held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also achieve state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
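The data preparation described in the abstract (converting datasets to instruction-tuning format and splitting them into held-in and held-out clusters) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the templates, the field names (`text_input`, `text_output`), the helper functions, and the placeholder dataset names are hypothetical and are not taken from the paper or from the LAVIS code.

```python
import random

# Hypothetical instruction templates; the abstract does not list the real ones.
TEMPLATES = [
    "Question: {question} Short answer:",
    "{question} Answer briefly.",
]


def to_instruction_sample(image_id, question, answer, rng):
    """Wrap a raw (image, question, answer) triple in a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    return {
        "image": image_id,
        "text_input": template.format(question=question),
        "text_output": answer,
    }


def split_clusters(dataset_names, held_out):
    """Partition dataset names into held-in (tuning) and held-out (zero-shot) clusters."""
    held_out_set = set(held_out)
    held_in = [name for name in dataset_names if name not in held_out_set]
    held_out_list = [name for name in dataset_names if name in held_out_set]
    return held_in, held_out_list


# Placeholder data; these are not datasets or samples from the paper.
rng = random.Random(0)
sample = to_instruction_sample("img_001.jpg", "What color is the car?", "red", rng)
held_in, held_out = split_clusters(
    ["dataset_a", "dataset_b", "dataset_c"], held_out=["dataset_c"]
)
```

In this sketch the model would be tuned only on samples drawn from the held-in cluster, while the held-out cluster is reserved for zero-shot evaluation, which mirrors the split the abstract describes.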

updated: Thu May 11 2023 00:38:10 GMT+0000 (UTC)

published: Thu May 11 2023 00:38:10 GMT+0000 (UTC)

arXiv
