BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

W. Hu; Y. Xu; Y. Li; W. Li; Z. Chen; Z. Tu

Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs to LLMs. Yet this process is constrained by the fixed token count, potentially curtailing the recognition of scenes with text-rich context. To address this limitation, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach helps the model capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance on text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and on typical VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), compared to our baseline, InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 13 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.git
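The combination described in the abstract, learned query embeddings plus directly projected patch embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dimensions, the use of a single bias-free linear map for each projection, and the ordering of the concatenation are all assumptions made for illustration only. The abstract specifies only that both kinds of embeddings reach the LLM.

```python
import random


def linear_project(vectors, weight):
    """Apply a bias-free linear map to each vector: out = W @ v."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight] for v in vectors]


def build_visual_prompt(query_embeds, patch_embeds, w_query, w_patch):
    """Hypothetical helper: map both embedding sets into the LLM input
    space and concatenate them along the token axis.

    Placing the query tokens first is an assumption, not something the
    abstract states.
    """
    projected_queries = linear_project(query_embeds, w_query)
    projected_patches = linear_project(patch_embeds, w_patch)
    return projected_queries + projected_patches


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Toy sizes chosen for readability; they are not taken from the paper.
NUM_QUERIES, NUM_PATCHES = 4, 9
QUERY_DIM, VISION_DIM, LLM_DIM = 5, 6, 8

query_embeds = random_matrix(NUM_QUERIES, QUERY_DIM, rng)   # fixed-size summary
patch_embeds = random_matrix(NUM_PATCHES, VISION_DIM, rng)  # one per image patch
w_query = random_matrix(LLM_DIM, QUERY_DIM, rng)
w_patch = random_matrix(LLM_DIM, VISION_DIM, rng)

visual_prompt = build_visual_prompt(query_embeds, patch_embeds, w_query, w_patch)
print(len(visual_prompt), len(visual_prompt[0]))  # prints "13 8"
```

In this sketch the soft prompt grows from 4 tokens (queries alone) to 13 tokens (queries plus patches), which mirrors the abstract's argument: the fixed query set caps how much image detail reaches the LLM, while the projected patches carry additional per-region information.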

updated: Sat Aug 19 2023 07:53:43 GMT+0000 (UTC)

published: Sat Aug 19 2023 07:53:43 GMT+0000 (UTC)

arXiv
