Planting a SEED of Vision in Large Language Model

Yuying Ge; Yixiao Ge; Ziyun Zeng; Xintao Wang; Ying Shan

We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite the limitations, we remain confident in its natural capacity to unify visual and textual representations, facilitating scalable multimodal training with LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, the off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating our SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
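The two ideas the abstract relies on, discrete image tokens and a 1D causal dependency among them, can be illustrated with a toy sketch. This is not the SEED implementation: the codebook, feature vectors, vocabulary size, and the function names `quantize`, `causal_mask`, and `to_llm_sequence` are all hypothetical, and placing image ids after the text vocabulary is an assumed way to share one id space, not something the abstract specifies.

```python
# Illustrative sketch only; not the SEED implementation.
# All names, sizes, and values below are hypothetical.

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]

def causal_mask(n):
    """Row i marks the positions token i may depend on: 0..i, never later ones."""
    return [[j <= i for j in range(n)] for i in range(n)]

def to_llm_sequence(text_ids, image_ids, text_vocab_size):
    """Assumed convention: shift image ids past the text vocabulary so both share one id space."""
    return text_ids + [text_vocab_size + i for i in image_ids]

# Toy 2-D codebook with four entries and three image feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

image_ids = quantize(features, codebook)          # discrete visual tokens
mask = causal_mask(len(image_ids))                # left-to-right dependency
sequence = to_llm_sequence([5, 7], image_ids, text_vocab_size=10)

print(image_ids)  # [1, 2, 0]
print(sequence)   # [5, 7, 11, 12, 10]
```

Once image content is expressed as such ids, an autoregressive model can predict them left to right in the same way it predicts words, which is the property the first principle asks of the tokenizer.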

updated: Sat Aug 12 2023 04:42:29 GMT+0000 (UTC)

published: Sun Jul 16 2023 13:41:39 GMT+0000 (UTC)

arXiv

