BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

Chaoya Jiang; Haiyang Xu; Wei Ye; Qinghao Ye; Chenliang Li; Ming Yan; Bin Bi; Shikun Zhang; Fei Huang; Songfang Huang

Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and ineffectiveness. Existing efforts address the challenge either by bottom-level patch extraction inside the ViT backbone or by top-level patch abstraction outside it, and neither balances training efficiency and effectiveness well. Inspired by text summarization in natural language processing, we propose a Bottom-Up Patch Summarization approach named BUS, which coordinates bottom-level extraction and top-level abstraction to efficiently learn a concise summary of lengthy visual token sequences. Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform coarse-grained visual token extraction, and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) on top of the backbone for top-level visual abstraction. This bottom-up collaboration enables BUS to achieve high training efficiency while maintaining or even improving effectiveness. We evaluate our approach on various vision-language understanding and generation tasks and show competitive downstream task performance while boosting training efficiency by 50%. Additionally, our model achieves state-of-the-art performance on many downstream tasks by increasing input image resolution without increasing computational costs over baselines.
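The two-stage idea in the abstract (extract a subset of patch tokens, then abstract them into a shorter summary) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `select_patches`, `abstract_patches`, and `summarize` are hypothetical, the selector here is a plain dot-product top-k against a single text vector rather than a learned TSPS module inside the ViT, and the abstraction step is a single softmax cross-attention with fixed query vectors rather than a trained Transformer decoder.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_patches(patches, text_vec, keep):
    """Extraction stage (stand-in for TSPS, an assumption):
    keep the `keep` patch tokens most similar to the text vector,
    preserving their original order."""
    ranked = sorted(
        range(len(patches)),
        key=lambda i: dot(patches[i], text_vec),
        reverse=True,
    )
    kept = sorted(ranked[:keep])
    return [patches[i] for i in kept]


def abstract_patches(patches, queries):
    """Abstraction stage (stand-in for PAD, an assumption):
    each query attends over the kept patches with scaled softmax
    attention and yields one summary token."""
    dim = len(patches[0])
    scale = math.sqrt(dim)
    summary = []
    for q in queries:
        logits = [dot(q, p) / scale for p in patches]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        summary.append(
            [sum(w * p[d] for w, p in zip(weights, patches)) / total
             for d in range(dim)]
        )
    return summary


def summarize(patches, text_vec, queries, keep):
    """Bottom-up summarization: extract first, then abstract."""
    return abstract_patches(select_patches(patches, text_vec, keep), queries)
```

In this sketch the output length equals the number of queries, so the sequence passed to later layers shrinks twice: once to `keep` tokens and once more to `len(queries)` tokens. That shortening is the source of the efficiency argument made in the abstract.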

updated: Mon Jul 17 2023 14:08:17 GMT+0000 (UTC)

published: Mon Jul 17 2023 14:08:17 GMT+0000 (UTC)

arXiv
