Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

Zineng Tang; Jaemin Cho; Jie Lei; Mohit Bansal

We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop on cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and latency while maintaining competitive performance. We also provide comprehensive analyses of various aspects of our framework, including pretraining data, scalability of latent size and input size, dropping cross-attention layers at inference to reduce latency, modality aggregation strategy, positional encoding, and weight initialization strategy. Our code and checkpoints are available at https://github.com/zinengtang/Perceiver_VL
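
The linear scaling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified cross-attention with no learned projections, no latent self-attention, and no MLP blocks, and the names `cross_attention`, `encode`, and `drop_prob` are hypothetical. With N latents and M input tokens, each cross-attention layer builds an N x M score matrix, so for a fixed N the cost grows linearly in M, whereas self-attention over the inputs would build an M x M matrix. The `drop_prob` argument imitates LayerDrop by randomly skipping cross-attention layers.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, inputs):
    """Latents (N x d) attend to inputs (M x d) and return updated latents (N x d).

    Assumption: queries are the raw latents, keys and values are the raw inputs.
    The score matrix is N x M, so the cost is linear in M for fixed N.
    """
    d = len(latents[0])
    keys_t = [list(col) for col in zip(*inputs)]           # d x M
    scores = matmul(latents, keys_t)                       # N x M
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, inputs)                     # N x d
    # Residual connection: add the attended values to the latents.
    return [[l + a for l, a in zip(l_row, a_row)]
            for l_row, a_row in zip(latents, attended)]


def encode(latents, inputs, num_layers=3, drop_prob=0.0, rng=None):
    """Apply cross-attention repeatedly, skipping each layer with probability
    drop_prob (a LayerDrop-style assumption). Returns the final latents and
    the number of layers that were actually applied."""
    rng = rng or random.Random(0)
    applied = 0
    for _ in range(num_layers):
        if rng.random() < drop_prob:
            continue  # layer dropped: latents pass through unchanged
        latents = cross_attention(latents, inputs)
        applied += 1
    return latents, applied


if __name__ == "__main__":
    rng = random.Random(1)
    inputs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]   # M=50, d=4
    latents = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]   # N=2, d=4
    out, applied = encode(latents, inputs, num_layers=3, drop_prob=0.5)
    print(len(out), len(out[0]), applied)
```

Doubling the number of input rows doubles the size of the score matrix in this sketch, while the output keeps the fixed latent shape N x d regardless of input length.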

updated: Mon Nov 21 2022 18:22:39 GMT+0000 (UTC)

published: Mon Nov 21 2022 18:22:39 GMT+0000 (UTC)

arXiv
