Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Min Peng; Chongyang Wang; Yu Shi; Xiang-Dong Zhou

This paper presents a new method for end-to-end Video Question Answering (VideoQA) that departs from the currently popular practice of large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply combines a learnable word embedding layer with a few convolutional and transformer layers. We use an anisotropic pyramid to carry out video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. On five VideoQA benchmarks, we demonstrate better or on-par performance with high computational efficiency compared with state-of-the-art methods. Our ablation study shows the effectiveness of the pyramid and the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights.
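The "canonical pyramid" mentioned in the abstract (bottom-up and top-down pathways joined by lateral connections) can be illustrated with a minimal sketch. This is not the authors' PMT: it has no learned weights, no spatial/temporal decomposition, and no language interaction. The function names (`avg_pool_time`, `upsample_time`, `build_pyramid`), the use of average pooling for downsampling, nearest-neighbour upsampling, and element-wise addition for the lateral merge are all assumptions chosen for illustration.

```python
from typing import List

Frame = List[float]   # one feature vector per time step (hypothetical layout)
Level = List[Frame]   # a sequence of frames at one temporal scale


def avg_pool_time(level: Level) -> Level:
    """Halve the temporal length by averaging adjacent frame pairs.

    A trailing unpaired frame (odd length) is dropped.
    """
    pooled = []
    for i in range(0, len(level) - 1, 2):
        pooled.append([(a + b) / 2.0 for a, b in zip(level[i], level[i + 1])])
    return pooled


def upsample_time(level: Level, target_len: int) -> Level:
    """Nearest-neighbour upsampling along time to target_len frames."""
    n = len(level)
    return [list(level[min(t * n // target_len, n - 1)]) for t in range(target_len)]


def build_pyramid(frames: Level, num_levels: int) -> List[Level]:
    """Return pyramid levels, finest first, after the top-down merge."""
    if num_levels < 1 or len(frames) < 2 ** (num_levels - 1):
        raise ValueError("not enough frames for the requested number of levels")

    # Bottom-up pathway: progressively coarser temporal scales.
    bottom_up = [frames]
    for _ in range(num_levels - 1):
        bottom_up.append(avg_pool_time(bottom_up[-1]))

    # Top-down pathway: start at the coarsest level, upsample, and add the
    # same-scale bottom-up features (the lateral connection).
    top_down = [bottom_up[-1]]
    for lateral in reversed(bottom_up[:-1]):
        up = upsample_time(top_down[0], len(lateral))
        merged = [[x + y for x, y in zip(f_lat, f_up)]
                  for f_lat, f_up in zip(lateral, up)]
        top_down.insert(0, merged)
    return top_down


if __name__ == "__main__":
    toy_frames = [[1.0], [3.0], [5.0], [7.0]]  # 4 frames, 1-D features
    for level in build_pyramid(toy_frames, num_levels=3):
        print(len(level), level)
```

In this toy setup, each output level mixes coarse context propagated from above with the fine-grained features at its own scale; a real model would replace the pooling and addition with learned convolutional and attention layers.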

updated: Sun Mar 05 2023 10:09:11 GMT+0000 (UTC)

published: Sat Feb 04 2023 09:14:18 GMT+0000 (UTC)

arXiv
