Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Min Peng; Chongyang Wang; Yuan Gao; Yu Shi; Xiang-Dong Zhou

Video question answering (VideoQA) is challenging because it combines visual understanding with natural language understanding. Existing approaches seldom exploit the appearance-motion information in a video at multiple temporal scales, and they frequently ignore the interaction between the question and the visual information when extracting textual semantics. To address these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules: a Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Guided by these question-specific semantics, VI infers visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, and we adopt residual connections for passing information across different levels. Extensive experiments on three VideoQA datasets show that the proposed method performs better than state-of-the-art methods.
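
The coarse-to-fine idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer details, so the pyramid rule (level l splits the frame features into 2**l mean-pooled segments), the single-head dot-product attention without learned projections, and the names `build_temporal_pyramid`, `cross_attention`, and `question_semantics` are all assumptions made for illustration.

```python
import numpy as np

def build_temporal_pyramid(frames, num_levels):
    """Assumed pyramid rule: level l splits the (T, D) frame features
    into 2**l contiguous segments and mean-pools each segment.
    Requires T >= 2**(num_levels - 1) so that no segment is empty."""
    pyramid = []
    for level in range(num_levels):
        segments = np.array_split(frames, 2 ** level, axis=0)
        pyramid.append(np.stack([seg.mean(axis=0) for seg in segments]))
    return pyramid

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with no learned weights
    (a simplification): each query row attends over keys_values rows."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def question_semantics(words, pyramid):
    """Walk the pyramid from coarse to fine; at each level the word
    features attend to the visual features, and a residual connection
    carries the previous level's result forward."""
    h = words
    for level_feats in pyramid:
        h = h + cross_attention(h, level_feats)
    return h

# Hypothetical toy inputs: 16 frame features and 5 word features, dimension 8.
rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))
words = rng.normal(size=(5, 8))
pyramid = build_temporal_pyramid(frames, num_levels=3)
semantics = question_semantics(words, pyramid)
```

Here `pyramid` holds 1, 2, and 4 visual feature vectors at its three levels, and `semantics` keeps one vector per question word. A real model would add learned query, key, and value projections, multiple heads, and normalization; those are omitted because the abstract does not specify them.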

updated: Fri Sep 10 2021 08:31:58 GMT+0000 (UTC)

published: Fri Sep 10 2021 08:31:58 GMT+0000 (UTC)

arXiv
