Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Min Peng; Chongyang Wang; Yuan Gao; Yu Shi; Xiang-Dong Zhou

Video question answering (VideoQA) is challenging because it combines visual understanding with natural language processing. Most existing approaches ignore visual appearance-motion information at different temporal scales, and it remains unclear how to combine the multilevel processing capacity of a deep learning model with such multiscale information. To address these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules: Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations. On top of these representations, PVR uses a shared transformer encoder to infer the visual cues at each level in parallel, so that it can answer different question types that may rely on visual information at the relevant levels. Extensive experiments on three VideoQA datasets demonstrate improved performance over previous state-of-the-art methods and justify the effectiveness of each part of our method.
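
The abstract does not spell out how the multiscale sampling is performed, so the following is only a minimal sketch of one plausible scheme, not the authors' procedure. It assumes that the number of clips doubles at each scale and that clips are centred on equal-length segments of the video; the function name `multiscale_clip_indices` and its parameters are hypothetical and chosen purely for illustration.

```python
def multiscale_clip_indices(num_frames, num_scales, clip_len):
    """Return, for each scale, a list of clips (each a list of frame indices).

    Assumption (not taken from the paper): scale k holds 2**k clips,
    each centred on one of 2**k equal-length segments of the video.
    """
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")

    scales = []
    for level in range(num_scales):
        num_clips = 2 ** level  # assumed doubling of clips per scale
        clips = []
        for c in range(num_clips):
            # Centre frame of the c-th of num_clips equal segments.
            center = (2 * c + 1) * num_frames // (2 * num_clips)
            # Clamp the clip so it stays inside the video.
            start = min(max(center - clip_len // 2, 0), num_frames - clip_len)
            clips.append(list(range(start, start + clip_len)))
        scales.append(clips)
    return scales


# Example: a 64-frame video, 3 scales, 4-frame clips.
scales = multiscale_clip_indices(num_frames=64, num_scales=3, clip_len=4)
for level, clips in enumerate(scales):
    print(level, [clip[0] for clip in clips])  # first frame of each clip
```

In a setup like this, coarse scales see a few clips spread over the whole video while finer scales see more, denser clips; each scale's clips would then be passed to appearance and motion feature extractors before interacting with the question embeddings.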

updated: Mon May 09 2022 06:28:56 GMT+0000 (UTC)

published: Mon May 09 2022 06:28:56 GMT+0000 (UTC)

arXiv
