End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen; Yi-Hsuan Tsai; Ming-Hsuan Yang

We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Unlike most existing methods, which consider only RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues for a given event, performance may be affected by background clutter. Therefore, we use optical flow to focus on large motion, and depth maps to infer the scene configuration when the action is related to objects recognizable by their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the interactions between modalities. Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning. We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
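To make the idea of text-conditioned fusion over three modalities concrete, here is a minimal sketch in plain Python. It is not the authors' transformer-based scheme: it swaps in a single scaled dot-product attention step over per-modality feature vectors, and every name in it (`dynamic_fusion`, `text_query`, the toy feature values) is a hypothetical chosen for illustration. It only shows how fusion weights can vary with the query instead of being fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dynamic_fusion(text_query, modality_feats):
    """Fuse per-modality features with query-dependent weights.

    ASSUMPTION: one scaled dot-product attention step stands in for the
    paper's transformer-based fusion; this is an illustration only.
    text_query: list of floats (hypothetical sentence embedding).
    modality_feats: dict mapping modality name -> feature vector of the
    same dimension as text_query.
    """
    names = list(modality_feats)
    scale = math.sqrt(len(text_query))
    scores = [dot(text_query, modality_feats[n]) / scale for n in names]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, dimension by dimension.
    fused = [
        sum(w * modality_feats[n][i] for w, n in zip(weights, names))
        for i in range(len(text_query))
    ]
    return dict(zip(names, weights)), fused


# Toy features (made-up values): the query aligns best with "flow".
feats = {
    "rgb": [0.2, 0.9],
    "flow": [0.9, 0.1],
    "depth": [0.1, 0.1],
}
weights, fused = dynamic_fusion([1.0, 0.0], feats)
```

In this toy setup the weights sum to one and the modality whose features align best with the query receives the largest share, which mirrors the intuition that a motion-heavy description should lean on optical flow.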

updated: Mon Jul 12 2021 17:58:10 GMT+0000 (UTC)

published: Mon Jul 12 2021 17:58:10 GMT+0000 (UTC)

arXiv
