Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Shijie Geng; Peng Gao; Moitreya Chatterjee; Chiori Hori; Jonathan Le Roux; Yongfeng Zhang; Hongsheng Li; Anoop Cherian

Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. The task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which yields better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer, which produces spatio-semantic graph representations for every frame, and an inter-frame aggregation module, which captures temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both answer generation and answer selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
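As a rough illustration of the shuffling idea mentioned in the abstract, the following minimal sketch permutes the order of per-head output vectors before concatenating them, and leaves the order untouched at evaluation time. The abstract does not specify where or how the shuffle is applied, so the head-level granularity, the training-only behavior, and the names `shuffle_heads`, `head_outputs`, and `rng` are all assumptions made for illustration, not the authors' implementation.

```python
import random

def shuffle_heads(head_outputs, rng, training=True):
    """Concatenate per-head output vectors, permuting head order in training.

    head_outputs: list of per-head vectors (each a list of floats).
    rng: a random.Random instance, passed in so runs are reproducible.
    NOTE: hypothetical helper; the shuffle granularity is an assumption.
    """
    order = list(range(len(head_outputs)))
    if training:
        rng.shuffle(order)  # random permutation of head indices
    # Concatenate the heads in the (possibly permuted) order.
    return [x for h in order for x in head_outputs[h]]

# Toy example: 4 attention heads, each producing a 2-dimensional output.
heads = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5]]
rng = random.Random(0)
train_out = shuffle_heads(heads, rng, training=True)
eval_out = shuffle_heads(heads, rng, training=False)
```

In this sketch each head's vector stays intact and only its position in the concatenated output changes, so downstream layers cannot rely on a fixed head ordering during training, which is one plausible reading of the regularization effect the abstract describes.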

updated: Tue Mar 02 2021 20:04:33 GMT+0000 (UTC)

published: Wed Jul 08 2020 02:00:22 GMT+0000 (UTC)

arXiv
