Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering

Zenan Xu; Wanjun Zhong; Qinliang Su; Zijing Ou; Fuwei Zhang

A key challenge in video question answering is how to achieve cross-modal semantic alignment between textual concepts and the corresponding visual objects. Existing methods mostly seek to align word representations with video regions. However, word representations often cannot convey a complete description of a textual concept, which is in general described by a composition of several words. To address this issue, we propose to first build a syntactic dependency tree for each question with an off-the-shelf tool and use it to guide the extraction of meaningful word compositions. Based on the extracted compositions, a hypergraph is then built by viewing the words as nodes and the compositions as hyperedges. Hypergraph convolutional networks (HCN) are employed to learn the initial representations of word compositions. Afterwards, an optimal-transport-based method is proposed to perform cross-modal semantic alignment between the textual and visual semantic spaces. To reflect cross-modal influences, the cross-modal information is incorporated into the initial representations, leading to a model named cross-modality-aware syntactic HCN. Experimental results on three benchmarks show that our method outperforms all strong baselines. Further analyses demonstrate the effectiveness of each component and show that our model is good at modeling different levels of semantic composition and at filtering out irrelevant information.
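The pipeline described in the abstract (dependency tree, hypergraph, hypergraph convolution, optimal-transport alignment) can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the hand-written `heads` parse, the rule that each head word plus its direct dependents forms one hyperedge, the commonly used normalized hypergraph convolution with unit hyperedge weights, and Sinkhorn iterations as the optimal-transport solver. The function names, feature sizes, and random features are likewise hypothetical.

```python
import numpy as np

def build_incidence(heads):
    """heads[i] is the index of token i's syntactic head, or -1 for the root.
    Assumed rule: each head word plus its direct dependents is one hyperedge."""
    groups = {}
    for child, head in enumerate(heads):
        if head >= 0:
            groups.setdefault(head, {head}).add(child)
    edges = [sorted(members) for _, members in sorted(groups.items())]
    H = np.zeros((len(heads), len(edges)))
    for j, members in enumerate(edges):
        H[members, j] = 1.0  # node-by-hyperedge incidence matrix
    return H, edges

def hypergraph_conv(X, H, Theta):
    """One layer: relu(Dv^-1/2 H De^-1 H^T Dv^-1/2 X Theta), unit edge weights."""
    dv = H.sum(axis=1)  # node degrees
    de = H.sum(axis=0)  # hyperedge degrees (always >= 2 under the rule above)
    dv_inv_sqrt = np.where(dv > 0, 1.0 / np.sqrt(np.maximum(dv, 1e-12)), 0.0)
    A = (dv_inv_sqrt[:, None] * H) / de[None, :]
    A = A @ (H.T * dv_inv_sqrt[None, :])
    return np.maximum(A @ X @ Theta, 0.0)

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform marginals."""
    m, n = cost.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    K = np.exp(-cost / eps)
    u = np.ones(m)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Hypothetical parse of "what is the man in red doing" (root: "doing").
tokens = ["what", "is", "the", "man", "in", "red", "doing"]
heads = [6, 6, 3, 6, 5, 3, -1]
H, edges = build_incidence(heads)

rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), 8))   # stand-in word embeddings
Theta = rng.normal(size=(8, 8))         # stand-in layer weights
X_out = hypergraph_conv(X, H, Theta)

# Composition features: mean of the updated features of each hyperedge's words.
comp = (H / H.sum(axis=0)).T @ X_out
regions = rng.normal(size=(5, 8))       # stand-in video region features

def unit(M):
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)

cost = 1.0 - unit(comp) @ unit(regions).T  # cosine distance
plan = sinkhorn_plan(cost)                 # soft composition-to-region alignment
```

With this parse the three hyperedges group "the man red", "in red", and "what is man doing", so a multi-word concept such as "man in red" is represented by overlapping compositions rather than by single words. Each row of `plan` indicates how one composition's mass is spread over the video regions.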

updated: Fri May 13 2022 09:28:13 GMT+0000 (UTC)

published: Fri May 13 2022 09:28:13 GMT+0000 (UTC)

arXiv
