Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Long Hoang Dang; Thao Minh Le; Vuong Le; Truyen Tran

Video Question Answering (Video QA) is a powerful testbed for developing new AI capabilities. The task requires learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior, and their interactions. Toward this goal, we propose an object-oriented reasoning approach in which a video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of the video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture, called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains the objects' consistent lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, the dynamic, interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-art results on these tasks. Analysis of the model's behavior indicates that object-oriented reasoning is a reliable, interpretable, and efficient approach to Video QA.
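The bottom-up flow described in the abstract (per-object lifelines, query-conditioned summarization, object interaction, then abstraction to a higher level) can be illustrated with a toy sketch. This is not the authors' HOSTR implementation: the function names (`attend`, `interact`, `reason_over_video`), the two-level clip/video split, and the use of plain dot-product attention in place of learned neural units are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, items: List[Vector]) -> Vector:
    """Query-conditioned weighted average of `items` (dot-product attention)."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]


def interact(objects: List[Vector]) -> List[Vector]:
    """One round of message passing on a fully connected object graph.

    Each object receives a similarity-weighted average of all objects
    (itself included) and adds it to its own features (residual update).
    """
    updated = []
    for obj in objects:
        message = attend(obj, objects)
        updated.append([o + m for o, m in zip(obj, message)])
    return updated


def reason_over_video(video: List[List[Vector]], query: Vector, clip_len: int) -> Vector:
    """Toy two-level reasoning. video[t][k] is the feature of object k at frame t,
    so the sequence video[0][k], video[1][k], ... is object k's lifeline.
    (Hypothetical structure, assumed for this sketch.)"""
    num_objects = len(video[0])
    clip_summaries = []
    for start in range(0, len(video), clip_len):
        clip = video[start:start + clip_len]
        # Lower level: summarize each object's lifeline segment w.r.t. the query,
        # then let the summarized objects interact within the clip.
        per_object = [attend(query, [frame[k] for frame in clip]) for k in range(num_objects)]
        clip_summaries.append(interact(per_object))
    # Upper level: summarize each object across clips, interact again,
    # and pool the objects into a single query-conditioned vector.
    per_object = [attend(query, [clip[k] for clip in clip_summaries]) for k in range(num_objects)]
    return attend(query, interact(per_object))


# Usage: 4 frames, 2 tracked objects, 2-dimensional features, clips of 2 frames.
video = [[[1.0, 0.5], [1.0, 0.5]] for _ in range(4)]
answer_features = reason_over_video(video, query=[0.3, -0.2], clip_len=2)
```

In a real model the pooled vector would feed an answer decoder, and each unit would carry trained parameters; the sketch only shows how object identity is preserved across levels while the query steers every aggregation step.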

updated: Fri Jun 25 2021 05:12:42 GMT+0000 (UTC)

published: Fri Jun 25 2021 05:12:42 GMT+0000 (UTC)

arXiv

