Object-Centric Representation Learning for Video Question Answering

Long Hoang Dang; Thao Minh Le; Vuong Le; Truyen Tran

Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. The task demands new capabilities to integrate video processing, language understanding, the binding of abstract linguistic concepts to concrete visual artifacts, and deliberative reasoning over space-time. Neural networks offer a promising approach to reaching this potential by learning from examples rather than relying on handcrafted features and rules. However, neural networks are predominantly feature-based: they map data to unstructured vectorial representations and can therefore fall into the trap of exploiting shortcuts through surface statistics instead of performing the true systematic reasoning seen in symbolic systems. To tackle this issue, we advocate object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition and high-level symbolic algebra. To this end, we propose a new query-guided representation framework that turns a video into an evolving relational graph of objects whose features and interactions are dynamically and conditionally inferred. The object lives are then summarized into resumes, which lend themselves naturally to deliberative relational reasoning that produces an answer to the query. The framework is evaluated on major Video QA datasets, demonstrating clear benefits of the object-centric approach to video reasoning.
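The pipeline outlined in the abstract (per-object features over time, a query-guided summary of each object's "life" into a resume, then relational reasoning over the resumes) can be illustrated with a toy sketch. This is not the authors' model: the abstract gives no architectural details, so the dot-product attention, the dense similarity graph, the single message-passing round, and every function name below are assumptions chosen only to make the idea concrete.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a list of floats."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def summarize_object_lives(tracks, query):
    """Collapse each object's per-frame features into one query-guided 'resume'.

    tracks: list of N objects, each a list of T feature vectors of length D.
    query:  question embedding of length D.
    (Hypothetical helper; attention over frames is an assumed mechanism.)
    """
    scale = math.sqrt(len(query))
    resumes = []
    for frames in tracks:
        # Frames that align with the query receive larger weights.
        weights = softmax([dot(f, query) / scale for f in frames])
        resumes.append(weighted_sum(weights, frames))
    return resumes

def relate_objects(resumes):
    """One round of message passing over a dense, similarity-based object graph.

    (Hypothetical helper; the graph construction is an assumption.)
    """
    scale = math.sqrt(len(resumes[0]))
    refined = []
    for r in resumes:
        edges = softmax([dot(r, other) / scale for other in resumes])
        message = weighted_sum(edges, resumes)
        refined.append([a + b for a, b in zip(r, message)])
    return refined

def answer_features(tracks, query):
    """Pool the refined object resumes into one query-conditioned vector."""
    refined = relate_objects(summarize_object_lives(tracks, query))
    scale = math.sqrt(len(query))
    weights = softmax([dot(r, query) / scale for r in refined])
    return weighted_sum(weights, refined)

# Toy input: 2 objects, 3 frames each, 2-dimensional features.
tracks = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]
query = [1.0, 0.0]
resumes = summarize_object_lives(tracks, query)
pooled = answer_features(tracks, query)
```

In a real system the pooled vector would feed an answer decoder and the features would come from an object detector and tracker; both are outside the scope of this sketch.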

updated: Fri Jul 09 2021 00:06:59 GMT+0000 (UTC)

published: Mon Apr 12 2021 02:37:20 GMT+0000 (UTC)

arXiv
