Recurrent Space-time Graph Neural Networks

Andrei Nicolicioiu; Iulia Duta; Marius Leordeanu

Learning in the space-time domain remains a very challenging problem in machine learning and computer vision. Current computational models for understanding spatio-temporal visual data are heavily rooted in the classical single-image-based paradigm. It is not yet well understood how to integrate information in space and time into a single, general model. We propose a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene. Nodes and edges in our graph have dedicated neural networks for processing information. Nodes operate over features extracted from local parts in space and time and over previous memory states. Edges process messages between connected nodes at different locations and spatial scales, or between past and present time. Messages are passed iteratively in order to transmit information globally and establish long-range interactions. Our model is general: it could learn to recognize a variety of high-level spatio-temporal concepts and be applied to different learning tasks. We demonstrate, through extensive experiments and ablation studies, that our model outperforms strong baselines and top published methods on recognizing complex activities in video. Moreover, we obtain state-of-the-art performance on the challenging Something-Something human-object interaction dataset.
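To make the abstract's description more concrete, the following is a minimal NumPy sketch of the general idea only: edge functions compute messages between connected nodes, node functions update states from the aggregated messages, several message-passing rounds are run per frame, and each node carries a memory state from one frame to the next. It is not the authors' implementation. The function names (`init_params`, `space_step`, `run`), the single-layer `tanh` maps, the mean aggregation, and the fixed adjacency matrix are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(dim):
    """Random weights for the hypothetical edge, node and memory functions."""
    scale = 1.0 / np.sqrt(dim)
    return {
        "W_edge": rng.normal(0.0, scale, (2 * dim, dim)),  # message from (sender, receiver)
        "W_node": rng.normal(0.0, scale, (2 * dim, dim)),  # update from (state, aggregated message)
        "W_time": rng.normal(0.0, scale, (2 * dim, dim)),  # merge of (frame features, previous memory)
    }

def space_step(h, adj, p):
    """One round of message passing between connected nodes (assumed form)."""
    n, _ = h.shape
    senders = np.repeat(h[None, :, :], n, axis=0)     # senders[i, j] = h[j]
    receivers = np.repeat(h[:, None, :], n, axis=1)   # receivers[i, j] = h[i]
    pair = np.concatenate([senders, receivers], axis=-1)
    messages = np.tanh(pair @ p["W_edge"])            # messages[i, j]: from node j to node i
    # Average the incoming messages over each node's neighbours only.
    weights = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    agg = np.einsum("ij,ijd->id", weights, messages)
    return np.tanh(np.concatenate([h, agg], axis=-1) @ p["W_node"])

def run(features, adj, p, n_iters=3):
    """features: (T, N, D) per-frame node features. Returns the final (N, D) node states."""
    _, n, d = features.shape
    memory = np.zeros((n, d))
    for frame in features:
        # Time: combine the current local features with the previous memory state.
        h = np.tanh(np.concatenate([frame, memory], axis=-1) @ p["W_time"])
        # Space: repeated rounds let information travel beyond direct neighbours.
        for _ in range(n_iters):
            h = space_step(h, adj, p)
        memory = h
    return memory
```

In this sketch, information from one node can reach a node k edges away only after k rounds of `space_step`, which is why the messages are passed iteratively rather than once.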

updated: Mon Dec 23 2019 15:18:38 GMT+0000 (UTC)

published: Thu Apr 11 2019 08:51:48 GMT+0000 (UTC)

arXiv
