A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction

Gamze Akyol; Sanem Sariel; Eren Erdal Aksoy

操作行動の認識と予測のための変分グラフオートエンコーダ

何十年にもわたる研究にもかかわらず、人間の操作活動を理解することは、コンピュータービジョンとロボット工学において最も魅力的で挑戦的な研究トピックの1つであり、これまでもずっとそうです。観察された人間の操作動作の認識と予測は、たとえば、人間とロボットの相互作用やデモンストレーションからのロボット学習に関連するアプリケーションにルーツがあります。現在の研究動向は、RGBカメラ画像などの構造化されたユークリッドデータを処理するために、高度な畳み込みニューラルネットワークに大きく依存しています。ただし、これらのネットワークには、高次元の生データを処理できるようにするための計算が非常に複雑です。関連する作品とは異なり、ここでは、構造化されたユークリッドデータに依存する代わりに、シンボリックシーングラフから操作タスクの認識と予測を共同で学習するディープグラフオートエンコーダーを紹介します。私たちのネットワークには、2つのブランチを持つ変分オートエンコーダ構造があります。1つは入力グラフタイプを識別するためのもので、もう1つは将来のグラフを予測するためのものです。提案されたネットワークの入力は、シーン内の被写体とオブジェクトの間の空間的関係を格納するセマンティックグラフのセットです。ネットワーク出力は、検出および予測されたクラスタイプを表すラベルセットです。新しいモデルを、MANIACとMSRC-9の2つの異なるデータセットで異なる最先端の方法に対してベンチマークし、提案されたモデルがより優れたパフォーマンスを達成できることを示します。ソースコードhttps://github.com/gamzeakyol/GNetもリリースしています。

Despite decades of research, understanding human manipulation activities is, and has always been, one of the most attractive and challenging research topics in computer vision and robotics. Recognition and prediction of observed human manipulation actions have their roots in the applications related to, for instance, human-robot interaction and robot learning from demonstration. The current research trend heavily relies on advanced convolutional neural networks to process the structured Euclidean data, such as RGB camera images. These networks, however, come with immense computational complexity to be able to process high dimensional raw data. Different from the related works, we here introduce a deep graph autoencoder to jointly learn recognition and prediction of manipulation tasks from symbolic scene graphs, instead of relying on the structured Euclidean data. Our network has a variational autoencoder structure with two branches: one for identifying the input graph type and one for predicting the future graphs. The input of the proposed network is a set of semantic graphs which store the spatial relations between subjects and objects in the scene. The network output is a label set representing the detected and predicted class types. We benchmark our new model against different state-of-the-art methods on two different datasets, MANIAC and MSRC-9, and show that our proposed model can achieve better performance. We also release our source code https://github.com/gamzeakyol/GNet.

updated: Mon Oct 25 2021 21:40:42 GMT+0000 (UTC)

published: Mon Oct 25 2021 21:40:42 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト