ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

Kunyu Peng; Alina Roitberg; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen

Automatically understanding human behaviour allows household robots to identify the most critical needs and plan how to assist the human according to the current situation. However, most such methods are developed under the assumption that a large number of labelled training examples are available for all concepts of interest. Robots, on the other hand, operate in constantly changing unstructured environments and need to adapt to novel action categories from very few samples. Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays, which are then used as input to convolutional neural networks. We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement. In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning. Inspired by the recent success of feature enhancement methods in semi-supervised learning, we further introduce ProFormer, an improved training strategy that applies soft attention to iteratively estimated action category prototypes, which are used to augment the embeddings and compute an auxiliary consistency loss. Extensive experiments consistently demonstrate the effectiveness of our approach for one-shot recognition from body poses, achieving state-of-the-art results on multiple datasets and surpassing the best published approach on the challenging NTU-120 one-shot benchmark by 1.84%. Our code will be made publicly available at https://github.com/KPeng9510/ProFormer.
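To make the prototype idea in the abstract more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes embeddings are plain lists of floats, that soft attention is a softmax over dot products between an embedding and each prototype, and that augmentation is a simple residual addition of the attended prototype mixture. These choices are assumptions, as are the function names `augment_with_prototypes` and `one_shot_classify` and the `temperature` parameter. The transformer backbone, patch embedding, metric-learning objective, and auxiliary consistency loss are not shown.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def augment_with_prototypes(embedding, prototypes, temperature=1.0):
    """Add a soft-attention-weighted mixture of prototypes to an embedding.

    Assumption: attention scores are dot products scaled by `temperature`,
    and the augmentation is a residual sum. The paper may differ.
    """
    weights = softmax([dot(embedding, p) / temperature for p in prototypes])
    attended = [
        sum(w * p[i] for w, p in zip(weights, prototypes))
        for i in range(len(embedding))
    ]
    augmented = [e + a for e, a in zip(embedding, attended)]
    return augmented, weights


def one_shot_classify(query, references):
    """Return the label whose single reference embedding is most similar.

    `references` maps each novel class label to one embedding (one shot).
    """
    return max(references, key=lambda label: cosine(query, references[label]))


# Toy 2-D example with hypothetical class labels.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
augmented, weights = augment_with_prototypes([0.9, 0.1], prototypes)
predicted = one_shot_classify([0.8, 0.2], {"wave": [1.0, 0.0], "sit": [0.0, 1.0]})
```

In this toy setup the attention weights sum to one and favour the prototype closest to the embedding, and the query is assigned to the class whose single reference sample has the highest cosine similarity.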

updated: Wed Feb 23 2022 11:11:54 GMT+0000 (UTC)

published: Wed Feb 23 2022 11:11:54 GMT+0000 (UTC)

arXiv
