Delving Deep into One-Shot Skeleton-based Action Recognition with Diverse Occlusions

Kunyu Peng; Alina Roitberg; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen

Occlusions are universal disruptions constantly present in the real world. Especially for sparse representations such as human skeletons, a few occluded points can destroy the geometrical and temporal continuity, critically affecting the results. Yet research on data-scarce recognition from skeleton sequences, such as one-shot action recognition, does not explicitly consider occlusions despite their everyday pervasiveness. In this work, we explicitly tackle body occlusions for Skeleton-based One-shot Action Recognition (SOAR). We mainly consider two occlusion variants: 1) random occlusions and 2) more realistic occlusions caused by diverse everyday objects, which we generate by projecting existing IKEA 3D furniture models into the camera coordinate system of the 3D skeletons. We leverage the proposed pipeline to blend out portions of skeleton sequences from three popular action recognition datasets (NTU-120, NTU-60 and Toyota Smart Home) and formalize the first benchmark for SOAR from partially occluded body poses. This is the first benchmark that considers occlusions for data-scarce action recognition. Another key property of our benchmark is the more realistic occlusions generated by everyday objects, as even in standard recognition from 3D skeletons, only randomly missing joints have been considered. We re-evaluate state-of-the-art frameworks in the light of this new task and further introduce Trans4SOAR, a new transformer-based model which leverages three data streams and a mixed attention fusion mechanism to alleviate the adverse effects caused by occlusions. While our experiments demonstrate a clear decline in accuracy with missing skeleton portions, this effect is smaller with Trans4SOAR, which outperforms other architectures on all datasets. Trans4SOAR additionally achieves state-of-the-art results in standard SOAR, surpassing the best published approach by 2.85% on NTU-120.
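
To make the first occlusion variant concrete, the following is a minimal sketch of random joint occlusion on a skeleton sequence. It is not the authors' pipeline: the function name `occlude_random_joints`, the choice to occlude the same joints in every frame, and the zero fill value are all assumptions made for illustration. The abstract does not specify these details.

```python
import random


def occlude_random_joints(sequence, ratio, seed=0, fill=(0.0, 0.0, 0.0)):
    """Return a copy of `sequence` with a random subset of joints blanked out.

    sequence: list of frames; each frame is a list of (x, y, z) joint tuples.
    ratio:    fraction of joints to occlude, between 0 and 1.

    Assumptions of this sketch (not taken from the paper): the same joints
    are occluded in every frame, and occluded joints are replaced by `fill`.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    if not sequence:
        return []

    num_joints = len(sequence[0])
    rng = random.Random(seed)  # local generator, so results are reproducible
    num_occluded = round(ratio * num_joints)
    occluded = set(rng.sample(range(num_joints), num_occluded))

    # Build new frames; the input sequence is left untouched.
    return [
        [fill if j in occluded else joint for j, joint in enumerate(frame)]
        for frame in sequence
    ]


# Usage: a toy 3-frame sequence with 25 joints per frame (illustrative values).
toy = [[(float(j + 1), 1.0, 1.0) for j in range(25)] for _ in range(3)]
occluded_toy = occlude_random_joints(toy, ratio=0.2, seed=1)
```

The realistic variant described in the abstract would replace the random joint selection with a visibility test against a projected 3D object, which is beyond this sketch.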

updated: Wed Jul 13 2022 17:34:01 GMT+0000 (UTC)

published: Wed Feb 23 2022 11:11:54 GMT+0000 (UTC)

arXiv
