Delving Deep into One-Shot Skeleton-based Action Recognition with Diverse Occlusions

Kunyu Peng; Alina Roitberg; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen

Occlusions are universal disruptions constantly present in the real world. Especially for sparse representations, such as human skeletons, a few occluded points might destroy the geometrical and temporal continuity, critically affecting the results. Yet research on data-scarce recognition from skeleton sequences, such as one-shot action recognition, does not explicitly consider occlusions despite their everyday pervasiveness. In this work, we explicitly tackle body occlusions for Skeleton-based One-shot Action Recognition (SOAR). We mainly consider two occlusion variants: 1) random occlusions and 2) more realistic occlusions caused by diverse everyday objects, which we generate by projecting existing IKEA 3D furniture models into the camera coordinate system of the 3D skeletons with different geometric parameters. We leverage the proposed pipeline to blend out portions of the skeleton sequences of three popular action recognition datasets and formalize the first benchmark for SOAR from partially occluded body poses. Another key property of our benchmark is the more realistic occlusions generated by everyday objects, as even in standard recognition from 3D skeletons, only randomly missing joints have been considered so far. We re-evaluate existing state-of-the-art frameworks for SOAR in the light of this new task and further introduce Trans4SOAR, a new transformer-based model which leverages three data streams and a mixed attention fusion mechanism to alleviate the adverse effects caused by occlusions. While our experiments demonstrate a clear decline in accuracy with missing skeleton portions, this effect is smaller with Trans4SOAR, which outperforms other architectures on all datasets. Although we specifically focus on occlusions, Trans4SOAR additionally yields state-of-the-art results in standard SOAR without occlusion, surpassing the best published approach by 2.85% on NTU-120.
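To make the first occlusion variant concrete, the following is a minimal sketch of random joint occlusion on a skeleton sequence. It is not the authors' pipeline: the function name `randomly_occlude`, the `(frames, joints, 3)` array layout, the independent per-frame, per-joint sampling, and the choice to zero out occluded joints are all assumptions made for illustration.

```python
import numpy as np


def randomly_occlude(skeleton, ratio, rng=None):
    """Blend out a random subset of joints in a skeleton sequence.

    skeleton: array of shape (frames, joints, 3) with 3D joint coordinates
              (assumed layout, not taken from the paper).
    ratio:    probability in [0, 1] that any single joint in any single
              frame is occluded.
    Returns the occluded copy and a boolean visibility mask of shape
    (frames, joints).
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    # Draw one uniform sample per (frame, joint); a joint stays visible
    # when its sample is at least `ratio`.
    visible = rng.random(skeleton.shape[:2]) >= ratio
    # Assumption: occluded joints are represented by zeroed coordinates.
    occluded = np.where(visible[..., None], skeleton, 0.0)
    return occluded, visible


# Hypothetical example: 30 frames, 25 joints, all coordinates set to 1.
sequence = np.ones((30, 25, 3))
occluded, visible = randomly_occlude(sequence, 0.3, np.random.default_rng(0))
```

The more realistic variant described above would instead derive the visibility mask from geometry, marking a joint as occluded when a projected furniture model lies between the camera and that joint.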

updated: Mon Jan 09 2023 20:55:30 GMT+0000 (UTC)

published: Wed Feb 23 2022 11:11:54 GMT+0000 (UTC)

arXiv
