Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos

Yilin Wen; Hao Pan; Lei Yang; Jia Pan; Taku Komura; Wenping Wang

Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task because of self-occlusion and ambiguity. To address these issues, we develop a transformer-based framework that exploits temporal information for robust estimation. Noting that hand pose estimation and action recognition operate at different temporal granularities yet are semantically correlated, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O. Extensive ablation studies verify our design choices. We will open-source code and data to facilitate future research.
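The two temporal granularities described in the abstract can be illustrated with a minimal sketch. It shows only the data flow of a two-stage hierarchy: short windows are processed first, and the resulting per-frame outputs are then aggregated over the whole clip. Everything here is an assumption made for illustration: the function names, the window size of 4, and the averaging placeholders, which stand in for the transformer encoders and are not the authors' model.

```python
from statistics import fmean


def split_into_windows(frames, window):
    """Split a list of per-frame feature vectors into consecutive short windows."""
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def short_term_stage(window_frames):
    """Placeholder for the first (short-term) encoder.

    Each frame is blended with the mean of its own short window, so the
    output for a frame depends only on nearby frames. (Hypothetical stand-in,
    not a transformer.)
    """
    dim = len(window_frames[0])
    mean = [fmean(f[d] for f in window_frames) for d in range(dim)]
    return [[0.5 * f[d] + 0.5 * mean[d] for d in range(dim)] for f in window_frames]


def long_term_stage(per_frame_tokens):
    """Placeholder for the second (long-term) encoder.

    Pools every per-frame token in the clip into one clip-level vector,
    from which an action label would be predicted. (Hypothetical stand-in.)
    """
    dim = len(per_frame_tokens[0])
    return [fmean(tok[d] for tok in per_frame_tokens) for d in range(dim)]


def hierarchical_forward(frames, window=4):
    """Run the short-term stage per window, then the long-term stage per clip."""
    per_frame = []
    for w in split_into_windows(frames, window):
        per_frame.extend(short_term_stage(w))
    clip = long_term_stage(per_frame)
    return per_frame, clip


# Toy clip: 8 frames with 2-dimensional features.
frames = [[float(i), 1.0] for i in range(8)]
per_frame, clip = hierarchical_forward(frames, window=4)
print(len(per_frame), clip)
```

The per-frame outputs play the role of the pose-level estimates, while the single clip-level vector plays the role of the action-level representation. In a real implementation, each placeholder would be replaced by a learned encoder.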

updated: Mon Nov 07 2022 02:51:37 GMT+0000 (UTC)

published: Tue Sep 20 2022 05:52:54 GMT+0000 (UTC)

arXiv

