Tensor Representations for Action Recognition

Piotr Koniusz; Lei Wang; Anoop Cherian

Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) the sequence compatibility kernel (SCK) and (ii) the dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK(+), that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce linearizations of these kernels that lead to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences, thus detecting fine-grained relationships of features rather than merely counting features in action sequences. We prove that a tensor of order r, built from Z*-dimensional features and coupled with EPN, indeed detects whether at least one higher-order occurrence is 'projected' into one of its binom(Z*,r) subspaces of dimension r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with binom(Z*,r) such 'detectors'.
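To make the phrase "a tensor of order r, built from Z*-dimensional features" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the tensor is the average of r-fold outer products of per-frame feature vectors, which the abstract does not state explicitly, and it omits EPN entirely. The function names `order_r_tensor` and `num_subspaces` and the toy features are hypothetical, chosen only for illustration.

```python
import itertools
import math


def order_r_tensor(features, r):
    """Average the r-fold outer products of the feature vectors.

    Assumption: this aggregation is illustrative only; the abstract does
    not specify how the tensor is constructed. The tensor is stored as a
    dict mapping index tuples (i_1, ..., i_r) to coefficients.
    """
    z = len(features[0])
    tensor = {idx: 0.0 for idx in itertools.product(range(z), repeat=r)}
    for phi in features:
        for idx in tensor:
            # Coefficient of the outer product at position idx.
            tensor[idx] += math.prod(phi[i] for i in idx)
    n = len(features)
    return {idx: value / n for idx, value in tensor.items()}


def num_subspaces(z, r):
    """Number of r-element subsets of z feature dimensions, binom(z, r)."""
    return math.comb(z, r)


# Two toy 3-dimensional feature vectors (hypothetical data).
features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
T = order_r_tensor(features, r=3)
```

In this toy case the third-order tensor has 3^3 = 27 coefficients and is symmetric under index permutations, and `num_subspaces(4, 3)` counts the four 3-dimensional coordinate subspaces available to 4-dimensional features, matching the binom(Z*,r) count in the abstract.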

updated: Sat Aug 28 2021 17:35:50 GMT+0000 (UTC)

published: Mon Dec 28 2020 17:27:18 GMT+0000 (UTC)

arXiv