Tensor Representations for Action Recognition

Piotr Koniusz; Lei Wang; Anoop Cherian

Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) the sequence compatibility kernel (SCK) and (ii) the dynamics compatibility kernel (DCK); the former capitalizes on the spatio-temporal correlations between features, while the latter explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK+, that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce a linearization of these kernels that leads to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences, thus detecting fine-grained relationships of features rather than merely counting features in scenes. We prove that a tensor of order r, built from Z*-dimensional features and coupled with EPN, indeed detects whether at least one higher-order occurrence is `projected' into one of the binom(Z*,r) subspaces of dimension r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with binom(Z*,r) such `detectors'.
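
As a rough illustration of the spectral idea behind EPN, the sketch below covers only the order-2 (matrix) case. It assumes that, at order r = 2, EPN amounts to raising the eigenvalues of an averaged outer-product matrix to a power gamma. The function names, the random features, and the choice gamma = 0.5 are hypothetical and not taken from the paper. The order-r tensors described in the abstract would need a higher-order decomposition instead of a plain eigendecomposition.

```python
import numpy as np


def second_order_descriptor(features):
    """Average outer product of the rows of `features` (shape N x Z)."""
    n = features.shape[0]
    return features.T @ features / n


def eigenvalue_power_normalization(m, gamma=0.5):
    """Raise the eigenvalues of a symmetric PSD matrix to the power gamma.

    Illustrative order-2 analogue only (an assumption, not the paper's code).
    """
    w, v = np.linalg.eigh(m)
    # Guard against tiny negative eigenvalues caused by round-off.
    w = np.clip(w, 0.0, None)
    # Rebuild the matrix with the same eigenvectors and flattened spectrum.
    return (v * w**gamma) @ v.T


# Hypothetical input: 50 feature vectors of dimension Z = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 4))

m = second_order_descriptor(x)
m_epn = eigenvalue_power_normalization(m, gamma=0.5)
```

With gamma below 1 the large eigenvalues are compressed relative to the small ones, so the descriptor responds more to whether a correlation is present than to how often it occurs. That is the "detect rather than count" intuition stated in the abstract.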

updated: Mon Dec 28 2020 17:27:18 GMT+0000 (UTC)

published: Mon Dec 28 2020 17:27:18 GMT+0000 (UTC)

arXiv
