Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors

Lei Wang; Piotr Koniusz

In this paper, we build on the concept of self-supervision by taking RGB frames as input and learning to predict both action concepts and auxiliary descriptors, e.g., object descriptors. So-called hallucination streams are trained to predict auxiliary cues, are simultaneously fed into classification layers, and are then hallucinated at the testing stage to aid the network. We design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and ImageNet-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. The second descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of the probability distribution, we capture four statistical moments on the above intermediate descriptors. As the numbers of coefficients in the mean, covariance, coskewness and cokurtosis grow linearly, quadratically, cubically and quartically, respectively, w.r.t. the dimension of feature vectors, we describe the covariance matrix by its leading n' eigenvectors (the so-called subspace) and we capture skewness/kurtosis rather than the costly coskewness/cokurtosis. We obtain state-of-the-art results on five popular datasets such as Charades and EPIC-Kitchens.
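
The moment-based encoding described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `moment_descriptor`, the parameter `n_sub` (standing in for n'), the stabilizer `eps`, and the choice to simply concatenate the four parts are all assumptions made for illustration. For N feature vectors of dimension d, it keeps the mean (d values), the leading `n_sub` eigenvectors of the covariance (d x n_sub values), and the per-dimension skewness and kurtosis (d values each), so the size grows linearly in d instead of as d^3 or d^4.

```python
import numpy as np

def moment_descriptor(X, n_sub=3, eps=1e-8):
    """Hypothetical helper: encode an (N, d) set of feature vectors.

    Returns a vector of length d * (n_sub + 3):
    [mean | leading eigenvectors of covariance | skewness | kurtosis].
    """
    X = np.asarray(X, dtype=np.float64)
    mu = X.mean(axis=0)                      # first moment, d values
    Xc = X - mu                              # centered features
    cov = Xc.T @ Xc / X.shape[0]             # (d, d) covariance

    # eigh returns eigenvalues in ascending order; keep the largest n_sub.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_sub]
    subspace = eigvecs[:, order]             # (d, n_sub), orthonormal columns

    # Per-dimension standardized third and fourth moments, used in place
    # of the full coskewness (d^3) and cokurtosis (d^4) tensors.
    std = np.sqrt(np.diag(cov)) + eps
    Z = Xc / std
    skew = (Z ** 3).mean(axis=0)
    kurt = (Z ** 4).mean(axis=0)

    return np.concatenate([mu, subspace.ravel(), skew, kurt])

# Usage on synthetic data (assumed shapes: 200 vectors of dimension 5).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
desc = moment_descriptor(X, n_sub=2)
print(desc.shape)
```

Eigenvectors are only defined up to sign, so a practical pipeline would need some sign or basis normalization before comparing subspaces across videos; this sketch leaves that step out.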

updated: Thu Aug 05 2021 15:25:12 GMT+0000 (UTC)

published: Tue Jan 14 2020 05:03:54 GMT+0000 (UTC)

arXiv

