Dynamic Appearance: A Video Representation for Action Recognition with Joint Training

Guoxi Huang; Adrian G. Bors

The static appearance of a video may impede the ability of a deep neural network to learn motion-relevant features in video action recognition. In this paper, we introduce a new concept, Dynamic Appearance (DA), which summarizes the appearance information related to movement in a video while filtering out static information considered unrelated to motion. We consider distilling the dynamic appearance from raw video data as a means of efficient video understanding. To this end, we propose the Pixel-Wise Temporal Projection (PWTP), which projects the static appearance of a video into a subspace of its original vector space, while the dynamic appearance is encoded in the projection residual, which describes a special motion pattern. Moreover, we integrate the PWTP module with a CNN or Transformer into an end-to-end training framework, which is optimized with multi-objective optimization algorithms. We provide extensive experimental results on four action recognition benchmarks: Kinetics400, Something-Something V1, UCF101 and HMDB51.
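The idea of keeping a projection residual can be illustrated with a minimal NumPy sketch. This is not the authors' PWTP module, which the abstract describes as trained end to end. Here the subspace is fixed by hand, and the names `projection_residual` and `basis` are hypothetical. Each pixel's values over time are treated as one vector, that vector is orthogonally projected onto an assumed "static" subspace, and the remainder is kept as a stand-in for the dynamic part.

```python
import numpy as np

def projection_residual(video, basis):
    """Split a clip into a projected part and a residual, pixel by pixel.

    video: array of shape (T, H, W, C).
    basis: array of shape (T, K) whose columns span the ASSUMED static
           subspace (hand-picked here; the paper's projection is learned).
    Returns (static, residual), both with the same shape as `video`.
    """
    t = video.shape[0]
    # Each column of x is the temporal vector of one pixel-channel.
    x = video.reshape(t, -1).astype(np.float64)
    # Orthonormalize the basis so that q @ q.T is an orthogonal projector.
    q, _ = np.linalg.qr(np.asarray(basis, dtype=np.float64))
    static = q @ (q.T @ x)   # component inside the subspace
    residual = x - static    # component orthogonal to the subspace
    return static.reshape(video.shape), residual.reshape(video.shape)

# Toy clip: constant background with a brief change at two pixels.
video = np.full((4, 3, 3, 1), 5.0)
video[0, 0, 0, 0] += 1.0
video[1, 0, 1, 0] += 1.0

# Assumption: the static subspace is spanned by the constant vector,
# so the projection is simply the temporal mean of each pixel.
static, residual = projection_residual(video, np.ones((4, 1)))
```

With the constant basis chosen here, pixels that never change get a zero residual, and only the pixels that vary over time keep a non-zero signal. A learned projection, as in the paper, would replace the hand-picked `basis`.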

updated: Wed Nov 23 2022 07:16:16 GMT+0000 (UTC)

published: Wed Nov 23 2022 07:16:16 GMT+0000 (UTC)

arXiv
