Rich Action-semantic Consistent Knowledge for Early Action Prediction

Xiaoli Liu; Jianqin Yin; Di Guo

Early action prediction (EAP) aims to recognize human actions from a partially observed action execution in an ongoing video, an important task for many practical applications. Most prior works treat a partial or full video as a whole and ignore the rich action knowledge hidden in videos, i.e., the semantic consistencies among different partial videos. In contrast, we partition the original partial or full videos to form a new series of partial videos and mine the Action Semantic Consistent Knowledge (ASCK) among these new partial videos, which evolve at arbitrary progress levels. Moreover, we propose a novel Rich Action-semantic Consistent Knowledge network (RACK) under the teacher-student framework for EAP. First, we use a two-stream pre-trained model to extract video features. Second, we treat the RGB or flow features of the partial videos as nodes and their action semantic consistencies as edges. Next, we build a bi-directional semantic graph for the teacher network and a single-directional semantic graph for the student network to model rich ASCK among partial videos. The MSE and MMD losses are combined as our distillation loss to transfer the ASCK of partial videos from the teacher network to the student network. Finally, we obtain the final prediction by summing the logits of the different sub-networks and applying a softmax layer. Extensive experiments and ablation studies demonstrate the effectiveness of modeling rich ASCK for EAP. With the proposed RACK, we achieve state-of-the-art performance on three benchmarks. The code will be released if the paper is accepted.
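
The last two steps of the abstract (a distillation loss built from MSE and MMD terms, and a prediction obtained by summing sub-network logits before a softmax) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, the weights `w_mse` and `w_mmd`, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def mse_loss(student, teacher):
    """Mean squared error between student and teacher feature matrices."""
    return float(np.mean((student - teacher) ** 2))


def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def mmd_loss(student, teacher, sigma=1.0):
    """Biased estimate of the squared MMD (kernel choice is an assumption)."""
    k_ss = gaussian_kernel(student, student, sigma).mean()
    k_tt = gaussian_kernel(teacher, teacher, sigma).mean()
    k_st = gaussian_kernel(student, teacher, sigma).mean()
    return float(k_ss + k_tt - 2.0 * k_st)


def distillation_loss(student, teacher, w_mse=1.0, w_mmd=1.0, sigma=1.0):
    """Weighted sum of MSE and MMD; the weights are hypothetical."""
    return w_mse * mse_loss(student, teacher) + w_mmd * mmd_loss(
        student, teacher, sigma
    )


def fuse_logits(logits_list):
    """Sum logits from several sub-networks, then apply a softmax."""
    total = np.sum(logits_list, axis=0)
    total = total - total.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(total)
    return exp / exp.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feats = rng.normal(size=(8, 16))  # 8 partial videos, 16-dim features
    student_feats = teacher_feats + 0.1 * rng.normal(size=(8, 16))
    print("distillation loss:", distillation_loss(student_feats, teacher_feats))

    rgb_logits = rng.normal(size=(2, 5))   # 2 videos, 5 action classes
    flow_logits = rng.normal(size=(2, 5))
    probs = fuse_logits([rgb_logits, flow_logits])
    print("predicted classes:", probs.argmax(axis=-1))
```

Summing logits before the softmax, rather than averaging per-stream probabilities, lets a confident sub-network dominate the fused prediction; whether the paper weights the sub-networks is not stated in the abstract, so the sketch uses an unweighted sum.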

updated: Fri Jan 20 2023 02:57:36 GMT+0000 (UTC)

published: Sun Jan 23 2022 03:39:31 GMT+0000 (UTC)

arXiv
