Video Action Recognition with Attentive Semantic Units

Yifei Chen; Dapeng Chen; Ruijin Liu; Hao Li; Wei Peng

Visual-Language Models (VLMs) have significantly advanced video action recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Despite the effectiveness demonstrated by these works, we believe that the potential of VLMs has yet to be fully harnessed. In light of this, we exploit the semantic units (SUs) hidden behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignment between visual content and the SUs, we introduce a multi-region module (MRA) to the visual branch of the VLM. The MRA enables the perception of region-aware visual features beyond the original global feature. Our method uses the visual features of frames to adaptively attend to and select relevant SUs. With a cross-modal decoder, the selected SUs then serve to decode spatiotemporal video representations. In summary, using SUs as the medium can boost both discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieves 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpasses the previous state of the art by 7.1% and 15.0% on HMDB-51 and UCF-101, respectively.
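The idea of attending to and selecting SUs with a frame's visual feature can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention form, so plain scaled dot-product attention with a softmax is assumed here, and the function name `attend_semantic_units`, the `temperature` parameter, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_semantic_units(frame_feature, su_embeddings, temperature=1.0):
    """Weight each semantic-unit embedding by its similarity to one frame feature.

    Assumption: similarity is a dot product scaled by a temperature.
    Returns the attention weights and the weighted sum of SU embeddings.
    """
    scores = [dot(frame_feature, su) / temperature for su in su_embeddings]
    weights = softmax(scores)
    dim = len(frame_feature)
    attended = [
        sum(w * su[i] for w, su in zip(weights, su_embeddings))
        for i in range(dim)
    ]
    return weights, attended


# Toy, made-up embeddings standing in for text features of three SUs.
su_names = ["hand", "ball", "pool"]
su_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frame_feature = [0.9, 0.1]  # hypothetical visual feature of one frame

weights, attended = attend_semantic_units(
    frame_feature, su_embeddings, temperature=0.1
)
best = su_names[weights.index(max(weights))]
```

In this toy setup the frame feature is closest to the first embedding, so that SU receives the largest weight. In a real model, the attended SU features would be passed on to a decoder rather than inspected directly.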

updated: Tue Oct 10 2023 13:31:22 GMT+0000 (UTC)

published: Fri Mar 17 2023 03:44:15 GMT+0000 (UTC)

arXiv
