YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection

Jianhua Yang; Kun Dai

Designing a real-time framework for spatio-temporal action detection remains a challenge. In this paper, we propose YOWOv2, a novel real-time action detection framework. YOWOv2 takes advantage of both a 3D backbone and a 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To this end, we carefully build a simple and efficient 2D backbone with a feature pyramid network to extract classification features and regression features at different levels. For the 3D backbone, we adopt an existing efficient 3D CNN to save development time. By combining 3D and 2D backbones of different sizes, we design a YOWOv2 family that includes YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. We also introduce the popular dynamic label assignment strategy and an anchor-free mechanism to bring YOWOv2 in line with advanced model architecture design. With these improvements, YOWOv2 is significantly superior to YOWO while still running in real time. Without any bells and whistles, YOWOv2 achieves 87.0% frame mAP and 52.8% video mAP at over 20 FPS on UCF101-24. On AVA, YOWOv2 achieves 21.7% frame mAP at over 20 FPS. Our code is available at https://github.com/yjh0410/YOWOv2.
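To make the multi-level, anchor-free idea concrete, here is a minimal sketch of how per-cell predictions on several pyramid levels could be decoded into boxes. It is a generic illustration, not the YOWOv2 implementation: the strides (8, 16, 32), the `(dx, dy, log_w, log_h)` parameterisation, and the function names are all assumptions that the abstract does not state.

```python
import math

# Assumed strides for three pyramid levels; the abstract does not give them.
STRIDES = (8, 16, 32)


def grid_points(img_size, stride):
    """Return the (x, y) cell indices of one square feature level, row-major."""
    n = img_size // stride
    return [(x, y) for y in range(n) for x in range(n)]


def decode_box(cell, pred, stride):
    """Decode one anchor-free prediction into (x1, y1, x2, y2) in pixels.

    pred is a hypothetical (dx, dy, log_w, log_h) tuple: an offset from the
    cell's top-left corner and a log-scale size, both in units of the stride.
    """
    gx, gy = cell
    dx, dy, log_w, log_h = pred
    cx = (gx + dx) * stride
    cy = (gy + dy) * stride
    w = math.exp(log_w) * stride
    h = math.exp(log_h) * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def count_predictions(img_size):
    """Total number of per-cell predictions summed over all levels."""
    return sum(len(grid_points(img_size, s)) for s in STRIDES)
```

Under these assumptions, each cell on every level emits exactly one box, so no anchor shapes need to be tuned. Coarser levels (larger stride) cover larger action instances and finer levels cover smaller ones.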

updated: Thu Jun 08 2023 01:49:33 GMT+0000 (UTC)

published: Tue Feb 14 2023 05:52:45 GMT+0000 (UTC)

arXiv
