AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Yulin Wang; Yang Yue; Yuanze Lin; Haojun Jiang; Zihang Lai; Victor Kulikov; Nikita Orlov; Humphrey Shi; Gao Huang

Recent work has shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy. As a representative example, the adaptive focus method (AdaFocus) achieves a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which leads to slow convergence and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, namely the lack of supervision, of input diversity, and of training stability. Moreover, a conditional-exit technique is proposed to perform temporally adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train. Code is available at https://github.com/LeapLabTHU/AdaFocusV2.
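To illustrate why an interpolation-based patch selection can be trained end to end, the following is a minimal sketch and not the authors' implementation. It assumes a single-channel image stored as a list of rows and uses plain Python instead of a deep learning framework. The helper names `bilinear_sample` and `crop_patch` are hypothetical. Because each output pixel is a bilinear blend of its four neighbours, the patch content varies continuously with the real-valued patch centre, so a gradient with respect to the centre exists; here a finite difference stands in for automatic differentiation.

```python
import math


def bilinear_sample(image, y, x):
    """Sample a 2D image (list of rows, at least 2x2) at a real-valued (y, x)."""
    h, w = len(image), len(image[0])
    # Clamp the coordinates so all four neighbours lie inside the image.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0 = min(int(math.floor(y)), h - 2)
    x0 = min(int(math.floor(x)), w - 2)
    dy, dx = y - y0, x - x0
    # Blend along x on the two neighbouring rows, then blend along y.
    top = (1 - dx) * image[y0][x0] + dx * image[y0][x0 + 1]
    bottom = (1 - dx) * image[y0 + 1][x0] + dx * image[y0 + 1][x0 + 1]
    return (1 - dy) * top + dy * bottom


def crop_patch(image, center_y, center_x, size):
    """Crop a size x size patch centred at a continuous (center_y, center_x)."""
    offset = (size - 1) / 2.0
    return [
        [
            bilinear_sample(image, center_y - offset + i, center_x - offset + j)
            for j in range(size)
        ]
        for i in range(size)
    ]


# Toy 6x6 image whose intensity is a linear ramp: pixel (i, j) = 10 * i + j.
image = [[10.0 * i + j for j in range(6)] for i in range(6)]

patch_a = crop_patch(image, 2.5, 2.5, 2)   # lands exactly on pixel centres
patch_b = crop_patch(image, 2.75, 2.5, 2)  # centre shifted by 0.25 along y

# Finite-difference estimate of d(sum of patch) / d(center_y).
sum_a = sum(sum(row) for row in patch_a)
sum_b = sum(sum(row) for row in patch_b)
grad_center_y = (sum_b - sum_a) / 0.25

print(patch_a)        # [[22.0, 23.0], [32.0, 33.0]]
print(patch_b)        # [[24.5, 25.5], [34.5, 35.5]]
print(grad_center_y)  # 40.0
```

A hard integer crop would give a zero or undefined gradient with respect to the patch location, which is the kind of obstacle that motivates the reinforcement-learning stage mentioned above. In a framework with automatic differentiation, the same idea lets the loss on the cropped patch propagate back to whichever module predicts the centre coordinates.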

updated: Tue Apr 12 2022 02:44:14 GMT+0000 (UTC)

published: Tue Dec 28 2021 17:53:38 GMT+0000 (UTC)

arXiv
