Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Shreyank N Gowda; Anurag Arnab; Jonathan Huang

In this paper, we address the substantial training time and memory consumption of video transformers, focusing on the ViViT (Video Vision Transformer) model, and in particular its Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach adopted by many state-of-the-art methods. Although it stands out among the ViViT variants for its favorable speed/accuracy tradeoff, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this yields a low-accuracy model. However, we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) and (2) introducing a compact adapter model that connects the frozen spatial representations (from a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experiments on 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training cost (by ∼50%) and memory consumption while maintaining performance or slightly improving it, by up to 1.79%, compared to the baseline model. Our approach additionally makes it possible to use larger image transformer models as the spatial transformer and to process more frames with the same memory consumption.
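To make the freezing idea concrete, the following is a rough back-of-the-envelope sketch in plain Python, not the authors' implementation. It counts trainable parameters for a factorised encoder with and without a frozen spatial transformer. All sizes are assumptions chosen for illustration: a ViT-B-like spatial transformer (12 layers, width 768, MLP width 3072), a 4-layer temporal transformer of the same width, and a hypothetical bottleneck adapter of width 64. The abstract does not specify the adapter design, so `adapter_params` is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A named group of parameters that is either trained or frozen."""
    name: str
    num_params: int
    trainable: bool = True


def transformer_params(num_layers: int, dim: int, mlp_dim: int) -> int:
    """Approximate parameter count of a standard transformer encoder stack.

    Per layer: attention projections (q, k, v, out) with biases,
    a two-layer MLP with biases, and two LayerNorms (scale and bias each).
    """
    attention = 4 * dim * dim + 4 * dim
    mlp = 2 * dim * mlp_dim + mlp_dim + dim
    layer_norms = 4 * dim
    return num_layers * (attention + mlp + layer_norms)


def adapter_params(dim: int, bottleneck: int) -> int:
    """Hypothetical adapter: down-projection and up-projection with biases."""
    return dim * bottleneck + bottleneck + bottleneck * dim + dim


def build_model(freeze_spatial: bool) -> list[Block]:
    """Assumed sizes only; the adapter exists only in the frozen setup."""
    blocks = [
        Block("spatial_transformer", transformer_params(12, 768, 3072),
              trainable=not freeze_spatial),
        Block("temporal_transformer", transformer_params(4, 768, 3072)),
    ]
    if freeze_spatial:
        blocks.insert(1, Block("adapter", adapter_params(768, 64)))
    return blocks


def trainable_params(blocks: list[Block]) -> int:
    """Sum the parameters that would receive gradients and optimizer state."""
    return sum(b.num_params for b in blocks if b.trainable)


if __name__ == "__main__":
    baseline = trainable_params(build_model(freeze_spatial=False))
    frozen = trainable_params(build_model(freeze_spatial=True))
    print(f"baseline trainable parameters: {baseline:,}")
    print(f"frozen-spatial trainable parameters: {frozen:,}")
    print(f"trainable fraction: {frozen / baseline:.2%}")
```

The trainable-parameter fraction is only a proxy: it does not reproduce the ∼50% training-cost reduction reported above, which also depends on activation memory, the number of frames, and the forward pass through the frozen module. It does show why freezing helps: the frozen spatial transformer sits at the input side of the model, so it needs neither gradients nor optimizer state, and backpropagation can stop at the adapter.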

updated: Wed Jun 07 2023 23:06:53 GMT+0000 (UTC)

published: Wed Jun 07 2023 23:06:53 GMT+0000 (UTC)

arXiv

