ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Junting Pan; Ziyi Lin; Xiatian Zhu; Jing Shao; Hongsheng Li

Capitalizing on large pre-trained models for various downstream tasks of interest has recently shown promising performance. Due to the ever-growing model size, the standard task adaptation strategy based on full fine-tuning becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality as the pre-trained model (e.g., image understanding). This is limiting because in some modalities (e.g., video understanding), strong pre-trained models with sufficient knowledge are scarce or unavailable. In this work, we investigate this novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters than previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency. The code and model are available at https://github.com/linziyi96/st-adapter
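
To make the "~8% per-task parameter cost" figure concrete, the following is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration and is not stated in the abstract: the adapter is taken to be a bottleneck (down-projection, depthwise 3D convolution, up-projection), the backbone is taken to be a ViT-B-sized image model with roughly 86M parameters and width 768, and the bottleneck width, kernel size, and number of adapters are hypothetical choices. The function names are hypothetical as well.

```python
def adapter_params(width, bottleneck, kernel=(3, 1, 1)):
    """Count parameters of one assumed bottleneck adapter.

    Assumed layout (not confirmed by the abstract):
    linear down-projection -> depthwise 3D conv -> linear up-projection.
    """
    kt, kh, kw = kernel
    down = width * bottleneck + bottleneck        # weights + bias
    dwconv = bottleneck * kt * kh * kw + bottleneck  # one filter per channel + bias
    up = bottleneck * width + width               # weights + bias
    return down + dwconv + up


def updated_fraction(backbone_params, width, bottleneck, num_adapters,
                     kernel=(3, 1, 1)):
    """Fraction of the frozen backbone's size that is trained per task."""
    total = num_adapters * adapter_params(width, bottleneck, kernel)
    return total / backbone_params


if __name__ == "__main__":
    # Hypothetical setting: ~86M-parameter backbone, width 768,
    # bottleneck 384, one adapter in each of 12 blocks.
    frac = updated_fraction(86_000_000, 768, 384, 12)
    print(f"trainable parameters relative to backbone: {frac:.1%}")
```

Under these assumed settings the trainable parameters come to roughly 8% of the backbone size, which is in line with the figure quoted above; a different bottleneck width or adapter count would change the ratio.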

updated: Thu Oct 13 2022 06:34:19 GMT+0000 (UTC)

published: Mon Jun 27 2022 18:02:29 GMT+0000 (UTC)

arXiv
