Long-range Multimodal Pretraining for Movie Understanding

Dawit Mureja Argaw; Joon-Young Lee; Markus Woodson; In So Kweon; Fabian Caba Heilbron

Learning computer vision models from (and for) movies has a long-standing history. While great progress has been made, there is still a need for a pretrained multimodal model that performs well across the ever-growing set of movie understanding tasks the community has been establishing. In this work, we introduce Long-range Multimodal Pretraining, a strategy and a model that leverage movie data to train transferable multimodal and cross-modal encoders. Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long range. After pretraining, we run ablation studies on the LVU benchmark and validate our modeling choices and the importance of learning from long-range time spans. Our model achieves state-of-the-art results on several LVU tasks while being much more data efficient than previous works. Finally, we evaluate our model's transferability by setting a new state of the art on five different benchmarks.

updated: Fri Aug 18 2023 18:52:59 GMT+0000 (UTC)

published: Fri Aug 18 2023 18:52:59 GMT+0000 (UTC)

arXiv
