Learning Video Representations from Large Language Models

Yue Zhao; Ishan Misra; Philipp Krähenbühl; Rohit Girdhar

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains absolute gains of 10.1% on EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior as pre-training data and model size increase.
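
The abstract states only that the video-text embedding is learned contrastively. One common way to do this is a symmetric InfoNCE-style loss over a batch of paired clips and narrations, and the sketch below illustrates that idea under this assumption. It is not the authors' implementation: the function names, the temperature value of 0.07, and the toy vectors are all hypothetical, and plain Python lists stand in for real encoder outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: the i-th video and i-th text are a positive
    pair, and every other pairing in the batch acts as a negative."""
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(videos)
    # Similarity matrix: rows index videos, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

# Toy batch of two clip embeddings and two narration embeddings (made up).
clips = [[1.0, 0.0], [0.0, 1.0]]
matched_narrations = [[1.0, 0.0], [0.0, 1.0]]
swapped_narrations = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = symmetric_contrastive_loss(clips, matched_narrations)
swapped_loss = symmetric_contrastive_loss(clips, swapped_narrations)
```

In this toy batch the correctly paired narrations give a much lower loss than the swapped ones, which is the signal a contrastive objective uses to pull matching clips and narrations together. Under this reading, auto-generated narrations would simply contribute additional positive pairs to each batch.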

updated: Thu Dec 08 2022 18:59:59 GMT+0000 (UTC)

published: Thu Dec 08 2022 18:59:59 GMT+0000 (UTC)

arXiv
