Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Zhenhailong Wang; Manling Li; Ruochen Xu; Luowei Zhou; Jie Lei; Xudong Lin; Shuohang Wang; Ziyi Yang; Chenguang Zhu; Derek Hoiem; Shih-Fu Chang; Mohit Bansal; Heng Ji

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from a few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder and therefore lack a video-to-text decoder for generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and cannot generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without pretraining or finetuning on any video datasets. We use image-language models to translate the video content into frame captions and object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to take in any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. Notably, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL.
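The prompting pipeline described in the abstract (image-level descriptors composed into a temporally ordered text, then wrapped with a few in-context examples) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `VideoDescriptors`, `temporal_markers`, `compose_video_text`, and `build_prompt`, the field labels ("Objects:", "Frame captions:", "Subtitles:", "Output:"), and the "First / Then / Finally" ordering words are all assumptions made for illustration. The descriptors themselves are taken as given; in practice they would come from image-language models, and the resulting prompt would be sent to a language model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VideoDescriptors:
    """Text extracted from sampled frames (hypothetical container)."""
    frame_captions: List[str]                      # one caption per sampled frame, in time order
    objects: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    asr: Optional[str] = None                      # optional speech transcript


def temporal_markers(n: int) -> List[str]:
    """Assumed ordering words that make the frame order explicit in text."""
    if n <= 0:
        return []
    if n == 1:
        return ["First"]
    return ["First"] + ["Then"] * (n - 2) + ["Finally"]


def compose_video_text(d: VideoDescriptors) -> str:
    """Compose descriptors into one temporally structured text block."""
    lines = []
    if d.objects:
        lines.append("Objects: " + ", ".join(d.objects))
    if d.events:
        lines.append("Events: " + ", ".join(d.events))
    if d.attributes:
        lines.append("Attributes: " + ", ".join(d.attributes))
    markers = temporal_markers(len(d.frame_captions))
    frames = " ".join(f"{m}, {c}." for m, c in zip(markers, d.frame_captions))
    lines.append("Frame captions: " + frames)
    if d.asr:
        lines.append("Subtitles: " + d.asr)
    return "\n".join(lines)


def build_prompt(
    instruction: str,
    examples: List[Tuple[VideoDescriptors, str]],
    query: VideoDescriptors,
) -> str:
    """Instruction, then solved in-context examples, then the unsolved query."""
    blocks = [instruction]
    for desc, target in examples:
        blocks.append(compose_video_text(desc) + "\nOutput: " + target)
    blocks.append(compose_video_text(query) + "\nOutput:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    shot = VideoDescriptors(
        frame_captions=["a man picks up a knife", "a man cuts an onion", "a man stirs a pan"],
        objects=["knife", "onion", "pan"],
        events=["cooking"],
    )
    query = VideoDescriptors(
        frame_captions=["a woman holds a guitar", "a woman plays a guitar"],
        objects=["guitar"],
        asr="this next song is one I wrote last year",
    )
    print(build_prompt("Generate a video caption from the descriptors.",
                       [(shot, "A man chops an onion and cooks it in a pan.")],
                       query))
```

Because the task is specified only by the instruction line and the in-context targets, the same composition step could in principle be reused for question answering or future event prediction by changing those two pieces; how closely this matches the released VidIL code should be checked against the repository.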

updated: Thu Oct 13 2022 06:32:37 GMT+0000 (UTC)

published: Sun May 22 2022 05:18:27 GMT+0000 (UTC)

arXiv
