VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation

Xilun Chen; Lili Yu; Wenhan Xiong; Barlas Oğuz; Yashar Mehdad; Wen-tau Yih

We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering. A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and is then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. As a result, our VideoOFA model achieves new state-of-the-art performance on four video captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr score. It also outperforms existing models on two open-ended video question answering datasets, showcasing its generalization capability as a universal video-to-text model.
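
The ordering of the two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Stage`, `STAGES`, `toy_train`, and `run_stages` are hypothetical, and the "weights" are a placeholder tuple that only records which stages have been run. The sketch shows one thing only, that the video-text stage starts from the result of the image-text stage rather than from scratch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str  # hypothetical label for the stage
    data: str  # kind of paired data the stage uses
    goal: str  # what the abstract says the stage should teach


# The two stages named in the abstract, in the order they are applied.
STAGES = [
    Stage("image_text_pretraining", "image-text pairs",
          "fundamental vision-language concepts"),
    Stage("video_text_pretraining", "video-text pairs",
          "video-specific skills such as spatio-temporal reasoning"),
]

# Placeholder for model weights: a tuple of the stage names seen so far.
Weights = Tuple[str, ...]


def toy_train(weights: Weights, stage: Stage) -> Weights:
    """Stand-in for a real training loop; it only records the stage name."""
    return weights + (stage.name,)


def run_stages(stages: List[Stage],
               train_fn: Callable[[Weights, Stage], Weights],
               init: Weights = ()) -> Weights:
    """Run stages sequentially, initializing each from the previous result."""
    weights = init
    for stage in stages:
        weights = train_fn(weights, stage)
    return weights


final_weights = run_stages(STAGES, toy_train)
print(final_weights)
```

In a real setting, `toy_train` would be replaced by an actual training loop over the corresponding dataset. Any downstream fine-tuning on captioning or question answering benchmarks is outside what this sketch models.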

updated: Thu May 04 2023 23:27:21 GMT+0000 (UTC)

published: Thu May 04 2023 23:27:21 GMT+0000 (UTC)

arXiv
