Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Antoine Yang; Arsha Nagrani; Paul Hongsuck Seo; Antoine Miech; Jordi Pont-Tuset; Ivan Laptev; Josef Sivic; Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html.
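The idea of placing event boundaries and captions in one output sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the `<time=k>` token format, the choice of 100 time bins, and the names `Segment`, `time_token` and `build_target_sequence` are all assumptions made for illustration. The sketch quantizes each transcribed sentence's start and end time into a discrete bin relative to the video duration, then interleaves the resulting time tokens with the sentence text, treating speech sentences as pseudo event captions.

```python
from dataclasses import dataclass

# Assumption: number of discrete time bins; chosen here for illustration only.
NUM_TIME_BINS = 100


@dataclass
class Segment:
    """A transcribed speech sentence with its start and end time in seconds."""
    start: float
    end: float
    text: str


def time_token(t: float, duration: float, num_bins: int = NUM_TIME_BINS) -> str:
    """Map an absolute timestamp to a hypothetical discrete time token."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    idx = int(t * num_bins / duration)
    idx = max(0, min(idx, num_bins - 1))  # clamp into the valid bin range
    return f"<time={idx}>"


def build_target_sequence(segments: list[Segment], duration: float) -> str:
    """Interleave start/end time tokens with sentence text, ordered by start time."""
    parts = []
    for seg in sorted(segments, key=lambda s: s.start):
        parts.append(time_token(seg.start, duration))
        parts.append(time_token(seg.end, duration))
        parts.append(seg.text.strip())
    return " ".join(parts)


# Hypothetical transcript of a 100-second narrated video.
speech = [
    Segment(12.0, 20.5, "Now whisk the eggs."),
    Segment(0.0, 10.0, "Crack two eggs into a bowl."),
]
sequence = build_target_sequence(speech, duration=100.0)
print(sequence)
```

In this sketch the timestamps are expressed relative to the video duration, so videos of different lengths share the same small set of time tokens. A real system would also have to add these tokens to the language model's vocabulary.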

updated: Tue Mar 21 2023 11:01:09 GMT+0000 (UTC)

published: Mon Feb 27 2023 19:53:49 GMT+0000 (UTC)

arXiv
