Zero-Shot Video Captioning with Evolving Pseudo-Tokens

Yoad Tewel; Yoav Shalev; Roy Nadler; Idan Schwartz; Lior Wolf

We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model. The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames. Unlike zero-shot image captioning methods, our work considers the entire sentence at once. This is achieved by optimizing, during the generation process, part of the prompt from scratch, by modifying the representation of all other tokens in the prompt, and by repeating the process iteratively, gradually improving the specificity and comprehensiveness of the generated sentence. Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge. Our code is available at: https://github.com/YoadTew/zero-shot-video-to-text
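The scoring signal described above, a high average matching score between a sentence and a subset of video frames, can be illustrated with a minimal sketch. This is not the authors' implementation: the 3-dimensional vectors are hypothetical stand-ins for CLIP embeddings, cosine similarity is assumed as the matching function, and the function names are invented for illustration. The sketch only ranks fixed candidate sentences; it does not reproduce the pseudo-token optimization that steers GPT-2 during generation.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_matching_score(text_emb: Vector, frame_embs: List[Vector]) -> float:
    """Mean similarity between one sentence embedding and the sampled frames."""
    total = sum(cosine_similarity(text_emb, frame) for frame in frame_embs)
    return total / len(frame_embs)


def best_caption(candidates: Dict[str, Vector], frame_embs: List[Vector]) -> str:
    """Return the candidate sentence with the highest average matching score."""
    return max(
        candidates,
        key=lambda sentence: average_matching_score(candidates[sentence], frame_embs),
    )


# Toy embeddings standing in for CLIP outputs (hypothetical values).
frames = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.6, 0.8, 0.0]]
candidates = {
    "a dog runs on a beach": [0.9, 0.4, 0.1],
    "a man cooks in a kitchen": [0.0, 0.1, 1.0],
}
chosen = best_caption(candidates, frames)
```

In a real pipeline, the frame and sentence vectors would come from the CLIP image and text encoders, and the average score would serve as an optimization objective during generation rather than as a filter over finished sentences.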

updated: Wed Jul 27 2022 21:52:21 GMT+0000 (UTC)

published: Fri Jul 22 2022 14:19:31 GMT+0000 (UTC)

arXiv
