Diverse Video Captioning Through Latent Variable Expansion

Huanhou Xiao; Jinglun Shi

Automatically describing video content with text is a challenging but important task that has attracted considerable attention in the computer vision community. Previous work mainly pursues the accuracy of generated sentences while ignoring their diversity, an omission that is inconsistent with how humans describe videos. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework to do so. Concretely, for a given video, the intermediate latent variables of a conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) in order to generate diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as the generator, which produces descriptions conditioned on the latent variables, and as the discriminator, which assesses the quality of the generated sentences. In addition, a novel metric, DCE, is designed to assess diverse captions. We evaluate our method on benchmark datasets, where it demonstrates the ability to generate diverse descriptions and achieves superior results compared with other state-of-the-art methods.
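The conditioning idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes, as in a standard conditional GAN, that the generator receives a random noise vector alongside the condition, which the abstract does not state. The function name `sample_generator_inputs` and its parameters are hypothetical, chosen only for illustration. The sketch shows how one fixed latent vector for a video can be paired with several noise draws to give several distinct generator inputs, one per candidate caption.

```python
import random


def sample_generator_inputs(latent, num_samples, noise_dim, seed=0):
    """Pair one fixed latent vector with several independent noise vectors.

    Hypothetical helper: `latent` stands in for the intermediate latent
    variables of the encode-decode process; the noise input is an
    assumption borrowed from the standard conditional GAN setup.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    inputs = []
    for _ in range(num_samples):
        # Draw a fresh Gaussian noise vector for each candidate caption.
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        # Condition first, then noise: the condition part is identical
        # across samples, only the noise part differs.
        inputs.append(list(latent) + noise)
    return inputs


if __name__ == "__main__":
    video_latent = [0.5, -1.0]  # toy stand-in for a real latent vector
    for vec in sample_generator_inputs(video_latent, num_samples=3, noise_dim=4):
        print(vec)
```

Each returned vector shares the same leading condition values and differs only in its noise tail, so a generator fed these inputs has a source of variation from which to produce different sentences for the same video.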

updated: Tue Jun 15 2021 14:50:14 GMT+0000 (UTC)

published: Sat Oct 26 2019 08:34:20 GMT+0000 (UTC)

arXiv
