Diverse Video Captioning by Adaptive Spatio-temporal Attention

Zohreh Ghaderi; Leonard Salewski; Hendrik P. A. Lensch

To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures: an adapted transformer for a single joint spatio-temporal video analysis and a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme that reduces the number of required input frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, the large-scale MSR-VTT, and the VATEX benchmark datasets across multiple Natural Language Generation (NLG) metrics. Additional evaluations on diversity scores highlight the expressiveness and structural diversity of our generated captions.
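The abstract does not say how semantic concepts are derived from the aggregated captions. As a rough illustration only, one common way to do this is to pool the ground-truth captions of each video, keep the most frequent content words across the dataset as a concept vocabulary, and give each video a multi-hot target vector. The sketch below makes those assumptions; the stopword list, the `top_k` cutoff, the function names, and the toy captions are all hypothetical and not taken from the paper.

```python
from collections import Counter
import re

# Hypothetical, tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "with"}


def tokenize(caption):
    """Lowercase a caption and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", caption.lower()) if w not in STOPWORDS]


def build_concept_vocab(captions_by_video, top_k):
    """Pick the top_k most frequent words over all captions of all videos."""
    counts = Counter()
    for captions in captions_by_video.values():
        for caption in captions:
            counts.update(tokenize(caption))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def concept_targets(captions, vocab):
    """Aggregate all captions of one video into a multi-hot vector over vocab."""
    present = set()
    for caption in captions:
        present.update(tokenize(caption))
    return [1 if word in present else 0 for word in vocab]


# Toy data (invented for illustration).
captions_by_video = {
    "vid1": ["a man is playing a guitar", "a man plays guitar on stage"],
    "vid2": ["a dog is running in a park", "the dog runs"],
}

vocab = build_concept_vocab(captions_by_video, top_k=4)
targets = {vid: concept_targets(caps, vocab) for vid, caps in captions_by_video.items()}
print(vocab)
print(targets)
```

Such multi-hot vectors could serve as targets for a multi-label concept predictor; whether the paper uses them this way is not stated in the abstract.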

updated: Fri Aug 19 2022 11:21:59 GMT+0000 (UTC)

published: Fri Aug 19 2022 11:21:59 GMT+0000 (UTC)

arXiv
