CLIP4Caption: CLIP for Video Caption

Mingkang Tang; Zhanyu Wang; Zhenhua Liu; Fengyun Rao; Dian Li; Xiu Li

Video captioning is a challenging task because it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representations because they neglect the gap between videos and texts. To bridge this gap, we propose CLIP4Caption, a framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). The framework takes full advantage of information from both vision and language and forces the model to learn strongly text-correlated video features for text generation. In addition, unlike most existing models, which use an LSTM or GRU as the sentence decoder, we adopt a Transformer-structured decoder network to effectively learn long-range visual and language dependencies. We also introduce a novel ensemble strategy for captioning tasks. Experimental results demonstrate the effectiveness of our method on two datasets: 1) on the MSR-VTT dataset, our method achieves a new state-of-the-art result with a significant gain of up to 10% in CIDEr; 2) on the private test data, our method ranks 2nd in the ACM MM Multimedia Grand Challenge 2021: Pre-training for Video Understanding Challenge. Note that our model is trained only on the MSR-VTT dataset.
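
The abstract does not spell out how the video-text matching network is trained. As a rough illustration of what learning text-correlated video features can look like, the following is a minimal sketch of a CLIP-style symmetric contrastive matching loss. It assumes that per-frame embeddings are mean-pooled into a single video embedding and that a temperature of 0.07 is used; the function names, the pooling choice, and the temperature are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frames):
    """Average per-frame embeddings into one video embedding (assumed pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def matching_loss(video_frames, text_embs, temperature=0.07):
    """Symmetric video-to-text and text-to-video contrastive loss.

    video_frames: list of videos, each a list of frame embedding vectors.
    text_embs: list of caption embeddings; text_embs[i] pairs with video i.
    temperature: hypothetical value chosen for this sketch.
    """
    videos = [l2_normalize(mean_pool(f)) for f in video_frames]
    texts = [l2_normalize(t) for t in text_embs]
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))
```

Under these assumptions, the loss is small when each video embedding is closest to its own caption embedding and grows when the pairs are mismatched, which is the sense in which the video features become text-correlated.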

updated: Wed Oct 13 2021 10:17:06 GMT+0000 (UTC)

published: Wed Oct 13 2021 10:17:06 GMT+0000 (UTC)

arXiv
