CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning

Bang Yang; Yuexian Zou

For video captioning, "pre-training and fine-tuning" has become the de facto paradigm: ImageNet Pre-training (INP) is typically used to encode the video content, and a task-oriented network is then fine-tuned from scratch to generate captions. Comparing INP with the recently proposed CLIP (Contrastive Language-Image Pre-training), this paper investigates the potential deficiencies of INP for video captioning and explores the key to generating accurate descriptions. Specifically, our empirical study of INP vs. CLIP shows that INP makes it difficult for video captioning models to capture the semantics of attributes and leaves them sensitive to irrelevant background information. By contrast, CLIP's significant boost in caption quality highlights the importance of attribute-aware representation learning. We are thus motivated to introduce Dual Attribute Prediction, an auxiliary task that requires a video captioning model to learn both the correspondence between video content and attributes and the co-occurrence relations between attributes. Extensive experiments on benchmark datasets demonstrate that our approach enables better learning of attribute-aware representations and brings consistent improvements to models with different architectures and decoding algorithms.
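The abstract does not spell out how Dual Attribute Prediction is formulated, so the following is only a minimal sketch of one way an auxiliary objective of this kind could be set up, not the authors' implementation. It assumes two linear heads that both predict a multi-hot attribute vector: one from a pooled video feature (video-to-attribute correspondence) and one from a partially observed attribute vector (attribute co-occurrence). The function names, the use of binary cross-entropy, and the `weight` parameter are all illustrative assumptions.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over attributes, computed stably from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def linear(weights, bias, x):
    """Apply one linear layer: one output logit per attribute."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def dual_attribute_loss(video_feat, attr_context, targets, video_head, attr_head):
    """Hypothetical two-part auxiliary loss (an assumption, not from the paper).

    video_feat   : pooled video representation
    attr_context : partially observed multi-hot attribute vector
    targets      : full multi-hot attribute vector (1 = attribute present)
    *_head       : (weights, bias) tuples for the two linear heads
    """
    video_logits = linear(*video_head, video_feat)    # content -> attributes
    attr_logits = linear(*attr_head, attr_context)    # attributes -> attributes
    return bce_with_logits(video_logits, targets) + bce_with_logits(attr_logits, targets)


def total_loss(caption_loss, aux_loss, weight=1.0):
    """Combine the captioning loss with the weighted auxiliary loss."""
    return caption_loss + weight * aux_loss


if __name__ == "__main__":
    targets = [1, 0, 1]                       # three toy attributes
    video_feat = [0.5, -0.2]                  # toy 2-d video feature
    attr_context = [1, 0, 0]                  # third attribute hidden
    video_head = ([[0.1, 0.2], [0.0, -0.3], [0.4, 0.1]], [0.0, 0.0, 0.0])
    attr_head = ([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.3, 0.0, 0.5]], [0.0, 0.0, 0.0])
    aux = dual_attribute_loss(video_feat, attr_context, targets, video_head, attr_head)
    print(total_loss(caption_loss=2.0, aux_loss=aux, weight=0.5))
```

In a real model the two heads would be trained jointly with the caption decoder, and the attribute vocabulary would have to be built from the training captions; both details are left out of this sketch.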

updated: Tue Nov 30 2021 06:37:44 GMT+0000 (UTC)

published: Tue Nov 30 2021 06:37:44 GMT+0000 (UTC)

arXiv
