CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang Yang; Tong Zhang; Yuexian Zou

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, and a task-oriented network is then fine-tuned from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through an empirical study of INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model struggles to capture concepts' semantics and is sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves caption quality, highlighting the importance of concept-aware representation learning. With these findings, we further propose Dual Concept Detection (DCD) to inject concept knowledge into the model during training. DCD is an auxiliary task that requires a caption model to learn both the correspondence between video content and concepts and the co-occurrence relations between concepts. Experiments on MSR-VTT and VATEX demonstrate the effectiveness of DCD, and the visualization results further reveal the necessity of learning concept-aware representations.
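The abstract describes DCD only at a high level, so the following is a minimal sketch of how such an auxiliary objective could look, not the authors' implementation. It assumes that concept detection is framed as multi-label classification with binary cross-entropy, that one head predicts concepts from video features and another from text features, and that the auxiliary term is added to the captioning loss with a scalar weight. The names `video_logits`, `text_logits`, `concept_targets`, and `aux_weight` are all hypothetical.

```python
import math


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over a concept vocabulary.

    Uses the numerically stable form
    max(z, 0) - z * y + log(1 + exp(-|z|)) for each logit z and label y.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def dual_concept_loss(video_logits, text_logits, concept_targets):
    """Assumed auxiliary loss: two concept-detection heads, one target vector.

    video_logits: concept scores predicted from video features
                  (video-to-concept correspondence).
    text_logits:  concept scores predicted from text features
                  (concept co-occurrence).
    """
    return (multilabel_bce(video_logits, concept_targets)
            + multilabel_bce(text_logits, concept_targets))


def total_loss(caption_loss, video_logits, text_logits, concept_targets,
               aux_weight=1.0):
    """Captioning loss plus the weighted auxiliary term (weight is assumed)."""
    aux = dual_concept_loss(video_logits, text_logits, concept_targets)
    return caption_loss + aux_weight * aux


if __name__ == "__main__":
    targets = [1, 0, 1]              # concepts present in the reference caption
    good = [4.0, -4.0, 4.0]          # logits that agree with the targets
    bad = [-4.0, 4.0, -4.0]          # logits that disagree with the targets
    print(total_loss(2.0, good, good, targets))
    print(total_loss(2.0, bad, bad, targets))
```

In this sketch the auxiliary term shrinks as both heads agree with the concept labels, which is how training with the combined loss would push concept knowledge into the captioning model.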

updated: Sun Aug 21 2022 15:07:11 GMT+0000 (UTC)

published: Tue Nov 30 2021 06:37:44 GMT+0000 (UTC)

arXiv
