CLIP4Caption ++: Multi-CLIP for Video Caption

Mingkang Tang; Zhanyu Wang; Zhaoyang Zeng; Fengyun Rao; Dian Li

This report describes our solution to the captioning task of the VALUE Challenge 2021. Our solution, named CLIP4Caption++, is built on X-Linear/X-Transformer, an advanced encoder-decoder model. We adopt X-Transformer as our main framework and make the following improvements: 1) we use three strong pre-trained CLIP models to extract text-related appearance visual features; 2) we adopt the TSN sampling strategy for data augmentation; 3) we incorporate video subtitle information, fused with the visual features as guidance, to provide richer semantic information; 4) we design word-level and sentence-level ensemble strategies. Our method achieves CIDEr scores of 86.5, 148.4, and 64.5 on the VATEX, YC2C, and TVC datasets, respectively, showing the strong performance of CLIP4Caption++ on all three datasets.
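The TSN sampling strategy mentioned in item 2 can be illustrated with a minimal sketch. It assumes the common Temporal Segment Networks scheme: the video is split into equal-length segments and one frame is drawn per segment, at random during training (which acts as augmentation) and at the segment centre during evaluation. The function name `tsn_sample_indices` and its parameters are hypothetical and are not taken from the report, whose exact sampling settings are not given here.

```python
import random


def tsn_sample_indices(num_frames, num_segments, train=True, rng=None):
    """Return one frame index per equal-length temporal segment.

    Hypothetical helper illustrating TSN-style sampling; not the
    authors' implementation.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random
    indices = []
    for k in range(num_segments):
        # Integer boundaries of the k-th segment: [start, end)
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        # Keep the segment non-empty when the video has fewer frames than segments
        end = max(end, start + 1)
        if train:
            # Random frame inside the segment: a different clip each epoch
            indices.append(rng.randrange(start, end))
        else:
            # Deterministic centre frame for evaluation
            indices.append((start + end - 1) // 2)
    return indices


if __name__ == "__main__":
    print(tsn_sample_indices(16, 4, train=False))
    print(tsn_sample_indices(16, 4, train=True, rng=random.Random(0)))
```

Because the training-time indices change on every call, the same video yields slightly different frame sets across epochs while still covering its full duration.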

updated: Thu Oct 14 2021 05:05:39 GMT+0000 (UTC)

published: Mon Oct 11 2021 12:13:22 GMT+0000 (UTC)

arXiv
