Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

Fuchen Long; Ting Yao; Zhaofan Qiu; Xinmei Tian; Jiebo Luo; Tao Mei


Leveraging large volumes of web videos paired with their search queries or surrounding text (e.g., titles) offers an economical and extensible alternative to supervised video representation learning. Nevertheless, modeling such weak visual-textual connections is not trivial because of query polysemy (i.e., a query can have many possible meanings) and text isomorphism (i.e., different texts can share the same syntactic structure). In this paper, we introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning. Specifically, we present Bi-Calibration Networks (BCN), which couple two calibrations to learn the amendment from text to query and vice versa. Technically, BCN clusters the titles of all videos retrieved by an identical query and takes the centroid of each cluster as a text prototype. The query vocabulary is built directly on query words. The video-to-text and video-to-query projections over the text prototypes and the query vocabulary then drive the text-to-query and query-to-text calibrations, which estimate the amendment to the query or the text. We also devise a selection scheme to balance the two corrections. Two large-scale web video datasets, named YOVO-3M and YOVO-10M, in which each video is paired with its query and title, are newly collected for weakly-supervised video representation learning. The video features that BCN learns on 3M web videos obtain superior results under the linear model protocol on downstream tasks. More remarkably, BCN trained on the larger set of 10M web videos and then fine-tuned yields gains of 1.6% and 1.8% in top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively, over the state-of-the-art TDN and ACTION-Net methods with ImageNet pre-training. Source code and datasets are available at https://github.com/FuchenUSTC/BCN.
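The text-prototype step described above (cluster the titles of videos retrieved by the same query, then take each cluster centroid as a prototype) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the function names `kmeans_centroids` and `text_prototypes`, the cluster count `k`, and the idea that each title has already been encoded as a fixed-length embedding vector are all hypothetical choices made for illustration.

```python
from collections import defaultdict

import numpy as np


def kmeans_centroids(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(points))  # cannot have more clusters than points
    # Initialise centroids from k distinct points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def text_prototypes(queries, title_embeddings, k=2):
    """Group title embeddings by query; each cluster centroid is one prototype."""
    groups = defaultdict(list)
    for query, emb in zip(queries, title_embeddings):
        groups[query].append(emb)
    return {q: kmeans_centroids(np.stack(embs), k) for q, embs in groups.items()}


if __name__ == "__main__":
    # Toy 2-D "title embeddings"; real embeddings would come from a text encoder.
    queries = ["playing guitar"] * 4 + ["cooking"] * 2
    embeddings = np.array(
        [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [5.0, 5.0], [5.0, 6.0]]
    )
    for query, protos in text_prototypes(queries, embeddings, k=2).items():
        print(query, protos.shape)
```

Each query maps to an array of shape `(k, d)`, one row per prototype. The video-to-text projection mentioned in the abstract would then score a video against these rows, which is outside the scope of this sketch.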

updated: Tue Jun 21 2022 16:02:12 GMT+0000 (UTC)

published: Tue Jun 21 2022 16:02:12 GMT+0000 (UTC)

arXiv
