Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

Guanyu Cai; Yixiao Ge; Binjie Zhang; Alex Jinpeng Wang; Rui Yan; Xudong Lin; Ying Shan; Lianghua He; Xiaohu Qie; Jianping Wu; Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research has become extremely expensive because it requires massive data and long training times, which prevents further exploration. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy, democratizing VLP research while achieving state-of-the-art results. Specifically, to fully explore the potential of region features, we introduce a novel bidirectional region-word alignment regularization that properly optimizes the fine-grained relations between regions and certain words in sentences, eliminating the domain/modality disconnections between pre-extracted region features and text. Extensive results on downstream video-language retrieval tasks across four datasets demonstrate the superiority of our method in both effectiveness and efficiency; e.g., our method achieves competitive results with 80% less data and 85% less pre-training time compared to the most efficient VLP method so far (lei2021less). The code will be available at https://github.com/showlab/DemoVLP.
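The abstract does not spell out the form of the bidirectional region-word alignment, so the following is only an illustrative sketch of what a two-directional score between region features and word embeddings could look like, not the paper's actual regularizer. It assumes cosine similarity, a max-then-mean aggregation in each direction, and an equal-weight average of the two directions; the function names `cosine` and `bidirectional_alignment` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def bidirectional_alignment(regions, words):
    """Hypothetical alignment score between region features and word embeddings.

    Assumption (not from the paper): each region is matched to its most
    similar word, each word to its most similar region, and the two mean
    scores are averaged with equal weight.
    """
    # Similarity matrix: one row per region, one column per word.
    sim = [[cosine(r, w) for w in words] for r in regions]

    # Region -> word direction: best-matching word for every region.
    region_to_word = sum(max(row) for row in sim) / len(regions)

    # Word -> region direction: best-matching region for every word.
    word_to_region = sum(
        max(sim[i][j] for i in range(len(regions))) for j in range(len(words))
    ) / len(words)

    return 0.5 * (region_to_word + word_to_region)


# Toy 2-D features: two regions and three words (illustrative values only).
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = bidirectional_alignment(regions, words)

# A pair whose regions point away from every word scores lower.
mismatched = bidirectional_alignment([[-1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a training setting, a score of this kind would typically be turned into a loss term (for example, by contrasting matched and mismatched video-sentence pairs); how the paper does this is not stated in the abstract.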

updated: Tue Feb 07 2023 07:54:51 GMT+0000 (UTC)

published: Tue Mar 15 2022 08:18:27 GMT+0000 (UTC)

arXiv
