MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

Yuying Ge; Yixiao Ge; Xihui Liu; Alex Jinpeng Wang; Jianping Wu; Ying Shan; Xiaohu Qie; Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval. Two separate encoders are used to contrast global video and text representations, but this design ignores detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also significantly surpasses the baseline models on zero-shot action recognition, which can be cast as video-to-text retrieval.
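
The masked-patch objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the snapshot encoder evolves or which regression loss is used, so the exponential-moving-average update, the cosine-distance loss, and all function names below (`mask_patches`, `masked_feature_loss`, `ema_update`) are assumptions chosen for illustration. Per-patch features are represented as plain Python lists to keep the example dependency-free.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mask_patches(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the online video encoder."""
    k = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), k))


def masked_feature_loss(pred, target, masked_idx):
    """Mean (1 - cosine similarity) over the masked patches only.

    pred:   per-patch features predicted by the online encoder from the
            corrupted video.
    target: per-patch features from the snapshot encoder on the full video
            (the "tokenizer" output). The cosine form is an assumption.
    """
    if not masked_idx:
        return 0.0
    total = 0.0
    for i in masked_idx:
        p = l2_normalize(pred[i])
        t = l2_normalize(target[i])
        total += 1.0 - sum(a * b for a, b in zip(p, t))
    return total / len(masked_idx)


def ema_update(snapshot, online, momentum):
    """Assumed update rule: move snapshot parameters toward the online ones."""
    return [momentum * s + (1.0 - momentum) * o
            for s, o in zip(snapshot, online)]


# Hypothetical usage: 8 patches, half of them masked.
rng = random.Random(0)
masked = mask_patches(num_patches=8, mask_ratio=0.5, rng=rng)
target = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
pred = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
loss = masked_feature_loss(pred, target, masked)
```

In this sketch only the masked positions contribute to the loss, so the online encoder has to infer the hidden patches from the visible ones. Because the targets come from an encoder that is itself trained with video-text contrast, they would carry text-aligned semantics rather than raw pixel values.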

updated: Tue Apr 26 2022 16:06:31 GMT+0000 (UTC)

published: Tue Apr 26 2022 16:06:31 GMT+0000 (UTC)

arXiv
