LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

Jinbin Bai; Chunhui Liu; Feiyue Ni; Haofan Wang; Mengying Hu; Xiaofeng Guo; Lele Cheng

Video-text retrieval is a class of cross-modal representation learning problems in which the goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, and has demonstrated the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space might distort the information within each modality. To overcome this issue, we present a novel mechanism that learns the translation relationship from a source modality space S to a target modality space T without the need for a joint latent space, bridging the gap between the visual and textual domains. Furthermore, to keep the translations cycle-consistent, we adopt a cycle loss involving both forward translations from S to the predicted target space T' and backward translations from T' back to S. Extensive experiments conducted on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
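
The cycle loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the form of the translators or the distance measure, so the example assumes plain linear maps and mean squared error, and every name in it (translate, cycle_loss, retrieve) is hypothetical. It shows a forward translation from S to the predicted target space T', a backward translation from T' to S, and a retrieval step that picks the candidate closest to the translated query.

```python
# Minimal sketch of a cycle-consistency loss between two embedding spaces.
# Assumptions (not from the paper): translators are plain linear maps,
# the distance is mean squared error, and all names are hypothetical.
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def translate(weights: Matrix, x: Vector) -> Vector:
    """Apply a linear translator: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cycle_loss(forward: Matrix, backward: Matrix, s: Vector) -> float:
    """Translate s from S to the predicted target space T', map it back
    to S, and measure how far the reconstruction is from the original s."""
    t_pred = translate(forward, s)        # forward translation: S -> T'
    s_rec = translate(backward, t_pred)   # backward translation: T' -> S
    return mse(s, s_rec)


def retrieve(forward: Matrix, query: Vector, candidates: List[Vector]) -> int:
    """Return the index of the candidate closest to the translated query."""
    t_pred = translate(forward, query)
    distances = [mse(t_pred, c) for c in candidates]
    return distances.index(min(distances))
```

In this toy setting the loss is zero exactly when the backward map undoes the forward map on the given input. In a trained system, such a term would presumably be minimized alongside a retrieval objective, though the abstract does not say how the two are combined.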

updated: Mon Jul 11 2022 13:37:32 GMT+0000 (UTC)

published: Mon Jul 11 2022 13:37:32 GMT+0000 (UTC)

arXiv
