Object-aware Video-language Pre-training for Retrieval

Alex Jinpeng Wang; Yixiao Ge; Guanyu Cai; Rui Yan; Xudong Lin; Ying Shan; Xiaohu Qie; Mike Zheng Shou

Recently, with the introduction of large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet existing video-language transformer models do not explicitly perform fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends the video-language transformer to incorporate object representations. The key idea is to leverage bounding boxes and object tags to guide the training process. We evaluate our model on three standard sub-tasks of video-text matching on four widely used benchmarks. We also provide in-depth analysis and detailed ablations of the proposed method. We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture. The code will be released at https://github.com/FingerRec/OA-Transformer.

updated: Sat Jan 29 2022 02:49:54 GMT+0000 (UTC)

published: Wed Dec 01 2021 17:06:39 GMT+0000 (UTC)

arXiv
