On Pursuit of Designing Multi-modal Transformer for Video Grounding

Meng Cao; Long Chen; Mike Zheng Shou; Can Zhang; Yuexian Zou

Video grounding aims to localize the temporal segment corresponding to a sentence query in an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) top-down models, which predefine a set of segment candidates and then perform segment classification and regression; and 2) bottom-up models, which directly predict frame-wise probabilities of the referential segment boundaries. However, none of these methods is end-to-end, i.e., they all rely on time-consuming post-processing steps to refine predictions. To address this, we reformulate video grounding as a set prediction task and propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction. To facilitate end-to-end training, we use a Cubic Embedding layer to transform raw videos into a set of visual tokens. To better fuse the two modalities in the decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is optimized via a Many-to-One matching loss. Furthermore, we conduct comprehensive studies to investigate different model design choices. Extensive results on three benchmarks validate the superiority of GTR: all three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.
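As a rough illustration of what turning a raw video into a set of visual tokens can look like, the sketch below splits a small video into spatio-temporal cubes and flattens each cube into one token. This is not the paper's implementation: the function name `cubic_tokens`, the non-overlapping cubes, the single-channel nested-list video, and the cube sizes are all assumptions made for this example, and the learned projection that a real embedding layer would apply to each flattened cube is omitted.

```python
def cubic_tokens(video, kt, kh, kw):
    """Split video[T][H][W] into non-overlapping kt x kh x kw cubes.

    Hypothetical helper for illustration only. Each cube is flattened
    into a single token (a list of kt * kh * kw values). A real
    embedding layer would follow this with a learned linear projection.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % kt or H % kh or W % kw:
        raise ValueError("video dimensions must be divisible by the cube size")

    tokens = []
    for t0 in range(0, T, kt):          # step through time
        for h0 in range(0, H, kh):      # step through height
            for w0 in range(0, W, kw):  # step through width
                cube = [
                    video[t][h][w]
                    for t in range(t0, t0 + kt)
                    for h in range(h0, h0 + kh)
                    for w in range(w0, w0 + kw)
                ]
                tokens.append(cube)
    return tokens


# Toy 4-frame, 4x4-pixel video whose values encode their own position.
toy_video = [[[t * 16 + h * 4 + w for w in range(4)] for h in range(4)]
             for t in range(4)]
toy_tokens = cubic_tokens(toy_video, kt=2, kh=2, kw=2)
print(len(toy_tokens), len(toy_tokens[0]))
```

With these assumed sizes, the toy video yields (4/2) x (4/2) x (4/2) cubes, each holding 2 x 2 x 2 values, so the token count grows with video length and resolution and shrinks as the cubes get larger.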

updated: Mon Sep 13 2021 16:01:19 GMT+0000 (UTC)

published: Mon Sep 13 2021 16:01:19 GMT+0000 (UTC)

arXiv
